Re: [Linaro-validation] Deployment report/fallout

29 Mar 2012


      On Wed, 28 Mar 2012 22:49:51 -0500, Paul Larson paul.larson@linaro.org wrote:
...
On Wed, Mar 28, 2012 at 8:31 PM, Michael Hudson-Doyle <
michael.hudson@linaro.org> wrote:
...
Hi,
Just a quick note about today's deployment.
The good news is that things are all working :)
First off, thanks a LOT for doing this.
Zygmunt did most of the heavy lifting this time, fwiw :)
...
...
There were a few wrinkles along the way.  Some arose because changing
which package provides a module (lava.utils.interface in this case,
which moved from lava-server to its own package) appears to confuse
things a bit.
Once the new code was in place, I prepared to run health check jobs.  I
used the new admin action to mark all 'passing' boards as 'unknown'
health state, cleared out /linaro/image enough to have plenty of disk
space while the health jobs ran and then turned the scheduler back on.
As desired, all boards started running health check jobs.  This meant
that there were 30-odd processes uncompressing disk images, loopback
mounting them and tarring bits of them up -- load went up to 90, and
stayed there for a few hours!  Although the machine was still quite
usable in this state, it's probably not a good situation.  I filed a
couple of bugs about this:
Keep in mind that 24 hours from then they will all kick off health jobs
again.
Yes, that's an excellent point.  Although they will spread out a bit
over time.
...
Perhaps we should set the state to unknown for 1 board every 10
min. or so over the next day so that we can fan them out a bit?
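A minimal sketch of that fan-out, assuming the boards are known by name and that something else (the admin action, say) actually flips each board to 'unknown' at the computed time. The function name `stagger_schedule` and the board names other than origen03 are hypothetical.

```python
from datetime import datetime, timedelta


def stagger_schedule(boards, start, interval=timedelta(minutes=10)):
    """Pair each board with the time at which to mark it 'unknown'.

    The first board is scheduled at `start`, and each later board one
    `interval` after the previous one.
    """
    return [(board, start + i * interval) for i, board in enumerate(boards)]


# Hypothetical board list; only origen03 appears in this thread.
boards = ["origen03", "panda01", "panda02"]
for board, when in stagger_schedule(boards, datetime(2012, 3, 29, 9, 0)):
    print(when.strftime("%H:%M"), board)
```

With 30 boards at 10-minute spacing the last board would be marked 290 minutes after the first, so the health jobs would be spread over roughly five hours rather than all starting at once.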
I guess we should ask the usual question: what are we looking to achieve
here?
Basically, we're trying to make sure that the new code we've deployed
doesn't have any terrible bugs that mean all jobs will fail.
Running the health checks we have defined today on each board is both
too much work and too little: too much, because there's no reason to run
this on _every_ board (one board of each type would be enough), and
too little, because the existing health jobs do not actually run any
tests and so do not exercise large parts of the dispatcher.
(totally thinking aloud here) We could add a device _type_ health
status, where no jobs would run on a device type until a health job had
completed on one device of that type... hmm feels complicated.  Maybe we
should just be stricter about testing things out on staging.
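To make the device-type idea a little more concrete, here is a rough sketch. It is not how the scheduler works today; the class `DeviceTypeHealthGate`, its method names, and the action strings are all made up for illustration. It assumes the scheduler asks the gate what to do each time a board of a given type becomes free.

```python
class DeviceTypeHealthGate:
    """Hold back jobs for a device type until one health check passes."""

    def __init__(self):
        self._healthy = set()    # device types with a passed health check
        self._checking = set()   # device types with a health check running

    def reset(self):
        """Forget all results, e.g. right after deploying new code."""
        self._healthy.clear()
        self._checking.clear()

    def next_action(self, device_type):
        """Decide what a free board of this type should do next."""
        if device_type in self._healthy:
            return "run-job"
        if device_type in self._checking:
            return "wait"
        # First free board of an unchecked type runs the health check.
        self._checking.add(device_type)
        return "run-health-check"

    def health_check_finished(self, device_type, passed):
        """Record the result of the health check for this device type."""
        self._checking.discard(device_type)
        if passed:
            self._healthy.add(device_type)
```

In this sketch a failed health check leaves the type unchecked, so the next free board of that type would try again; whether that is the right policy is an open question.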
...
Another thing that should help with this is the cloudification stuff
that zyga has been doing.
Yeah.  I do also think though that we should make the health checks as
cheap to run as possible -- it's just a waste of time and electricity to
uncompress these images each time.
...
...
1 job on origen03 failed:
http://validation.linaro.org/lava-server/scheduler/job/16951
but because the board was spamming the serial line when the job started,
I haven't actually loaded enough of the log to find the problem
yet... (maybe we should hard reset the board when we terminate a badly
behaving job?).
+1 to that. In general, hard reset is a better option.  As we migrate
toward having a separate hardreset function for the board, we should create
a simple script that does something annoying like beeping and putting up a
message on the console to push the reset button.  This could be used for
testing locally, while the lab machines would simply always use hard
reset.  Alex has been pushing for this, and we've discussed it before.  I
think we have filled in some of the things that we need in order to do it
now, though.
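A sketch of what such a function might look like, assuming each board's configuration can optionally carry a reset command. The function name `hard_reset` and its parameters are hypothetical and not part of the dispatcher.

```python
import subprocess


def hard_reset(board, reset_command=None, notify=print, wait=input):
    """Reset a board automatically if possible, otherwise nag a human.

    reset_command: argument list for a power-cycle tool, if the lab has one.
    notify/wait:   replaceable so the manual path can be exercised in tests.
    """
    if reset_command:
        # Lab machines: always do a real hard reset.
        subprocess.check_call(reset_command)
        return "automatic"
    # Local testing: "\a" rings the terminal bell, then we wait for Enter.
    notify("\a*** Press the reset button on %s, then hit Enter ***" % board)
    wait()
    return "manual"
```

Calling `hard_reset("origen03")` with no reset command would beep, print the message on the console, and block until someone presses Enter.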
Yeah.  Actually maybe power cycling a board should just be the first
thing the dispatcher does...
Cheers,
mwh
