Hi,
Just a quick note about today's deployment.
The good news is that things are all working :)
There were a few wrinkles along the way. Some of them came from changing which package provides a given module (lava.utils.interface in this case, which moved from lava-server to its own package); that appears to confuse things a bit.
Once the new code was in place, I prepared to run health check jobs. I used the new admin action to mark all 'passing' boards as 'unknown' health state, cleared out /linaro/image enough to have plenty of disk space while the health jobs ran and then turned the scheduler back on.
As desired, all boards started running health check jobs. This meant that there were 30-odd processes uncompressing disk images, loopback-mounting them and tarring up parts of them; load went up to 90 and stayed there for a few hours! Although the machine was still quite usable in this state, it's probably not a good situation. I filed a couple of bugs about this:
https://bugs.launchpad.net/lava-scheduler/+bug/967849
https://bugs.launchpad.net/lava-dispatcher/+bug/967877
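For what it's worth, one possible way to cap this kind of load is to bound how many image-preparation steps run at once. The following is only a rough sketch of that idea, not what the scheduler or dispatcher actually does; prepare_image, run_health_checks and MAX_CONCURRENT_PREP are hypothetical names, and the sleep stands in for the real uncompress/mount/tar work.

```python
import threading
import time

# Hypothetical limit on simultaneous image-preparation steps.
MAX_CONCURRENT_PREP = 4

_prep_slots = threading.BoundedSemaphore(MAX_CONCURRENT_PREP)
_counter_lock = threading.Lock()
_active = 0
peak_active = 0


def prepare_image(board):
    """Stand-in for the expensive uncompress/loopback-mount/tar step."""
    global _active, peak_active
    with _prep_slots:  # blocks while MAX_CONCURRENT_PREP steps are running
        with _counter_lock:
            _active += 1
            peak_active = max(peak_active, _active)
        time.sleep(0.01)  # placeholder for the real work
        with _counter_lock:
            _active -= 1
    return board


def run_health_checks(boards):
    """Start one job per board; return the peak number of concurrent preps."""
    threads = [threading.Thread(target=prepare_image, args=(b,)) for b in boards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak_active
```

With something like this, all 30-odd jobs would still be started together, but only a handful would be doing the heavy disk work at any moment.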
Thankfully, the large majority of health jobs passed. We had a few issues:
3 jobs on panda hit the serial truncation thing:
http://validation.linaro.org/lava-server/scheduler/job/16932/log_file
http://validation.linaro.org/lava-server/scheduler/job/16946/log_file
http://validation.linaro.org/lava-server/scheduler/job/16956/log_file
The last was a different command though -- "mount /dev/disk/by-label/testbo" -- could it be getting worse?
1 job on origen02 had its deployment time out for reasons that are not clear:
http://validation.linaro.org/lava-server/scheduler/job/16961/log_file
1 job on origen03 failed:
http://validation.linaro.org/lava-server/scheduler/job/16951
but because the board was spamming the serial line when the job started, I haven't actually loaded enough of the log to find the problem yet... (maybe we should hard reset the board when we terminate a badly behaving job?).
Cheers, mwh