FYI:
https://wiki.linaro.org/Internal/LAVA/Incidents/Reports/2012-08-21-health-ch...
Thanks
Dave
On 21 Aug 2012, at 10:19, Alexander Sack wrote:
On Tue, Aug 21, 2012 at 10:37 AM, Dave Pigott <dave.pigott@linaro.org> wrote:
beaglexm02
http://validation.linaro.org/lava-server/scheduler/job/29737
Absolutely enormous log file. The board was in a very strange state, spewing out loads of exceptions. I logged onto the board and it was still throwing exceptions. I did a hard reset and it came back cleanly. It's not clear why the hard reset didn't work from the LAVA session.
Put back online to retest.
origen04
http://validation.linaro.org/lava-server/scheduler/job/28745
Failed to get root.tgz. I may be missing something, but if you look at http://validation.linaro.org/lava-server/scheduler/job/28745/log_file#entry1... you'll see that it says it's waiting 60 seconds to retry, but it doesn't seem to actually retry. Does anyone have any ideas?
Put back online to retest.
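For reference, the behaviour that log message implies (announce a wait, sleep, then actually try again) could be sketched roughly as below. This is only an illustration under assumptions, not LAVA's actual code: `fetch_with_retry`, its parameters, and the `fetch` callable are hypothetical names made up for the example.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=60, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper, not part of LAVA. `fetch` is any callable that
    returns the downloaded data or raises an exception on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < attempts:
                # Announce the wait, sleep, then loop round and try again.
                print("attempt %d failed, waiting %d seconds to retry"
                      % (attempt, delay))
                sleep(delay)
    # Every attempt failed: report it and keep the last error as the cause.
    raise RuntimeError("all %d attempts failed" % attempts) from last_error
```

Passing a stub for `sleep` (for example `sleep=lambda seconds: None`) lets the loop be exercised without really waiting 60 seconds between attempts.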
panda01-05/09/10/12/14-23
http://validation.linaro.org/lava-server/scheduler/job/29825 (as an example)
This is just odd. It says it couldn't get the Android artefact (http://validation.linaro.org/lava-server/scheduler/job/29825/log_file#entry2...) but it doesn't appear to have even issued a wget!
Looking at the timestamps, something happened between 14:00 UTC and 20:00 UTC that stopped things working. Whatever it was, I'm retesting panda01 to see if it went away, or if (as I suspect) all the other boards will fail when they run their health check.
Could we maintain an easy-to-find track record of what was deployed when? That would also let us attach a checklist that people run through and sign off before pushing the production button (e.g. all health jobs must have succeeded on staging before rolling out).
--
Alexander Sack
Technical Director, Linaro Platform Teams
http://www.linaro.org | Open source software for ARM SoCs
http://twitter.com/#%21/linaroorg - http://www.linaro.org/linaro-blog