Dave Pigott
Validation Engineer
T: +44 1223 40 00 63 | M: +44 7940 45 93 44
Linaro.org │ Open source software for ARM SoCs

On 21 Aug 2012, at 10:19, Alexander Sack wrote:

On Tue, Aug 21, 2012 at 10:37 AM, Dave Pigott <dave.pigott@linaro.org> wrote:
-----------------
beaglexm02
-----------------
http://validation.linaro.org/lava-server/scheduler/job/29737

Absolutely enormous log file. The board was in a very strange state, spewing
out loads of exceptions. Went onto the board and it was still throwing out
exceptions. Did a hard reset and it came back cleanly. Not clear why hard
reset didn't work from the LAVA session.

Put back online to retest.

------------
origen04
------------
http://validation.linaro.org/lava-server/scheduler/job/28745

Failed to get root.tgz. I may be missing something, but if you look at
http://validation.linaro.org/lava-server/scheduler/job/28745/log_file#entry14
you'll see that it says it's waiting 60 seconds to retry, but doesn't seem
to actually retry. Anyone any ideas?

Put back online to retest.
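
For reference, the behaviour I would expect from "waiting 60 seconds to
retry" is roughly the following. This is only a sketch and not the actual
LAVA dispatcher code: fetch_with_retry, its parameters and the fetch
callable are all hypothetical names made up for illustration.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=60, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is a hypothetical callable that downloads the artefact (for
    example root.tgz) and raises OSError on failure. It is an assumption
    made for this sketch, not a LAVA function.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                # Announce the wait, sleep, then go round the loop again
                # so the retry really does happen.
                print("attempt %d failed (%s); waiting %d seconds to retry"
                      % (attempt, exc, delay))
                sleep(delay)
    # Every attempt failed: re-raise the last error for the caller.
    raise last_error
```

The point is that the wait message and the next attempt live in the same
loop, so a log showing the message with no further download attempt would
suggest the loop is being left early.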

--------------------------------------
panda01-05/09/10/12/14-23
--------------------------------------
http://validation.linaro.org/lava-server/scheduler/job/29825 (as an example)

This is just odd. It says it couldn't get the android artefact
(http://validation.linaro.org/lava-server/scheduler/job/29825/log_file#entry22)
but it doesn't appear to have even issued a wget!

Looking at the timestamps, something happened between 14:00 UTC and 20:00 UTC
that stopped things working. Whatever it was, I'm retesting panda01 to see
if it has gone away, or if (as I suspect) all the other boards will fail when
they run their health checks.


Could we maintain an easy-to-find track record of what was deployed
when? This might also help us to attach a checklist that people run
through and sign off before pushing the production button (e.g. all
health jobs must have succeeded on staging before rolling out etc.).

-- 

+1 to that, but in this case, it turns out that the android image we use for testing comes from snapshots, and that particular snapshot was retired yesterday. None of the releases seem stable enough to use for health checks so, for the moment, we're re-baselining on a new working snapshot (liuyq is working on this at the moment).

For the future, we're discussing keeping the health check images cached locally, so that we're not hampered by these issues again, and we can choose when and how to re-baseline.
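
To make the caching idea concrete, here is a minimal sketch of what it could look like. Nothing like this exists in LAVA as far as this thread goes: cached_image, cache_dir and the download parameter are hypothetical names used only for illustration, and the default download simply uses urllib.request.urlretrieve from the Python standard library.

```python
import hashlib
import os
import urllib.request


def cached_image(url, cache_dir, download=urllib.request.urlretrieve):
    """Return a local path for url, downloading it only if not yet cached.

    Hypothetical helper for illustration; download(url, filename) can be
    any callable that writes the image to filename.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Prefix with a hash of the full URL so that two snapshots with the
    # same file name do not collide in the cache.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, digest + "-" + os.path.basename(url))
    if not os.path.exists(path):
        # Download to a temporary name first so that an interrupted
        # transfer never leaves a truncated file under the final name.
        partial = path + ".part"
        download(url, partial)
        os.replace(partial, path)
    return path
```

With something along these lines, a snapshot being retired upstream would not affect the health checks, and re-baselining would just mean pointing the health check job at a new URL once we have decided to move.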

Dave