Dave Pigott <dpigott@gmail.com> writes:
http://staging.validation.linaro.org/scheduler/job/35505/log_file
This is a bit odd. It got confused while we were at the U-Boot prompt, trying to boot the Android test image. It may be a code flaw, though I can't see what it would be, other than that it took 5 minutes to get from reboot to that point.
That's pretty odd, I agree. It seems to be waiting for a $ prompt to appear -- but one has. Still, if it's a 1-in-10k type of event, I propose not worrying too much for now.
Perhaps this calls for a similar approach to booting test images, i.e. if it fails, try a couple more times. It may be an edge case, but it would put our reliability way up. Along with that, we may have to raise the boot timeouts, given we might boot up to 3 times.
Thoughts?
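For concreteness, the retry idea above could look roughly like the sketch below. This is only an illustration, not the dispatcher's actual code: boot_once, BootTimeout, and the 300-second figure are all hypothetical names and values chosen for the example.

```python
class BootTimeout(Exception):
    """Hypothetical error raised when a single boot attempt does not reach a prompt."""


def boot_with_retries(boot_once, attempts=3, per_attempt_timeout=300.0):
    """Try to boot up to `attempts` times.

    `boot_once` is a hypothetical callable that takes a timeout in seconds
    and raises BootTimeout if the image does not come up in time.
    Returns the number of the attempt that succeeded.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            boot_once(per_attempt_timeout)
            return attempt
        except BootTimeout as exc:
            # Remember the failure and try again if attempts remain.
            last_error = exc
    raise RuntimeError("boot failed after %d attempts" % attempts) from last_error


def overall_timeout(attempts, per_attempt_timeout):
    """Worst-case time budget the job needs if every attempt runs to its timeout."""
    return attempts * per_attempt_timeout
```

With 3 attempts at 300 seconds each, the surrounding job timeout would need to allow for at least overall_timeout(3, 300.0) seconds, which is the reason the boot timeouts would have to go up.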
Well, my thinking on health job reliability is this: on some iteration period (a month?), we should gather data on what causes the most health job failures, address that and only that, and then repeat. I think we're now back at the point where we need to deploy what we did last week and patiently gather more data again.
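As a rough sketch of the "gather data, fix the top cause, repeat" loop: assuming each failed health job has been given a short cause label by hand (the labels in the usage note are made up for illustration, not real results), the tally could be as simple as this.

```python
from collections import Counter


def top_failure_cause(causes):
    """Return (cause, count) for the most frequent failure cause, or None if there were none.

    `causes` is a hypothetical list with one short label per failed health job
    collected over the iteration period.
    """
    counts = Counter(causes)
    if not counts:
        return None
    return counts.most_common(1)[0]
```

For example, top_failure_cause(["prompt timeout", "network", "prompt timeout"]) picks out "prompt timeout" as the one thing to address before the next round of data gathering.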
Cheers, mwh