http://staging.validation.linaro.org/scheduler/job/35505/log_file
This is a bit odd. It got confused while we were at the U-Boot prompt, trying to boot the Android test image. It may be a code flaw, though I can't see one; the only unusual thing is that it took 5 minutes to get from reboot to that point. Perhaps this calls for a similar approach to booting test images, i.e. if it fails, try a couple more times. It may be an edge case, but handling it would put our reliability way up. Along with that, we may have to raise the boot timeouts, given that we might boot up to 3 times.
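A minimal sketch of the retry idea, not the dispatcher's actual code. `boot_once`, `BootFailed`, and the 300-second per-attempt timeout are hypothetical placeholders; the point is only that the overall budget has to grow with the number of attempts.

```python
class BootFailed(Exception):
    """Raised by the (hypothetical) boot callable when one attempt fails."""


def boot_with_retries(boot_once, attempts=3, per_attempt_timeout=300):
    """Call boot_once(timeout=...) until it succeeds or attempts run out.

    boot_once is an assumed callable that returns on success and raises
    BootFailed on failure; it is not a real LAVA API.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return boot_once(timeout=per_attempt_timeout)
        except BootFailed as err:
            last_error = err  # remember the failure and try again
    raise BootFailed("boot failed after %d attempts" % attempts) from last_error


def overall_timeout(attempts=3, per_attempt_timeout=300):
    """Worst-case time budget if every attempt uses its full timeout."""
    return attempts * per_attempt_timeout
```

With three attempts, the worst case is three times the single-boot timeout, which is why any enclosing job timeout would need raising too.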
Thoughts?
Dave
Oh, and that was the only failure in the last 24 hours out of 219 jobs (so far). I'm letting staging keep looping until 14:00 UTC because of the 2 hour downtime yesterday, so hopefully we'll then have a true figure for our failure rate.
Thanks
Dave
OK. 24 hours have passed since staging went back up and all boards started looping tests. The results:
Only one failure, the same one as before: http://staging.validation.linaro.org/scheduler/job/35505
If we could fix that one, we'd have had 100%.
So the score at 2:15 was 248/249 passes.
That's > 99.5%. There's no way of knowing how much higher without looping for a week, but I think we can safely say that job failures are now far more likely to be caused by the job than by LAVA, and that a health check failure is likely to point to a problem board. A much better state of affairs.
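As a rough illustration of why a longer run would be needed to pin the figure down, the sketch below computes the observed pass rate from the 248/249 score and a 95% Wilson score interval around it. The choice of the Wilson interval is an assumption made here for illustration, not something the thread used.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Return (low, high) of the Wilson score interval for a pass rate.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


passes, total = 248, 249  # score reported above
rate = passes / total
low, high = wilson_interval(passes, total)
print("observed %.2f%%, 95%% interval %.2f%% to %.2f%%" % (rate * 100, low * 100, high * 100))
```

With a single failure in 249 jobs the interval is still fairly wide, so more looping would narrow it.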
Thanks
Dave
Dave Pigott dpigott@gmail.com writes:
> http://staging.validation.linaro.org/scheduler/job/35505/log_file
> This is a bit odd. It got confused when we were in the u-boot prompt while trying to boot up the android test image. It may be some code flaw, though I can't see what, other than it took 5 minutes to get from reboot to that point.

That's pretty odd, I agree. It seems to be waiting for a $ prompt to appear -- but one has. Still, if it's a 1-in-10k type of event, I propose not worrying too much for now.

> Perhaps this calls for a similar approach to booting test images, i.e. if it fails, try a couple more times. May be an edge case, but would put our reliability way up. Along with that, we may have to up the timeouts for booting, given we might do it 3 times.
> Thoughts?

Well, my thinking on health job reliability is this: we should, on some iteration period (a month?), gather data on what causes most health failures, address that and only that, and then repeat. I think we're now back at the point where we need to deploy what we did last week and patiently gather more data again.
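A small sketch of that loop's "find the biggest cause" step, assuming each failed health job has already been given a cause label by hand. The function name and the labels in the usage lines are hypothetical placeholders, not real data.

```python
from collections import Counter


def top_failure_cause(causes):
    """Return (cause, count, share) for the most frequent cause, or None.

    causes is an iterable of labels, one per failed health job in the
    iteration period.
    """
    counts = Counter(causes)
    if not counts:
        return None
    cause, count = counts.most_common(1)[0]
    return cause, count, count / sum(counts.values())


# Placeholder labels only; real labels would come from reviewing job logs.
period = ["cause-a", "cause-b", "cause-a"]
print(top_failure_cause(period))
```

Fixing only the top entry and then collecting a fresh period of data keeps each iteration focused on whatever currently dominates.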
Cheers, mwh
linaro-validation@lists.linaro.org