OK. 24 hours have passed since staging went back up and all boards started looping tests. The results:
Only one failure, the same one as before: http://staging.validation.linaro.org/scheduler/job/35505
If that one had passed, we'd have had 100%.
So the score at 2:15 was 248/249 passes.
That's > 99.5%. There's no way of knowing how much higher without looping for a week, but I think we can safely say that job failures are now far more likely to be caused by the job than by LAVA, and that a health check failure is likely to point to a problem board. A much better state of affairs.
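
For reference, the arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch: `pass_rate` is a hypothetical helper, not part of LAVA, and the counts are the ones quoted above.

```python
def pass_rate(passed, total):
    """Return the pass rate as a percentage (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Counts quoted above: 248 passes out of 249 jobs.
rate = pass_rate(248, 249)
print(f"pass rate: {rate:.2f}%")  # prints "pass rate: 99.60%"
```

One failure in 249 jobs is a failure rate of about 0.4%, which is where the "> 99.5%" comes from.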
Thanks
Dave
On 6 Nov 2012, at 08:15, Dave Pigott <dave.pigott@linaro.org> wrote:
Oh, and that was the only failure in the last 24 hours, out of 219 jobs (so far). I'm letting staging keep looping until 14:00 UTC because of the 2 hour downtime yesterday, so hopefully we'll then have a true figure for our failure rate.
Thanks
Dave
On 6 Nov 2012, at 08:12, Dave Pigott <dave.pigott@linaro.org> wrote:
http://staging.validation.linaro.org/scheduler/job/35505/log_file
This is a bit odd. It got confused while we were at the u-boot prompt, trying to boot the android test image. It may be a code flaw, though I can't see what; the only thing that stands out is that it took 5 minutes to get from reboot to that point. Perhaps this calls for a similar approach to booting test images, i.e. if the boot fails, try a couple more times. This may be an edge case, but retrying would put our reliability way up. Along with that, we may have to raise the timeouts for booting, given that we might boot up to 3 times.
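
As a rough sketch of the retry idea, and not of how the LAVA dispatcher actually does it: the names `BootError`, `boot_with_retries`, `overall_timeout` and the `boot_once` callable are all hypothetical, and the 300 second per-attempt timeout is only a placeholder value.

```python
class BootError(Exception):
    """Raised when a single boot attempt fails (hypothetical)."""


def boot_with_retries(boot_once, attempts=3, per_attempt_timeout=300):
    """Call boot_once(timeout) up to `attempts` times.

    Returns the number of the attempt that succeeded, or raises
    BootError once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            boot_once(per_attempt_timeout)
            return attempt
        except BootError as exc:
            # Remember the failure and try again.
            last_error = exc
    raise BootError(f"boot failed after {attempts} attempts") from last_error


def overall_timeout(attempts=3, per_attempt_timeout=300):
    """Worst-case time budget (seconds) when every attempt runs to its timeout."""
    return attempts * per_attempt_timeout


# Usage: a stand-in boot function that fails once, then succeeds.
calls = []

def flaky_boot(timeout):
    calls.append(timeout)
    if len(calls) < 2:
        raise BootError("stuck at u-boot prompt")

print(boot_with_retries(flaky_boot))  # prints 2
print(overall_timeout(3, 300))        # prints 900
```

The second function shows why the timeouts would have to go up: with 3 attempts, the worst case for the boot step is three times the per-attempt timeout.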
Thoughts?
Dave