Staging failure


Dave Pigott

6 Nov 2012

8:12 a.m.

http://staging.validation.linaro.org/scheduler/job/35505/log_file

This is a bit odd. It got confused when we were in the u-boot prompt while trying to boot up the android test image. It may be some code flaw, though I can't see what, other than it took 5 minutes to get from reboot to that point. Perhaps this calls for a similar approach to booting test images, i.e. if it fails, try a couple more times. May be an edge case, but would put our reliability way up. Along with that, we may have to up the timeouts for booting, given we might do it 3 times.

Thoughts?

Dave
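
The retry idea in the message above could be sketched roughly as follows. This is only an illustration, not LAVA's dispatcher code: `BootError`, `boot_with_retries`, `boot_once` and `overall_timeout` are hypothetical names, and the 300-second per-attempt timeout is an assumption loosely based on the 5 minutes mentioned above.

```python
class BootError(Exception):
    """Hypothetical error for a boot attempt that fails or times out."""


def boot_with_retries(boot_once, attempts=3, per_attempt_timeout=300.0):
    """Call boot_once(timeout) until it succeeds or attempts run out.

    boot_once is a caller-supplied callable (an assumption here) that
    returns a result on success and raises BootError on failure.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return boot_once(per_attempt_timeout)
        except BootError as exc:
            # Remember the failure and try again.
            last_error = exc
    raise BootError(f"boot failed after {attempts} attempts") from last_error


def overall_timeout(attempts=3, per_attempt_timeout=300.0):
    """Worst-case time budget if every attempt uses its full timeout."""
    return attempts * per_attempt_timeout
```

With three attempts at 300 seconds each, the overall budget has to allow for up to 900 seconds, which is why the boot timeouts would need to go up alongside the retries.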


Dave Pigott

6 Nov

8:15 a.m.

Oh, and that was the only failure in the last 24 hours out of 219 jobs (so far). I'm letting staging run looping until 14:00 UTC because of the 2-hour downtime yesterday, so hopefully we'll then have a true figure for our failure rate.

Thanks

Dave


Dave Pigott

2:18 p.m.

OK. 24 hours have passed since staging went back up and all boards started looping tests. The results:

Only one failure, the same one as before: http://staging.validation.linaro.org/scheduler/job/35505

If we could fix that one, we'd have had 100%.

So the score at 2:15 was 248/249 passes.

That's > 99.5%. There's no way of knowing how much higher without looping for a week, but I think we can safely say that job failures are now far more likely to be because of the job, and not because of LAVA, and that a health check failure is likely to point to a problem board. A much better state of affairs.

Thanks

Dave
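
To make the "no way of knowing how much higher" point concrete, here is a small sketch that computes the pass rate from the 248/249 figure above and puts a rough 95% interval around it. The Wilson score interval is a choice made for this illustration, not something used in the thread, and the function names are hypothetical.

```python
import math


def pass_rate(passes, total):
    """Fraction of jobs that passed."""
    return passes / total


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


rate = pass_rate(248, 249)
low, high = wilson_interval(248, 249)
print(f"pass rate {rate:.4%}, 95% interval {low:.4%} to {high:.4%}")
```

With only 249 jobs and a single failure the interval is still fairly wide, which fits the remark that a longer looping run would be needed to pin the figure down.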


Michael Hudson-Doyle

7 Nov

10:03 p.m.

Dave Pigott <dpigott@gmail.com> writes:

> http://staging.validation.linaro.org/scheduler/job/35505/log_file
>
> This is a bit odd. It got confused when we were in the u-boot prompt while trying to boot up the android test image. It may be some code flaw, though I can't see what, other than it took 5 minutes to get from reboot to that point.

That's pretty odd, I agree. It seems to be waiting for a $ prompt to appear -- but one has. Still, if it's a 1-in-10k type of event, I propose not worrying too much for now.

> Perhaps this calls for a similar approach to booting test images, i.e. if it fails, try a couple more times. May be an edge case, but would put our reliability way up. Along with that, we may have to up the timeouts for booting, given we might do it 3 times.
>
> Thoughts?

Well, so my thinking for health job reliability is this: we should, on some iteration period (a month?), gather data on what causes most health failures, address that and only that, and then repeat. I think we're now back at the point where we need to deploy what we did last week and patiently gather more data again.

Cheers, mwh
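
The "gather data, fix the top cause, repeat" loop proposed above might be supported by something as simple as the tally below. This is a sketch only: `top_failure_cause` is a hypothetical helper, and it assumes each failed health job has already been labelled with a cause by hand or by some other tool.

```python
from collections import Counter


def top_failure_cause(causes):
    """Return (cause, count) for the most frequent cause, or None if empty.

    causes is an iterable of cause labels, one per failed health job.
    """
    counts = Counter(causes)
    if not counts:
        return None
    return counts.most_common(1)[0]
```

For example, given the invented labels `["boot timeout", "network", "boot timeout"]`, the helper returns `("boot timeout", 2)`, and that would be the one cause to address before the next iteration.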

linaro-validation@lists.linaro.org
