Overnight failure


Dave Pigott

10 Nov 2012

9:17 a.m.

Just the one…

panda03

http://validation.linaro.org/lava-server/scheduler/job/38289

Looks like the board locked up just after the startup animation completed. Went onto the board, and it was indeed locked. A hard reset brought it back. Putting it down to a one-off glitch.

Dave


Andy Doan

12 Nov 2012

3:33 p.m.

Thanks for looking into this. With the new "newline" code the failure pattern looked different and I wasn't sure what went wrong.

I think we've had this type of failure occur 3 times in the past week on Panda. I think it's becoming our #2 failure reason (we need more days of data to be sure).

Dave Pigott

3:44 p.m.

I suspect you're right, and it underlines what Michael said the other day: we should define a period (week/month) over which we collect stats, give the biggest problem in that period the attention, and then iterate until the failures are negligibly small, at which point we move to the next highest sample unit (month/quarter -> quarter/year). The data set is now small enough that it might warrant a monthly cycle, but I'm happy to review each week and see where we are.
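As a rough illustration of that review cycle, here is a minimal Python sketch. It assumes failures are logged as (date, reason) pairs; the function names, the reason labels, and the "negligible" threshold are all hypothetical and are not taken from the stats spreadsheet or from LAVA.

```python
from collections import Counter
from datetime import date

# Review periods, shortest first. The ordering is an assumption based on
# the week/month/quarter/year units mentioned in the thread.
PERIODS = ["week", "month", "quarter", "year"]


def top_failure(failures, start, end):
    """Return (reason, count) for the most frequent failure reason
    logged between start and end inclusive, or None if there were none.

    failures is an iterable of (date, reason) pairs.
    """
    counts = Counter(reason for day, reason in failures if start <= day <= end)
    if not counts:
        return None
    return counts.most_common(1)[0]


def next_period(current, failures_in_period, negligible=1):
    """Move to the next longer period once failures are negligibly few.

    The negligible threshold is a made-up default; pick whatever the
    team agrees counts as "negligibly small".
    """
    index = PERIODS.index(current)
    if failures_in_period <= negligible and index < len(PERIODS) - 1:
        return PERIODS[index + 1]
    return current
```

Calling top_failure over the chosen window gives the problem that gets the attention next, and next_period decides when the window is worth widening.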

Thanks

Dave

Andy Doan

3:53 p.m.

And since we have new people on the team, I keep the stats here:

https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

so it's easy to determine what our issues are.

Dave Pigott

3:53 p.m.

Thoughts (conjecture):

We don't do a clean boot, in two ways:

(1) We do a soft reboot.
(2) We use the master image u-boot.

Both of these could, in theory, contribute to the image locking up.

We can fix (1) easily enough, but until we have a reliable sd-mux solution and/or switch to booting from a USB thumb drive, there's not much we can do about (2). It might be worth running a soak test on staging with a fix for (1) to see whether it improves things.
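For what such a soak test might measure, here is a minimal Python sketch. It assumes there is some callable that boots the board once and reports whether it came up; that callable and the function name are hypothetical stand-ins, not part of LAVA.

```python
from typing import Callable


def soak(boot_once: Callable[[], bool], iterations: int) -> float:
    """Boot the board repeatedly and return the fraction of boots that locked up.

    boot_once is a hypothetical callable that performs one boot (soft
    reboot or hard reset, depending on what is under test) and returns
    True if the board came up, False if it locked up.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    lockups = sum(1 for _ in range(iterations) if not boot_once())
    return lockups / iterations
```

Running it once with the current soft reboot and once with the fix for (1), over the same number of iterations, would give two lockup rates to compare.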

Thanks

Dave


linaro-validation@lists.linaro.org
