------------ origen01 ------------ http://validation.linaro.org/lava-server/scheduler/job/32532
wget timeout - restarted.
------------ origen07 ------------ http://validation.linaro.org/lava-server/scheduler/job/32582
Couldn't ping - rechecked and it seems ok now.
------------ origen09 ------------ http://validation.linaro.org/lava-server/scheduler/job/32321
Stream of "timeout waiting for hardware interrupt" messages. When I connected to the board it was still producing them and was locked up. Hard reset and it seems ok now. Put back online to see whether we have a glitch or a failing board.
------------ snowball03 ------------ http://validation.linaro.org/lava-server/scheduler/job/32644
eth0 didn't come up, so a network failure. Went onto the board, restarted networking, and all seems ok. Put back online to re-test.
Dave Pigott
Validation Engineer
T: +44 1223 40 00 63 | M: +44 7940 45 93 44
Linaro.org | Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog
On Wed, Sep 19, 2012 at 10:02 AM, Dave Pigott <dave.pigott@linaro.org> wrote:
> origen01
> http://validation.linaro.org/lava-server/scheduler/job/32532
> wget timeout - restarted.
>
> origen07
> http://validation.linaro.org/lava-server/scheduler/job/32582
> Couldn't ping - rechecked and it seems ok now.
Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
On 19 Sep 2012, at 12:01, Alexander Sack wrote:
> Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
I have a theory, but that's all it is, and I'm trying to think how to check it out. My theory is that this sort of failure happens when there is heavy internal network load, or load on control. To either prove or discount this, I suppose we could run a snooper and see whether failures correspond to high activity, and likewise have something that traces load over time on control, and see if we get any correlation?
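Something along these lines could do the load-tracing half from a cron job on control - just a minimal sketch, assuming Python is available there, and the board name and log path are placeholders:

    # net_probe.py - hypothetical sketch, run every minute from cron on control.
    # Logs the 1-minute load average next to a ping result, so that later
    # failures can be checked against load spikes.
    import os
    import subprocess
    import time

    BOARD = "origen01"                   # placeholder target
    LOG = "/var/log/lava-net-probe.log"  # placeholder path

    load1 = os.getloadavg()[0]
    ok = subprocess.call(["ping", "-c", "3", "-W", "2", BOARD],
                         stdout=open(os.devnull, "w"),
                         stderr=subprocess.STDOUT) == 0
    with open(LOG, "a") as log:
        log.write("%s load=%.2f ping_%s=%s\n"
                  % (time.strftime("%Y-%m-%dT%H:%M:%S"), load1,
                     BOARD, "ok" if ok else "FAIL"))

If the ping failures cluster around high load samples, that would be some evidence for the theory; a tcpdump capture triggered on failure would be the snooper half.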
The snowball one is different, however. It doesn't happen often, but when it does it's because eth0 doesn't come up, so this may be something wrong in the master image. We should ask the ST-E landing team whether there was a known networking issue that got fixed - IIRC they had one at the tail end of last year as well.
Dave
Dave Pigott <dave.pigott@linaro.org> writes:
> [...] likewise have something that traces load over time on control,
Like the graphs on monitis? (Get Andy to give you the password).
Cheers, mwh
On 09/19/2012 06:01 AM, Alexander Sack wrote:
> Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
We attempted to add some debugging in the past, but so far nothing has helped much. The biggest problem we have now is the repeatability of this issue. If you ignore the TC2 failures, which are skewing the results a little right now, we have about a 5% failure rate (for one 2-week period we actually had 1%!). Of that 5%, 50% are network related:
 - pinging control fails
 - downloading *.tgz in master fails
So out of 100 runs we get about 2 "wget" type failures and 2 "ping" type failures. Regardless of how small the number is, it's _half_ of our issues, so we'd get a good bang for our buck by improving it.
We need a way to better investigate what's going on.
On Wed, Sep 19, 2012 at 5:18 PM, Andy Doan <andy.doan@linaro.org> wrote:
> We need a way to better investigate what's going on.
Are we using dhclient, or how are we getting the IP?
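That should be easy to check from a master image; something like this quick sketch (stock Ubuntu paths assumed, and the script name is made up):

    # how_is_eth0_configured.py - hypothetical probe for the master image.
    # Reports whether eth0 has a DHCP stanza in /etc/network/interfaces and
    # whether a dhclient process is currently running.
    import subprocess

    with open("/etc/network/interfaces") as f:
        config = f.read()
    print("eth0 dhcp stanza present: %s" % ("iface eth0 inet dhcp" in config))

    # pgrep exits 0 when a matching process exists
    running = subprocess.call(["pgrep", "dhclient"]) == 0
    print("dhclient running: %s" % running)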
Anyway, I feel that to nail this down we first need to make it reproducible. For that, having a "(re-)connect to eth on ubuntu" test case in our enablement suite that we just run 10000 times in one test job, without rebooting, and seeing if it breaks might do it... (a rough sketch follows below).
Maybe someone from Ricardo's team or naresh/soumya (all three CCed) can help write a simple lava-test that does that and integrate it into our enablement test suite for lava-test? Once we can force a reproduction we have a better chance of nailing this down, and we can ensure that the priority is there to fix this in the kernel etc.
Note: I'm not saying this has to happen now, but we need to improve how we deal with rarely occurring issues in a way that nails them down. Otherwise we will never reach 99.999% healthy job success :).
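For illustration only, a standalone sketch of the loop itself (not wired into lava-test; the interface name, target host, and iteration count are all assumptions):

    # eth_reconnect_stress.py - hypothetical stress loop, run as root on the
    # device under test. Bounces eth0 with ifdown/ifup and checks that
    # control is still reachable, stopping at the first failure.
    import subprocess
    import sys

    IFACE = "eth0"       # assumed interface name
    TARGET = "control"   # assumed always-reachable host
    ITERATIONS = 10000

    for i in range(ITERATIONS):
        subprocess.check_call(["ifdown", IFACE])
        subprocess.check_call(["ifup", IFACE])
        if subprocess.call(["ping", "-c", "1", "-W", "5", TARGET]) != 0:
            print("iteration %d: %s came up but %s is unreachable"
                  % (i, IFACE, TARGET))
            sys.exit(1)
    print("all %d reconnect cycles passed" % ITERATIONS)

Wrapping that in a lava-test definition would then just be a matter of parsing the pass/fail line.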