------------ origen01 ------------ http://validation.linaro.org/lava-server/scheduler/job/32532
wget timeout - restarted.
------------ origen07 ------------ http://validation.linaro.org/lava-server/scheduler/job/32582
Couldn't ping - rechecked and it seems ok now.
------------ origen09 ------------ http://validation.linaro.org/lava-server/scheduler/job/32321
Stream of "timeout waiting for hardware interrupt" messages. When I connected to the board it was still producing them and was locked up. Hard reset and it seems ok now. Put back online to see whether we have a glitch or a failing board.
------------ snowball03 ------------ http://validation.linaro.org/lava-server/scheduler/job/32644
eth0 didn't come up, so a network failure. Went onto the board, restarted networking, and all seems ok. Put back online to re-test.
Dave Pigott
Validation Engineer
T: +44 1223 40 00 63 | M: +44 7940 45 93 44
Linaro.org | Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog
On Wed, Sep 19, 2012 at 10:02 AM, Dave Pigott <dave.pigott@linaro.org> wrote:
> origen01
> http://validation.linaro.org/lava-server/scheduler/job/32532
> wget timeout - restarted.
>
> origen07
> http://validation.linaro.org/lava-server/scheduler/job/32582
> Couldn't ping - rechecked and it seems ok now.
Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
On 19 Sep 2012, at 12:01, Alexander Sack wrote:
> Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
I have a theory, but that's all it is, and I'm trying to think how to check it out. My theory is that this sort of failure happens when there is heavy internal network load, or load on control. To either prove or discount this, I suppose we could run a snooper and see whether failures correspond to high activity, and likewise have something that traces load over time on control, and see if we get any correlation?
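Something along these lines could do the load-tracing half from a cron job on control - just a minimal sketch, assuming Python is available there, and the board name and log path are placeholders:

    # net_probe.py - hypothetical sketch, run every minute from cron on control.
    # Logs the 1-minute load average next to a ping result, so that later
    # failures can be checked against load spikes.
    import os
    import subprocess
    import time

    BOARD = "origen01"                   # placeholder target
    LOG = "/var/log/lava-net-probe.log"  # placeholder path

    load1 = os.getloadavg()[0]
    ok = subprocess.call(["ping", "-c", "3", "-W", "2", BOARD],
                         stdout=open(os.devnull, "w"),
                         stderr=subprocess.STDOUT) == 0
    with open(LOG, "a") as log:
        log.write("%s load=%.2f ping_%s=%s\n"
                  % (time.strftime("%Y-%m-%dT%H:%M:%S"), load1,
                     BOARD, "ok" if ok else "FAIL"))

If the ping failures cluster around high load samples, that would be some evidence for the theory; a tcpdump capture triggered on failure would be the snooper half.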
The snowball one is different, however. It doesn't happen often, but when it does it's because eth0 doesn't come up, so this may be something wrong in the master image. We should ask the ST-E landing team whether there was a known networking issue that got fixed - IIRC they had one at the tail end of last year as well.
Dave
Dave Pigott <dave.pigott@linaro.org> writes:
> [...] likewise have something that traces load over time on control,
Like the graphs on monitis? (Get Andy to give you the password).
Cheers, mwh
On 09/19/2012 06:01 AM, Alexander Sack wrote:
> Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
We attempted to add some debugging in the past, but so far nothing has helped much. The biggest problem we have now is the repeatability of this issue. If you ignore the TC2 failures, which are skewing the results a little right now, we have about a 5% failure rate (for one 2-week period we actually had 1%!). Of that 5%, 50% are network related:
 - pinging control fails
 - downloading *.tgz in master fails
So out of 100 runs we get about 2 "wget" type failures and 2 "ping" type failures. Regardless of how small the number is, it's _half_ of our issues, so we'd get a good bang for our buck by improving it.
We need a way to better investigate what's going on.
On Wed, Sep 19, 2012 at 5:18 PM, Andy Doan <andy.doan@linaro.org> wrote:
> We need a way to better investigate what's going on.
Are we using dhclient, or how are we getting the IP?
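That should be easy to check from a master image; something like this quick sketch (stock Ubuntu paths assumed, and the script name is made up):

    # how_is_eth0_configured.py - hypothetical probe for the master image.
    # Reports whether eth0 has a DHCP stanza in /etc/network/interfaces and
    # whether a dhclient process is currently running.
    import subprocess

    with open("/etc/network/interfaces") as f:
        config = f.read()
    print("eth0 dhcp stanza present: %s" % ("iface eth0 inet dhcp" in config))

    # pgrep exits 0 when a matching process exists
    running = subprocess.call(["pgrep", "dhclient"]) == 0
    print("dhclient running: %s" % running)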
Anyway, I feel that to nail this down we first need to make it reproducible. For that, having a "(re-)connect to eth on ubuntu" test case in our enablement suite that we just run 10000 times in one test job, without rebooting, and seeing if it breaks might do it... (a rough sketch follows below).
Maybe someone from Ricardo's team or naresh/soumya (all three CCed) can help write a simple lava-test that does that and integrate it into our enablement test suite for lava-test? Once we can force a reproduction we have a better chance of nailing this down, and we can ensure that the priority is there to fix this in the kernel etc.
Note: I'm not saying this has to happen now, but we need to improve how we deal with rarely occurring issues in a way that nails them down. Otherwise we will never reach 99.999% healthy job success :).
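For illustration only, a standalone sketch of the loop itself (not wired into lava-test; the interface name, target host, and iteration count are all assumptions):

    # eth_reconnect_stress.py - hypothetical stress loop, run as root on the
    # device under test. Bounces eth0 with ifdown/ifup and checks that
    # control is still reachable, stopping at the first failure.
    import subprocess
    import sys

    IFACE = "eth0"       # assumed interface name
    TARGET = "control"   # assumed always-reachable host
    ITERATIONS = 10000

    for i in range(ITERATIONS):
        subprocess.check_call(["ifdown", IFACE])
        subprocess.check_call(["ifup", IFACE])
        if subprocess.call(["ping", "-c", "1", "-W", "5", TARGET]) != 0:
            print("iteration %d: %s came up but %s is unreachable"
                  % (i, IFACE, TARGET))
            sys.exit(1)
    print("all %d reconnect cycles passed" % ITERATIONS)

Wrapping that in a lava-test definition would then just be a matter of parsing the pass/fail line.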