We need a better way to investigate what's going on.
Are we using dhclient or how are we getting the IP?
Anyway, I feel that to nail this down we first need to make it reproducible. For that, having a "(re-)connect to eth on ubuntu" test case in our enablement suite that we just run 10000 times in a test job without rebooting, to see if it breaks, might do it...
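To make the idea concrete, here is a rough sketch of such a loop in Python. It is only an illustration under assumptions: the function names are made up, it assumes the interface is called eth0 and is managed by ifdown/ifup (if the image uses dhclient or something else, swap that in), and the ping target is a placeholder address. It is not an existing lava-test case.

```python
import subprocess
import time


def run_ok(cmd):
    """Run a command and return True when it exits with status 0."""
    return subprocess.call(cmd) == 0


def reconnect_loop(iterations, reconnect, check, settle_seconds=0.0):
    """Call reconnect() and then check() `iterations` times without rebooting.

    Returns the list of iteration indexes where either step failed.
    """
    failures = []
    for i in range(iterations):
        ok = reconnect()
        if ok and settle_seconds:
            # Give the link and the address assignment time to settle.
            time.sleep(settle_seconds)
        if not (ok and check()):
            failures.append(i)
    return failures


def eth_reconnect(iface="eth0"):
    # Assumption: ifupdown manages the interface on the image.
    return run_ok(["ifdown", iface]) and run_ok(["ifup", iface])


def ping_check(host="192.168.1.1"):
    # Assumption: placeholder address; use the control host of the lab.
    return run_ok(["ping", "-c", "1", "-W", "5", host])
```

A test job would then call something like reconnect_loop(10000, eth_reconnect, ping_check, settle_seconds=1.0) as root and report the returned list. Keeping the reconnect and check steps as plain callables means the same loop can later be pointed at a "wget" style check as well.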
Maybe someone from Ricardo's team or naresh/soumya (all three CCed) can help write a simple lava-test that does that and integrate that in our enablement testsuite for lava-test? Once we can force reproduction we have a better chance to nail this down and we can ensure that priority is there to fix this in kernel etc.
Note: not saying this has to happen now, but we need to improve how we deal with rarely occurring issues so that we can actually nail them down. Otherwise we will never reach 99.999% health job success :).
On Wed, Sep 19, 2012 at 5:18 PM, Andy Doan andy.doan@linaro.org wrote:
On 09/19/2012 06:01 AM, Alexander Sack wrote:
Do we know why we have regular networking issues on master images still? Can we have an effort to nail this down? How can we do that?
We attempted to add some debugging in the past, but so far nothing has helped much. The biggest problem we have now is repeatability of this issue. If you ignore the TC2 failures which are skewing the results a little now, we have about a 5% failure rate (for a 2 week period we actually had 1%!). Of that 5%, 50% are network related:
- pinging control fails
- downloading *.tgz in master fails
So out of 100 runs we get about 2 "wget" type failures and 2 "ping" type failures. Regardless of how small the number is, it's _half_ of our issues, so we do get a good bang for our buck by improving it.