Health checks


Dave Pigott

9 Nov 2012, 1:37 p.m.

Ignoring the one failure while I was getting tc2 up and running, we have the following:

------------ panda06 ------------
http://validation.linaro.org/lava-server/scheduler/job/38176

The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.

I've put it back online to retest.

Thanks

Dave


Andy Doan

12 Nov 2012, 3:38 p.m.

On 11/09/2012 07:37 AM, Dave Pigott wrote:

> Ignoring the one failure while I was getting tc2 up and running, we have the following:
>
> panda06
> http://validation.linaro.org/lava-server/scheduler/job/38176
>
> The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
>
> I've put it back online to retest.

I think this is now our #1 failure issue in LAVA. We've looked at this in the past, added debugging, made hypotheses. However, we really haven't gotten to the bottom of this.

One data point I can add. When this happens, I've logged onto control and run wget on the failed URL and it works. So, this doesn't appear to be related to Apache or server load. I *think* I've also done wgets from another system in the lab. So, I don't think it's a network/router thing either.

We already have some retry logic there, but maybe we need something more sophisticated? (my gut says "no")

-andy

Michael Hudson-Doyle

12 Nov 2012, 8:19 p.m.

Andy Doan <andy.doan@linaro.org> writes:

> On 11/09/2012 07:37 AM, Dave Pigott wrote:
>> Ignoring the one failure while I was getting tc2 up and running, we have the following:
>>
>> panda06
>> http://validation.linaro.org/lava-server/scheduler/job/38176
>>
>> The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
>>
>> I've put it back online to retest.
>
> I think this is now our #1 failure issue in LAVA. We've looked at this in the past, added debugging, made hypotheses. However, we really haven't gotten to the bottom of this.
>
> One data point I can add. When this happens, I've logged onto control and run wget on the failed URL and it works. So, this doesn't appear to be related to Apache or server load. I *think* I've also done wgets from another system in the lab. So, I don't think it's a network/router thing either.

Thanks for checking this. My gut already said "duff networking in the master image" but nice to have some data.

> We already have some retry logic there, but maybe we need something more sophisticated? (my gut says "no")

Well. Rebooting the master image would almost certainly fix it. Don't know how to detect when that is the thing to do though. Interesting that we didn't see this on staging at all -- is it concentrated on particular boards? It might have a hardware aspect.

Cheers, mwh
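
Editor's sketch of the two ideas discussed above: retrying a download, and telling a transient failure apart from the "every single retry is reset" pattern that might call for rebooting the master image. This is not LAVA's actual retry code. The `fetch` callable, the `PersistentResetError` name, and the backoff values are all assumptions made for illustration, and treating "all attempts reset" as a reboot signal is only one possible heuristic.

```python
import time


class PersistentResetError(Exception):
    """Raised when every attempt ended in a connection reset."""


def download_with_retries(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is a hypothetical zero-argument callable that performs the
    download and raises ConnectionResetError on "connection reset by peer".
    Any other exception propagates unchanged.
    """
    resets = 0
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionResetError:
            resets += 1
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt)
    # Nothing but resets was seen, so more retries are unlikely to help;
    # the caller can treat this as a hint that the client side is unhealthy.
    raise PersistentResetError(f"{resets} consecutive connection resets")
```

A caller could catch `PersistentResetError` separately from other failures and, for example, mark the board for a reboot before the job is retried. Whether that matches the real failure mode is exactly the open question in this thread.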

Andy Doan

13 Nov 2012, 4:31 p.m.

On 11/12/2012 02:19 PM, Michael Hudson-Doyle wrote:

> Andy Doan <andy.doan@linaro.org> writes:
>> On 11/09/2012 07:37 AM, Dave Pigott wrote:
>>> Ignoring the one failure while I was getting tc2 up and running, we have the following:
>>>
>>> panda06
>>> http://validation.linaro.org/lava-server/scheduler/job/38176
>>>
>>> The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
>>>
>>> I've put it back online to retest.
>>
>> I think this is now our #1 failure issue in LAVA. We've looked at this in the past, added debugging, made hypotheses. However, we really haven't gotten to the bottom of this.
>>
>> One data point I can add. When this happens, I've logged onto control and run wget on the failed URL and it works. So, this doesn't appear to be related to Apache or server load. I *think* I've also done wgets from another system in the lab. So, I don't think it's a network/router thing either.
>
> Thanks for checking this. My gut already said "duff networking in the master image" but nice to have some data.
>
>> We already have some retry logic there, but maybe we need something more sophisticated? (my gut says "no")
>
> Well. Rebooting the master image would almost certainly fix it. Don't know how to detect when that is the thing to do though. Interesting that we didn't see this on staging at all -- is it concentrated on particular boards? It might have a hardware aspect.

It seems to have happened on panda02 a bit more recently. However, in general I think it's distributed across everything. Just last night we failed downloading system.tar.bz2 for origen01.
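
Editor's sketch of one way to check the "is it concentrated on particular boards?" question with data instead of impressions. The `(board, succeeded)` record format and the 50% threshold are assumptions for illustration; they are not LAVA's job schema or an agreed criterion.

```python
from collections import Counter


def failures_per_board(records):
    """Count download failures per board.

    records is an iterable of (board, succeeded) pairs. This format is
    hypothetical and would have to be derived from real job results.
    """
    return Counter(board for board, succeeded in records if not succeeded)


def concentrated_boards(records, threshold=0.5):
    """Return boards that account for more than `threshold` of all failures."""
    counts = failures_per_board(records)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(b for b, n in counts.items() if n / total > threshold)
```

An empty result from `concentrated_boards` would be consistent with failures being spread across the lab, while a single board in the result would point toward a hardware aspect.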


linaro-validation@lists.linaro.org
