Ignoring the one failure while I was getting tc2 up and running, we have the following:
panda06
http://validation.linaro.org/lava-server/scheduler/job/38176
The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
I've put it back online to retest.
Thanks
Dave
On 11/09/2012 07:37 AM, Dave Pigott wrote:
> Ignoring the one failure while I was getting tc2 up and running, we have the following:
> panda06
> http://validation.linaro.org/lava-server/scheduler/job/38176
> The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
> I've put it back online to retest.
I think this is now our #1 failure issue in LAVA. We've looked at this in the past, added debugging, and formed hypotheses. However, we really haven't gotten to the bottom of this.
One data point I can add. When this happens, I've logged onto control and run wget on the failed URL and it works. So, this doesn't appear to be related to Apache or server load. I *think* I've also done wgets from another system in the lab. So, I don't think it's a network/router thing either.
We already have some retry logic there, but maybe we need something more sophisticated? (my gut says "no")
-andy
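For reference, here is a rough sketch of what "something more sophisticated" could look like: exponential backoff between attempts, plus resuming from the last byte received instead of starting over. This is not LAVA's existing retry code. The name `download_with_resume` and the `opener` and `sleep` parameters are hypothetical, and resuming assumes the server honours HTTP Range requests.

```python
import time
import urllib.request


def download_with_resume(url, out, attempts=5, base_delay=1.0,
                         opener=urllib.request.urlopen, sleep=time.sleep):
    """Download url into the seekable binary file object `out`.

    Retries after a connection reset, resuming from the bytes already
    written. Returns the number of bytes written, or re-raises the last
    ConnectionError once every attempt has failed.
    """
    written = 0
    last_error = None
    for attempt in range(attempts):
        request = urllib.request.Request(url)
        if written:
            # Ask only for the remainder (assumes the server supports Range).
            request.add_header("Range", "bytes=%d-" % written)
        try:
            response = opener(request)
            if written and getattr(response, "status", None) != 206:
                # Server ignored the Range header: start again from scratch.
                out.seek(0)
                out.truncate()
                written = 0
            while True:
                chunk = response.read(64 * 1024)
                if not chunk:
                    return written
                out.write(chunk)
                written += len(chunk)
        except ConnectionError as exc:  # includes ConnectionResetError
            last_error = exc
            if attempt < attempts - 1:
                # Back off exponentially: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

If the fault is on the board side, as the wget results from control suggest, backoff and resume would probably only delay the same failure, which fits the gut feeling above.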
Andy Doan <andy.doan@linaro.org> writes:
> On 11/09/2012 07:37 AM, Dave Pigott wrote:
>> Ignoring the one failure while I was getting tc2 up and running, we have the following:
>> panda06
>> http://validation.linaro.org/lava-server/scheduler/job/38176
>> The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
>> I've put it back online to retest.
> I think this is now our #1 failure issue in LAVA. We've looked at this in the past, added debugging, and formed hypotheses. However, we really haven't gotten to the bottom of this.
> One data point I can add. When this happens, I've logged onto control and run wget on the failed URL and it works. So, this doesn't appear to be related to Apache or server load. I *think* I've also done wgets from another system in the lab. So, I don't think it's a network/router thing either.
Thanks for checking this. My gut already said "duff networking in the master image" but nice to have some data.
> We already have some retry logic there, but maybe we need something more sophisticated? (my gut says "no")
Well. Rebooting the master image would almost certainly fix it. Don't know how to detect when that is the thing to do though. Interesting that we didn't see this on staging at all -- is it concentrated on particular boards? It might have a hardware aspect.
Cheers, mwh
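One possible way to approach the "how to detect it" question is sketched below, purely as an illustration. It combines the two signals mentioned in this thread: every recent attempt on the board ends in a connection reset, while the same URL downloads fine from another host such as control. The function name `suspect_master_image` and the `min_resets` threshold are assumptions, not anything that exists in LAVA.

```python
def suspect_master_image(attempt_errors, control_fetch_ok, min_resets=3):
    """Return True when the evidence points at the board, not the server.

    attempt_errors: the exception from each download attempt on the board
        (None for a success), oldest first.
    control_fetch_ok: whether the same URL downloaded fine from another host.
    """
    # Count the unbroken run of connection resets at the end of the history.
    trailing_resets = 0
    for error in reversed(attempt_errors):
        if not isinstance(error, ConnectionResetError):
            break
        trailing_resets += 1
    return control_fetch_ok and trailing_resets >= min_resets
```

A dispatcher could reboot the master image when this returns True and retry once, rather than failing the job outright. Whether a reboot actually clears the problem is still unverified here.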
On 11/12/2012 02:19 PM, Michael Hudson-Doyle wrote:
> Andy Doan <andy.doan@linaro.org> writes:
>> On 11/09/2012 07:37 AM, Dave Pigott wrote:
>>> Ignoring the one failure while I was getting tc2 up and running, we have the following:
>>> panda06
>>> http://validation.linaro.org/lava-server/scheduler/job/38176
>>> The key part is in downloading root.tgz. It gets part way through and then we get "connection reset by peer" on every single retry until we fail.
>>> I've put it back online to retest.
>> I think this is now our #1 failure issue in LAVA. We've looked at this in the past, added debugging, and formed hypotheses. However, we really haven't gotten to the bottom of this.
>> One data point I can add. When this happens, I've logged onto control and run wget on the failed URL and it works. So, this doesn't appear to be related to Apache or server load. I *think* I've also done wgets from another system in the lab. So, I don't think it's a network/router thing either.
> Thanks for checking this. My gut already said "duff networking in the master image" but nice to have some data.
>> We already have some retry logic there, but maybe we need something more sophisticated? (my gut says "no")
> Well. Rebooting the master image would almost certainly fix it. Don't know how to detect when that is the thing to do though. Interesting that we didn't see this on staging at all -- is it concentrated on particular boards? It might have a hardware aspect.
It seems to have happened on panda02 a bit more recently. However, in general I think it's distributed across everything. Just last night we failed downloading system.tar.bz2 for origen01.
linaro-validation@lists.linaro.org