About the network and deployment problem of lava


YongQin Liu

7 Jun 2012

7:20 a.m.

Hi, All

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5. This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

As for the network problem, is it specific to certain boards (like panda01), or is it related to the lab's entire network? Does anyone have any thoughts about it?

Another thing I suggest is that we switch our health job here to use the panda stable images.

Thanks, Yongqin Liu


Alexander Sack

7 Jun 2012

7:33 a.m.


On Thu, Jun 7, 2012 at 9:20 AM, YongQin Liu yongqin.liu@linaro.org wrote:

Hi, All

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5. This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

As for the network problem, is it specific to certain boards (like panda01), or is it related to the lab's entire network? Does anyone have any thoughts about it?

Interesting find! Is it easy to add a retry feature?

What kernel/image are we using as the master image? Are we using the same one for _all_ panda boards?

Otherwise, Dave might want to take a look at what's different. Maybe it's a different SoC rev etc.?

Another thing I suggest is that we switch our health job here to use the panda stable images.

You want to move our panda health jobs to use the build you tested (panda-ics-gcc47-tilt-stable-blob#18)?

-- Alexander Sack Technical Director, Linaro Platform Teams http://www.linaro.org | Open source software for ARM SoCs http://twitter.com/#%21/linaroorg - http://www.linaro.org/linaro-blog

YongQin Liu

8 Jun 2012

2:15 a.m.


On 7 June 2012 15:33, Alexander Sack asac@linaro.org wrote:

...

On Thu, Jun 7, 2012 at 9:20 AM, YongQin Liu yongqin.liu@linaro.org wrote:

Hi, All

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5. This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

As for the network problem, is it specific to certain boards (like panda01), or is it related to the lab's entire network? Does anyone have any thoughts about it?

Interesting find! Is it easy to add a retry feature?

What kernel/image are we using as the master image? Are we using the same one for _all_ panda boards?

Otherwise, Dave might want to take a look at what's different. Maybe it's a different SoC rev etc.?

Yeah, we can ask Dave about that later.

We can also wait a week and watch the trend. If the same error keeps occurring, maybe we can find the cause by then.

Thanks, Yongqin Liu

Another thing I suggest is that we switch our health job here to use the panda stable images.

You want to move our panda health jobs to use the build you tested (panda-ics-gcc47-tilt-stable-blob#18)?

Andy Doan

7 Jun 2012

7:40 p.m.

On 06/07/2012 02:20 AM, YongQin Liu wrote:

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5. This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

As for the network problem, is it specific to certain boards (like panda01), or is it related to the lab's entire network? Does anyone have any thoughts about it?

Another thing I suggest is that we switch our health job here to use the panda stable images.

I'm more than ready to update our health job to use this image. At a minimum, it's an improvement over what we have now. Last time we talked, you were hesitant to make the change. However, it sounds like you feel good about it now.

Please bring this up in our meeting later today (if I don't). I'd like to make the change as soon as possible.

-andy

Michael Hudson-Doyle

10:01 p.m.


YongQin Liu yongqin.liu@linaro.org writes:

Hi, All

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5.

Makes sense. I think waiting 5 minutes between retries is probably a touch excessive too; maybe we should scale that down as well.

This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

I've seen this happen a few times too.

As for the network problem, is it specific to certain boards (like panda01), or is it related to the lab's entire network? Does anyone have any thoughts about it?

I don't know what's going on at all. There is flakiness somewhere, but I don't know if it's in the panda master kernel, the hardware of some of the pandas, or somewhere in our lab's setup. It's interesting that it happens more on particular devices though, which suggests a more hardware-side problem (whether with the panda, a loose cable, or something else).

Another thing I suggest is that we switch our health job here to use the panda stable images.

+1

Cheers, mwh
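
The per-device observation above can be checked with a simple tally. The following is a minimal sketch that uses only the failure counts reported in this thread (3 on panda01, 1 on panda06); the helper name `failures_by_board` is hypothetical. Note that the thread does not say how many of the 100 jobs each board ran, so the counts alone do not give a per-board failure rate.

```python
from collections import Counter

# Failed jobs as reported in this thread: 3 on panda01 and 1 on panda06.
failed_boards = ["panda01", "panda01", "panda01", "panda06"]


def failures_by_board(boards):
    """Return (board, failure_count) pairs, most failures first."""
    return Counter(boards).most_common()


if __name__ == "__main__":
    for board, count in failures_by_board(failed_boards):
        print(f"{board}: {count} failed job(s)")
```

Repeating the tally after a week of jobs, as suggested earlier in the thread, would show whether the failures keep clustering on the same boards.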

YongQin Liu

8 Jun 2012

2:24 a.m.


On 8 June 2012 06:01, Michael Hudson-Doyle michael.hudson@linaro.org wrote:

...

YongQin Liu yongqin.liu@linaro.org writes:

Hi, All

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5.

Makes sense. I think waiting 5 minutes between retries is probably a touch excessive too; maybe we should scale that down as well.

This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

I've seen this happen a few times too.

I have filed a bug about this. https://bugs.launchpad.net/lava-dispatcher/+bug/1010285

Do you think setting the wait time to 1 minute is OK?

Thanks, Yongqin Liu

As for the network problem, is it specific to certain boards (like panda01), or is it related to the lab's entire network? Does anyone have any thoughts about it?

I don't know what's going on at all. There is flakiness somewhere, but I don't know if it's in the panda master kernel, the hardware of some of the pandas, or somewhere in our lab's setup. It's interesting that it happens more on particular devices though, which suggests a more hardware-side problem (whether with the panda, a loose cable, or something else).

Another thing I suggest is that we switch our health job here to use the panda stable images.

+1

Cheers, mwh

Michael Hudson-Doyle

3:10 a.m.


YongQin Liu yongqin.liu@linaro.org writes:

...

On 8 June 2012 06:01, Michael Hudson-Doyle michael.hudson@linaro.org wrote:

...
YongQin Liu yongqin.liu@linaro.org writes:

Hi, All

I just submitted 100 jobs with panda-ics-gcc47-tilt-stable-blob#18 Android images. 4 of them failed because of a network problem: 3 on panda01 and 1 on panda06. You can see the details here: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AnxpY5uv-BlNdG9zYT...

For the image download problem, I guess we can set the retry count to 5.

Makes sense. I think waiting 5 minutes between retries is probably a touch excessive too; maybe we should scale that down as well.

This time I saw one job on panda01 succeed in downloading the image files on the 2nd try.

I've seen this happen a few times too.

I have filed a bug about this. https://bugs.launchpad.net/lava-dispatcher/+bug/1010285

Thanks.

Do you think setting the wait time to 1 minute is OK?

Yeah, that sounds about right.

Cheers, mwh
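
The settings agreed on in this thread (up to 5 tries, with a 1-minute wait between them) can be sketched roughly as follows. This is only an illustration, not the lava-dispatcher code: the names `retry` and `download_image` are hypothetical, and it assumes that "retry number 5" means 5 attempts in total and that download failures surface as `OSError` (which covers `urllib.error.URLError` and socket timeouts).

```python
import time
import urllib.request
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T],
          attempts: int = 5,
          wait_seconds: float = 60.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action() until it succeeds or the attempts are used up.

    Assumption: 5 attempts in total, 60 seconds between attempts.
    The sleep function is injectable so the helper can be tested
    without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:  # URLError and socket errors derive from OSError
            last_error = exc
            if attempt < attempts:
                # Wait only if another attempt will follow.
                sleep(wait_seconds)
    assert last_error is not None
    raise last_error


def download_image(url: str, dest: str) -> str:
    """Hypothetical helper: fetch url into dest, retrying on network errors."""
    retry(lambda: urllib.request.urlretrieve(url, dest))
    return dest
```

Keeping the wait short matters because the delays add up: with 5 attempts, a 5-minute wait can add up to 20 minutes to a failing job, while a 1-minute wait adds at most 4.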


linaro-validation@lists.linaro.org


participants (4)

Alexander Sack
Andy Doan
Michael Hudson-Doyle
YongQin Liu