Re: [Linaro-validation] deploy timeouts

Dave Pigott

20 Mar 2012

4:54 p.m.

As per Zygmunt's request, adding linaro-validation for discussion.

Dave

Dave Pigott Validation Engineer T: +44 1223 45 00 24 | M +44 7940 45 93 44 Linaro.org │ Open source software for ARM SoCs Follow Linaro: Facebook | Twitter | Blog

On 20 Mar 2012, at 14:33, Zygmunt Krynicki wrote:

...

On 20.03.2012 at 15:22, Dave Pigott wrote:

...
Hi,

I think we need to seriously look at allowing a timeout parameter around "deploy_linaro_image", because vexpress is going to need around 4 hours to deploy a full desktop, and it seems crazy that we should hard code that big a timeout into master.py for *all* platforms.

Could we move that to linaro-validation please?

...
I realise there is an issue as to which part the timeout is applied to and how it's apportioned, but even if we just passed the whole timeout through to each stage, it would be better than what we have at the moment.

Thoughts?

-- Zygmunt Krynicki Linaro Validation Team

Zygmunt Krynicki

20 Mar

5:39 p.m.

New subject: deploy timeouts

On 20.03.2012 at 17:54, Dave Pigott wrote:

...

As per Zygmunt's request, adding linaro-validation for discussion.

Dave

On 20 Mar 2012, at 14:33, Zygmunt Krynicki wrote:

...
On 20.03.2012 at 15:22, Dave Pigott wrote:

...
Hi,

I think we need to seriously look at allowing a timeout parameter around "deploy_linaro_image", because vexpress is going to need around 4 hours to deploy a full desktop, and it seems crazy that we should hard code that big a timeout into master.py for *all* platforms.

Could we move that to linaro-validation please?

...
I realise there is an issue as to which part the timeout is applied to and how it's apportioned, but even if we just passed the whole timeout through to each stage, it would be better than what we have at the moment.

I think that we have two general problems with timeouts:

1) We've pulled most of the initial values out of a hat
2) Timeouts are expressions, not constants.

I'm very glad that with our health jobs we're actually looking at the constants we're using. I'd like to see a more scientific and thorough approach to this problem:

-> Keep a shared google doc spreadsheet with timeouts for various actions that we put in our health jobs
-> Track that per board
-> Track the age and cycle count for each SD we purchase and allocate in the lab
-> Benchmark the SD periodically

Given that data we could turn timeout constants into timeout expressions that can use the following variables:

$normalized_cpu_time
$average_sd_speed

With those, we could say, for example, that deploy_linaro_image takes 2 GB of writes and 30 minutes of normalized CPU time. The expression would then evaluate to actual values for each test job. We could also track how average_sd_speed changes over time.
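
As a rough illustration of what such a timeout expression might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than existing LAVA code: the function name, the cpu_factor and safety_margin parameters, and the units (bytes, seconds, bytes per second). It only shows how a cost of "2 GB of writes plus 30 minutes of normalized CPU" could evaluate to a different timeout on each device.

```python
def deploy_timeout(write_bytes, normalized_cpu_seconds, average_sd_speed,
                   cpu_factor=1.0, safety_margin=2.0):
    """Evaluate a timeout expression for one device, in seconds.

    write_bytes            -- amount of data the action writes to the SD card
    normalized_cpu_seconds -- CPU cost of the action on a reference board
    average_sd_speed       -- measured write speed of this device's card (bytes/s)
    cpu_factor             -- hypothetical slowdown of this board vs. the reference
    safety_margin          -- hypothetical multiplier so normal jitter does not time out
    """
    if average_sd_speed <= 0:
        raise ValueError("average_sd_speed must be positive")
    write_seconds = write_bytes / average_sd_speed
    cpu_seconds = normalized_cpu_seconds * cpu_factor
    return safety_margin * (write_seconds + cpu_seconds)


# Example cost from above: 2 GB of writes and 30 minutes of normalized CPU.
GIB = 1024 ** 3
MIB = 1024 ** 2
fast_card = deploy_timeout(2 * GIB, 30 * 60, average_sd_speed=4 * MIB)
slow_card = deploy_timeout(2 * GIB, 30 * 60, average_sd_speed=0.25 * MIB)
```

With these made-up speeds the same action gets a much longer timeout on the slow card than on the fast one, without hard coding either value.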

Thanks ZK

-- Zygmunt Krynicki Linaro Validation Team

Paul Larson

7:48 p.m.

New subject: deploy timeouts

On Tue, Mar 20, 2012 at 12:39 PM, Zygmunt Krynicki <zygmunt.krynicki@linaro.org> wrote:

...

I think that we have two general problems with timeouts:

We've pulled most of the initial values out of a hat

A hat of trial and error to try to come up with reasonable defaults. We don't want to be waiting for 5 hours for a reboot to happen if it's failed, nor do we want to only give it 3 seconds.

...

Timeouts are expressions, not constants.

Parameters actually, with defaults

...

I'm very glad that with our health jobs we're actually looking at the constants we're using. I'd like to see a more scientific and thorough approach to this problem:

-> Keep a shared google doc spreadsheet with timeouts for various actions that we put in our health jobs
-> Track that per board
-> Track the age and cycle count for each SD we purchase and allocate in the lab
-> Benchmark the SD periodically

Given that data we could turn timeout constants into timeout expressions that can use the following variables:

$normalized_cpu_time
$average_sd_speed

Unfortunately, there are more variables than that. In an ideal world, I would agree with you, but in the case of vexpress, we have an operation that should normally take 30 min. taking more like 5 hours! In this case, the ARM lt is looking into the performance angle to see if there's something that can be improved there. What Dave is trying to get at, though, is that we support a timeout parameter for many other operations, but not for deployment.

First off, the reason we *don't* have a timeout parameter for this operation is that the meaning is a bit ambiguous. Other operations are simpler. For instance, if I tell it the timeout for running a test should be 3600 (seconds... 1 hour), it's clear that if the test takes more than an hour from the time it does lava-test run... to the time it gets a result back and lava-test exits, it should time out. For deployment, though, what does the timeout mean? The time to download the images? The time to create the image? The time to extract the rootfs/bootfs tarballs? The time to push the boot image to the board? The rootfs? (Userdata too, for Android?)

I suppose one thing we could do is make it a *total* timeout. So if we call the deploy action on vexpress and give it a timeout of 5 hours, it first downloads the image with the total timeout, calculates the time used so far, then for the next step subtracts that from the total timeout, and so on. The problem here is obvious, I think. It should never ever take 5 hours to download the image. Even if it's not cached, it shouldn't take that long. So we could still wind up doing something insanely stupid there.

I think a better option is to actually apply it *just* to the portion where we write the image to the card. That's the only part that's done through pexpect I think, so the only one where we can easily apply the timeout anyway. All deployments of image components would share the timeout parameter, so we would only subtract the time spent for each preceding part (boot.tgz, etc). Timeouts are a pain, but unfortunately we're always dealing with some operations that *could* hang at an inopportune time, rather than fail with a proper error.
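
To make the shared timeout concrete, here is a minimal Python sketch of one budget that is consumed by each image component in turn. The class, the function, and the write_component callback are hypothetical names invented for this example and are not taken from the LAVA dispatcher. The idea is that each write step would receive whatever is left of the budget, which could then be handed to something like the timeout argument of a pexpect expect() call.

```python
import time


class DeployTimeoutBudget:
    """One deploy timeout shared by every image component that gets written."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds

    def remaining(self):
        """Seconds left in the budget; raises once the budget is used up."""
        left = self._deadline - self._clock()
        if left <= 0:
            raise TimeoutError("deploy timeout budget exhausted")
        return left


def deploy_components(components, write_component, total_seconds,
                      clock=time.monotonic):
    """Write each component, giving it only the time the earlier ones left over.

    write_component(name, timeout=...) is a hypothetical callback standing in
    for whatever actually pushes boot.tgz, the rootfs, etc. to the card.
    """
    budget = DeployTimeoutBudget(total_seconds, clock)
    for name in components:
        write_component(name, timeout=budget.remaining())
```

Downloading and image creation stay outside this budget in the sketch, so a 5 hour value would only ever apply to the card writes.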

Thanks, Paul Larson

Zygmunt Krynicki

8:54 p.m.

New subject: deploy timeouts

On 20.03.2012 at 20:48, Paul Larson wrote:

...

On Tue, Mar 20, 2012 at 12:39 PM, Zygmunt Krynicki <zygmunt.krynicki@linaro.org> wrote:

I think that we have two general problems with timeouts:

1) We've pulled most of the initial values out of a hat

A hat of trial and error to try to come up with reasonable defaults. We don't want to be waiting for 5 hours for a reboot to happen if it's failed, nor do we want to only give it 3 seconds.

2) Timeouts are expressions, not constants.

Parameters actually, with defaults

I'm very glad that with our health jobs we're actually looking at the constants we're using. I'd like to see a more scientific and thorough approach to this problem:

-> Keep a shared google doc spreadsheet with timeouts for various
actions that we put in our health jobs
-> Track that per board
-> Track the age and cycle count for each SD we purchase and
allocate in the lab
-> Benchmark the SD periodically

Given that data we could turn timeout constants into timeout
expressions that can use the following variables:

$normalized_cpu_time
$average_sd_speed

Unfortunately, there are more variables than that. In an ideal world, I would agree with you, but in the case of vexpress, we have an operation that should normally take 30 min. taking more like 5 hours! In this case, the ARM lt is looking into the performance angle to see if there's something that can be improved there. What Dave is trying to get at, though, is that we support a timeout parameter for many other operations, but not for deployment.

I don't see the problem. If it takes that long on vexpress then $average_sd_speed will be very very low. This will be per-device mind you.

-- Zygmunt Krynicki Linaro Validation Team

Paul Larson

21 Mar

1 a.m.

New subject: deploy timeouts

On Tue, Mar 20, 2012 at 3:54 PM, Zygmunt Krynicki <zygmunt.krynicki@linaro.org> wrote:

...

I don't see the problem. If it takes that long on vexpress then $average_sd_speed will be very very low. This will be per-device mind you.

Ah, I see. So are you suggesting it should be constantly measured for every job and adaptive? Or simply look at a few manually and get an average from a sampling of the data?
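
For the adaptive option, one possible shape is a per-device running estimate that is nudged by every job's measured write speed. The Python sketch below is only an illustration under that assumption; the class name and the smoothing factor alpha are made up for the example, and nothing in the thread settles on this approach.

```python
class SdSpeedEstimate:
    """Exponentially weighted moving average of one device's SD write speed."""

    def __init__(self, initial_speed, alpha=0.2):
        self.speed = float(initial_speed)  # bytes per second
        self.alpha = alpha                 # weight given to the newest job

    def update(self, bytes_written, seconds):
        """Fold the write speed measured during one job into the estimate."""
        if seconds <= 0:
            raise ValueError("seconds must be positive")
        measured = bytes_written / seconds
        self.speed = self.alpha * measured + (1 - self.alpha) * self.speed
        return self.speed
```

The manual option would amount to computing the same average once from a handful of sampled jobs and storing it as a fixed per-device value.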

Michael Hudson-Doyle

12:47 a.m.

New subject: deploy timeouts

On Tue, 20 Mar 2012 16:54:41 +0000, Dave Pigott dave.pigott@linaro.org wrote:

...

...
...
I think we need to seriously look at allowing a timeout parameter around "deploy_linaro_image", because vexpress is going to need around 4 hours to deploy a full desktop, and it seems crazy that we should hard code that big a timeout into master.py for *all* platforms.

Makes sense to me. What Zygmunt says makes sense, but as a stop gap, sure...

Cheers, mwh
