On Tue, Mar 20, 2012 at 12:39 PM, Zygmunt Krynicki <zygmunt.krynicki@linaro.org> wrote:
I think that we have two general problems with timeouts:
- We've pulled most of the initial values out of a hat
A hat of trial and error to try to come up with reasonable defaults. We don't want to be waiting for 5 hours for a reboot to happen if it's failed, nor do we want to only give it 3 seconds.
- Timeouts are expressions, not constants.
Parameters actually, with defaults
I'm very glad that with our health jobs we're actually looking at the constants we're using. I'd like to see a more scientific and thorough approach to this problem:
-> Keep a shared Google Docs spreadsheet with timeouts for various actions that we put in our health jobs
-> Track that per board
-> Track the age and cycle count for each SD we purchase and allocate in the lab
-> Benchmark the SD periodically
Given that data we could turn timeout constants into timeout expressions that can use the following variables:
$normalized_cpu_time
$average_sd_speed
Unfortunately, there are more variables than that. In an ideal world, I would agree with you, but in the case of vexpress, we have an operation that should normally take 30 minutes taking more like 5 hours! In this case, the ARM lt is looking into the performance angle to see if there's something that can be improved there. What Dave is trying to get at, though, is that we support a timeout parameter for many other operations, but not for deployment.
First off, the reason we *don't* have a timeout parameter for this operation is that its meaning is a bit ambiguous. Other operations are simpler. For instance, if I tell it the timeout for running a test should be 3600 (seconds, i.e. 1 hour), it's clear that if the test takes more than an hour from the time it does lava-test run... to the time it gets a result back and lava-test exits, it should time out. For deployment though, what does the timeout mean? The time to download the images? The time to create the image? The time to extract the rootfs/bootfs tarballs? The time to push the boot image to the board? The rootfs? (Userdata too, for Android?)

I suppose one thing we could do is make it a *total* timeout. So if we call the deploy action on vexpress and give it a timeout of 5 hours, it first downloads the image with the total timeout, calculates the time used so far, then for the next step subtracts that from the total timeout, and so on. The problem here is obvious, I think: it should never, ever take 5 hours to download the image. Even if it's not cached, it shouldn't take that long. So we could still wind up doing something insanely stupid there.
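To make the *total* timeout idea concrete, here is a minimal Python sketch of that bookkeeping. It is not dispatcher code: `TimeoutBudget`, `run_steps`, and the convention that each step is a callable receiving the seconds still available are all made up for illustration.

```python
import time


class TimeoutBudget:
    """Track how much of one overall timeout is left across several steps."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds

    def remaining(self):
        """Seconds left before the overall deadline, never below zero."""
        return max(0.0, self._deadline - self._clock())


def run_steps(steps, total_seconds, clock=time.monotonic):
    """Run each (name, func) step, passing it the time still available.

    Hypothetical convention: func(seconds_left) does the work and is
    expected to respect the limit it is given. Raises TimeoutError naming
    the first step that would start with no time left.
    """
    budget = TimeoutBudget(total_seconds, clock)
    for name, func in steps:
        left = budget.remaining()
        if left <= 0:
            raise TimeoutError("no time left before step %r" % name)
        func(left)
```

Note that nothing here stops the first step (the download, say) from being handed the whole 5 hours, which is exactly the weakness described above.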
I think a better option is to apply it *just* to the portion where we write the image to the card. That's the only part that's done through pexpect, I think, so it's the only one where we can easily apply the timeout anyway. All deployments of image components would share the timeout parameter, so we would only subtract the time spent on each preceding part (boot.tgz, etc.). Timeouts are a pain, but unfortunately we're always dealing with some operations that *could* hang at an inopportune time, rather than fail with a proper error.
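A rough sketch of what sharing one timeout across the image-component writes could look like follows. To stay self-contained it uses the standard library's `subprocess.run(..., timeout=...)` instead of pexpect; with pexpect, the remaining value would presumably be passed as the `timeout` argument to `expect()`. The function name `write_components` and the idea of one command per component are assumptions, not how the dispatcher is actually structured.

```python
import subprocess
import time


def write_components(commands, timeout):
    """Run each write command in order under one shared timeout (seconds).

    `commands` maps a component name (e.g. "boot", "rootfs") to an argv
    list; both the mapping and the commands are hypothetical. Returns the
    seconds each component took. Raises subprocess.TimeoutExpired if a
    command outlives what is left of the shared timeout.
    """
    deadline = time.monotonic() + timeout
    spent = {}
    for name, argv in commands.items():
        start = time.monotonic()
        remaining = deadline - start
        if remaining <= 0:
            # Earlier components already used up the whole allowance.
            raise subprocess.TimeoutExpired(argv, timeout)
        # The child is killed if it runs past the remaining allowance.
        subprocess.run(argv, check=True, timeout=remaining)
        spent[name] = time.monotonic() - start
    return spent
```

Only the write phase is covered here; downloading and unpacking would keep whatever limits they already have.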
Thanks, Paul Larson