Re: [Linaro-validation] deploy timeouts

20 Mar 2012

On Tue, Mar 20, 2012 at 12:39 PM, Zygmunt Krynicki <zygmunt.krynicki@linaro.org> wrote:
...
> I think that we have two general problems with timeouts:
>
> We've pulled most of the initial values out of a hat

A hat of trial and error to try to come up with reasonable defaults.  We
don't want to be waiting for 5 hours for a reboot to happen if it's failed,
nor do we want to only give it 3 seconds.
...

> Timeouts are expressions, not constants.

Parameters actually, with defaults
...
> I'm very glad that with our health jobs we're actually looking at the
> constants we're using. I'd like to see a more scientific and thorough
> approach to this problem:
> -> Keep a shared google doc spreadsheet with timeouts for various actions
> that we put in our health jobs
> -> Track that per board
> -> Track the age and cycle count for each SD we purchase and allocate in
> the lab
> -> Benchmark the SD periodically
> Given that data we could turn timeout constants into timeout expressions
> that can use the following variables:
> $normalized_cpu_time
> $average_sd_speed
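
For illustration only, a timeout expression built from those two variables might look roughly like the sketch below. The function name, the units (MB and MB/s), the safety margin, the floor and the formula itself are all assumptions made up for this example; none of it comes from LAVA.

```python
def write_timeout(image_mb, average_sd_speed, normalized_cpu_time=1.0,
                  margin=3.0, floor=60.0):
    """Estimate a timeout (seconds) for writing an image to an SD card.

    Hypothetical parameters, assumed for this sketch:
      image_mb            -- size of the image in megabytes
      average_sd_speed    -- benchmarked write speed of the card in MB/s
      normalized_cpu_time -- relative slowness of the board (1.0 = reference)
      margin              -- safety factor applied to the expected time
      floor               -- never return a timeout shorter than this
    """
    if average_sd_speed <= 0:
        raise ValueError("average_sd_speed must be positive")
    # Expected duration if the card performs as benchmarked.
    expected = image_mb / average_sd_speed * normalized_cpu_time
    # Pad the estimate, but keep a sane minimum for tiny images.
    return max(floor, expected * margin)


if __name__ == "__main__":
    # A 1000 MB image on a card benchmarked at 5 MB/s.
    print(write_timeout(1000, 5.0))
```

The point of the floor and the margin is the same as in the "hat" discussion: neither 3 seconds nor 5 hours, but something derived from measured data.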
Unfortunately, there are more variables than that. In an ideal world, I
would agree with you, but in the case of vexpress, we have an operation
that should normally take 30 min. taking more like 5 hours!  In this case,
the ARM lt is looking into the performance angle to see if there's
something that can be improved there.  What Dave is trying to get at
though, is that we support a timeout parameter for many other operations,
but not for deployment.
First off, the reason we *don't* have a timeout parameter for this
operation is that the meaning is a bit ambiguous.  Other operations are
a bit simpler.  For instance, if I tell it the timeout for running a test
should be 3600 (seconds... 1 hour), it's clear that if the test takes more
than an hour to run from the time it does lava-test run... to the time it
gets a result back and lava-test exits, it should time out.  For deployment
though, what does the timeout mean? The time to download the images? The
time to create the image? The time to extract the rootfs/bootfs tarballs?
The time to push the boot image to the board? The rootfs? (Userdata also,
for Android?)  I suppose one thing we could do is make it a *total* timeout.
So if we call the deploy action on vexpress and give it a timeout of 5
hours, it first downloads the image with a total timeout, calculates the
time used so far, then for the next step we subtract that from the total
timeout, and so on.  The problem here is obvious I think.  It should never
ever take 5 hours to download the image.  Even if it's not cached, it
shouldn't take that long.  So we could still wind up doing something
insanely stupid there.
I think a better option is to actually apply it *just* to the portion where
we write the image to the card.  That's the only part that's done through
pexpect I think, so the only one where we can easily apply the timeout
anyway.  All deployments of image components would share the timeout
parameter, so for each part we would subtract the time already spent on the
preceding parts (boot.tgz, etc).  Timeouts are a pain, but unfortunately we're always
dealing with some operations that *could* hang at an inopportune time,
rather than fail with a proper error.
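
A minimal sketch of that shared timeout, assuming a hypothetical `write_one(name, timeout)` callable that performs a single write (for example by passing `timeout` to a pexpect `expect()` call). The names `write_components` and `write_one` are invented for this example and are not part of the dispatcher.

```python
import time


def write_components(components, write_one, timeout, clock=time.monotonic):
    """Write each image component, sharing one timeout across all writes.

    components -- names of the parts to write, e.g. ["boot.tgz", "root.tgz"]
    write_one  -- hypothetical callable(name, timeout_seconds) doing one write
    timeout    -- total seconds allowed for all writes together
    clock      -- injectable time source, so the logic can be tested
    """
    deadline = clock() + timeout
    for name in components:
        # Time already spent on the preceding parts is subtracted here.
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("no time left to write %r" % name)
        write_one(name, remaining)


if __name__ == "__main__":
    def demo_write(name, seconds):
        print("writing %s with up to %.0f seconds" % (name, seconds))

    write_components(["boot.tgz", "root.tgz"], demo_write, 3600)
```

Downloading and image creation are deliberately left outside this budget, so a slow download cannot eat into the time allowed for writing to the card.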
Thanks,
Paul Larson
