On Thu, 10 Nov 2011 16:48:47 -0600, Paul Larson paul.larson@linaro.org wrote:
On Thu, Nov 10, 2011 at 4:46 PM, Paul Larson paul.larson@linaro.org wrote:
On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle <michael.hudson@linaro.org> wrote: ...
After all this thinking and typing I think I've come to some conclusions:
- There is a category of error where we should just stop. As far as I
know, this isn't really handled today.
- There is another category of error where we should report failure,
but not execute any other actions. This makes me think that having 'report results' as an action is a bit strange -- perhaps the dashboard and bundle stream should really be directly in the job, rather than an action?
ISTR we defined some exceptions in lava as being Fatal, or Non-Fatal - the idea being that there would be subclasses of those to add detail. That way we don't need to decide on every single error, just how far to pass it up before someone can take action on it. The fatal ones of course would be the ones where we just can't reasonably expect to proceed and gain anything from it (ex. image fails to deploy).
Ah yeah, there is a CriticalError class. Come to think of it, I think I'm mostly unsettled[1] by how _other_ errors are handled -- basically I think they should cause the dispatcher to exit, immediately, with code != 0. Instead we currently usually try to send a bundle to the dashboard, which if we're in some bizarro situation often fails and fails in a way that obscures the original problem!
[1] most of the rest is the fact that we currently don't even think about errors in the job file.
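To make the policy above a bit more concrete, here is a rough Python sketch, not the dispatcher's actual code. Only the CriticalError name comes from this thread; DispatcherError, NonFatalError, run_job, the (name, callable) action format and the exit code 1 are all assumptions made up for illustration. The idea is that expected errors get recorded and reported, while anything unexpected makes the dispatcher exit non-zero without attempting to submit a bundle.

```python
import traceback


class DispatcherError(Exception):
    """Hypothetical base class for errors the dispatcher expects."""


class CriticalError(DispatcherError):
    """Fatal: no point running further actions (e.g. image fails to deploy)."""


class NonFatalError(DispatcherError):
    """Hypothetical: this action failed, but later actions may still run."""


def run_job(actions, report):
    """Run (name, callable) pairs in order and return a process exit code.

    `report` stands in for "send a bundle to the dashboard".
    """
    results = []
    try:
        for name, action in actions:
            try:
                action()
                results.append((name, "pass"))
            except NonFatalError as exc:
                # Record the failure and carry on with the next action.
                results.append((name, "fail: %s" % exc))
            except CriticalError as exc:
                # Record the failure and skip all remaining actions.
                results.append((name, "critical: %s" % exc))
                break
    except Exception:
        # Anything else is probably a bug: show the full backtrace and
        # exit non-zero WITHOUT trying to report, so that a failing
        # submission cannot obscure the original problem.
        traceback.print_exc()
        return 1
    report(results)
    return 0
```

A caller would do something like sys.exit(run_job(actions, submit_bundle)), where submit_bundle is again a hypothetical name.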
- I need to write another, probably equally long, email about
dependencies between actions :-)
Ah yes, we spoke a bit about that recently. I'd love to hear your ideas on it.
I'll get my thinking cap on then!
I guess I forgot to add some things to the previous email... I'm mainly interested in 2 things when it comes to any of these errors:
- highlighting them in a way that makes it easy for us to find out when
something goes wrong.
Yes, that's true (a good bit of user focus!). I guess it's worth thinking about who cares about a particular class of error...
category 1 -- bugs -- that'd be us, the validation team
category 2 -- errors in the job file -- whoever submitted the job
category 3 -- failing tests -- whoever submitted the job
category 4 -- tests that fail to even run -- this one is harder to call I guess. probably whoever submitted the tests as a first port of call though, it's kinda similar to a failing test.
category 5 -- board hangs -- depends when it happens, and even what the test is testing. if it's during boot_linaro_image of a kernel ci test, then it should be the kernel team. if it's booting a supposedly known good image for toolchain testing, then it's probably a lab issue. I guess _most_ of the time it's going to be a failure of what is being tested.
category 6 -- infrastructure failure -- hard to say again. l-m-c fails both because of bugs and sometimes because of duff hwpacks.
category 7 -- recoverable errors -- the validation team might care, certainly no one else will.
So that was worth thinking through.
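The list above could be written down as a small lookup table, roughly like this Python sketch. The enum, its member names and first_contact are hypothetical; nothing like this exists in the dispatcher as far as this thread says. Categories 5 and 6 are deliberately left out of the table because who cares about them depends on context.

```python
from enum import Enum


class ErrorCategory(Enum):
    BUG = 1
    JOB_FILE_ERROR = 2
    FAILING_TEST = 3
    TEST_FAILED_TO_RUN = 4
    BOARD_HANG = 5
    INFRASTRUCTURE_FAILURE = 6
    RECOVERABLE_ERROR = 7


# First port of call for each category with a clear owner.
FIRST_CONTACT = {
    ErrorCategory.BUG: "validation team",
    ErrorCategory.JOB_FILE_ERROR: "job submitter",
    ErrorCategory.FAILING_TEST: "job submitter",
    ErrorCategory.TEST_FAILED_TO_RUN: "test submitter",
    ErrorCategory.RECOVERABLE_ERROR: "validation team",
}


def first_contact(category):
    """Return who should look first, or flag the error for triage."""
    return FIRST_CONTACT.get(category, "needs triage")
```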
I think in general errors that should be interpreted by the job submitter should be reported primarily in the dashboard, although what I don't really have a good idea for is how to report errors in the job file.
One way of distinguishing things like l-m-c bugs from duff input data is to look across jobs -- if all deploy steps are failing, something is probably wrong in the lab. This obviously requires a wider view of what's going on than the dispatcher has :-) but perhaps means we should think about hooking the dispatcher up to statsd somehow or other.
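As a very rough sketch of what that might look like: the dispatcher could emit a counter per deploy outcome, and something with the wider view could apply the "all deploy steps are failing" check. The metric name, the min_jobs threshold and all function names below are assumptions; the only non-invented parts are the plain-text statsd counter format (name:value|c) and its conventional UDP port 8125.

```python
import socket


def statsd_counter(metric, value=1):
    """Format a statsd counter datagram, e.g. b"some.metric:1|c"."""
    return ("%s:%d|c" % (metric, value)).encode("ascii")


def send_counter(metric, host="127.0.0.1", port=8125):
    """Fire-and-forget a counter increment to a statsd daemon over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_counter(metric), (host, port))
    finally:
        sock.close()


def lab_probably_broken(deploy_outcomes, min_jobs=3):
    """Guess at a lab problem from recent deploy results across jobs.

    `deploy_outcomes` is a list of booleans (True = deploy succeeded).
    If enough jobs have run and every deploy failed, the input data is
    unlikely to be the common cause.
    """
    return len(deploy_outcomes) >= min_jobs and not any(deploy_outcomes)
```

The dispatcher side would then be a single call such as send_counter("lava.dispatcher.deploy.fail") in the deploy failure path, with the metric name being made up here.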
I think some of this, perhaps, goes along with another conversation about parsing the serial log and splitting it up into sections.
Yeah, and submitting these sections to the dashboard I think (see above -- I don't think people should look at the scheduler pages most of the time).
- capturing the full backtrace (we're better about this now I think, I
have had much less frustration with this lately)
Yeah, I think I mostly squished that problem :-)
Cheers, mwh