error handling and reporting in the dispatcher
michael.hudson at linaro.org
Thu Nov 10 23:16:07 UTC 2011
On Thu, 10 Nov 2011 16:48:47 -0600, Paul Larson <paul.larson at linaro.org> wrote:
> On Thu, Nov 10, 2011 at 4:46 PM, Paul Larson <paul.larson at linaro.org> wrote:
> > On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle <
> > michael.hudson at linaro.org> wrote:
> > ...
> >> After all this thinking and typing I think I've come to some
> >> conclusions:
> >> 1) There is a category of error where we should just stop. As far as I
> >> know, this isn't really handled today.
> >> 2) There is another category of error where we should report failure,
> >> but not execute any other actions. This makes me think that having
> >> 'report results' as an action is a bit strange -- perhaps the
> >> dashboard and bundle stream should really be directly in the job,
> >> rather than an action?
> > ISTR we defined some exceptions in lava as being Fatal, or Non-Fatal -
> > the idea being that there would be subclasses of those to add detail. That
> > way we don't need to decide on every single error, just how far to pass it
> > up before someone can take action on it. The fatal ones of course would be
> > the ones where we just can't reasonably expect to proceed and gain anything
> > from it (ex. image fails to deploy).
Ah yeah, there is a CriticalError class. Come to think of it, I think
I'm mostly unsettled by how _other_ errors are handled -- basically I
think they should cause the dispatcher to exit, immediately, with code
!= 0. Instead we currently usually try to send a bundle to the
dashboard, which if we're in some bizarro situation often fails and
fails in a way that obscures the original problem!
Most of the rest is the fact that we currently don't even think
about errors in the job file.
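A minimal sketch of the split described above, not the dispatcher's actual code. It assumes a hypothetical DispatcherError base class, a NonFatalError sibling of CriticalError, a run_job() helper, and actions modelled as plain callables. Known errors are reported; anything unexpected prints its traceback and exits non-zero without trying to submit a bundle.

```python
import traceback


class DispatcherError(Exception):
    """Hypothetical base class for errors the dispatcher knows how to report."""


class CriticalError(DispatcherError):
    """The job cannot usefully continue (e.g. the image fails to deploy)."""


class NonFatalError(DispatcherError):
    """One action failed, but later actions may still run."""


def run_job(actions, report):
    """Run actions in order and return a process exit code.

    `actions` is a list of zero-argument callables and `report` is a callable
    that receives a list of (action_name, status) tuples. Both are assumptions
    made for this sketch.
    """
    results = []
    for action in actions:
        name = getattr(action, "__name__", repr(action))
        try:
            action()
        except NonFatalError:
            # Record the failure and carry on with the next action.
            results.append((name, "fail"))
        except CriticalError:
            # Report what we have, then stop without running anything else.
            results.append((name, "fail"))
            report(results)
            return 1
        except Exception:
            # Unknown error, most likely a bug: show the original problem and
            # exit non-zero instead of attempting a report that may also fail.
            traceback.print_exc()
            return 2
        else:
            results.append((name, "pass"))
    report(results)
    return 0
```

A caller would pass the return value to sys.exit(), so the exit code alone distinguishes a clean run (0), a reported critical failure (1), and an unexpected error (2).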
> >> 3) I need to write another, probably equally long, email about
> >> dependencies between actions :-)
> > Ah yes, we spoke a bit about that recently. I'd love to hear your ideas
> > on it.
I'll get my thinking cap on then!
> I guess I forgot to add some things to the previous email... I'm mainly
> interested in 2 things when it comes to any of these errors:
> 1. highlighting them in a way that makes it easy for us to find out when
> something goes wrong.
Yes, that's true (a good bit of user focus!). I guess it's worth
thinking about who cares about a particular class of error...
category 1 -- bugs -- that'd be us, the validation team
category 2 -- errors in the job file -- whoever submitted the job
category 3 -- failing tests -- whoever submitted the job
category 4 -- tests that fail to even run -- this one is harder to call,
              I guess. Probably whoever submitted the tests as a first
              port of call though; it's kinda similar to a failing test.
category 5 -- board hangs -- depends when it happens, and even what the
              test is testing. If it's during boot_linaro_image of a
              kernel CI test, then it should be the kernel team. If
              it's booting a supposedly known good image for toolchain
              testing, then it's probably a lab issue. I guess _most_
              of the time it's going to be a failure of what is being
              tested.
category 6 -- infrastructure failure -- hard to say again. l-m-c fails
              because of bugs but also sometimes due to duff
              input data.
category 7 -- recoverable errors -- the validation team might care,
              certainly no one else will.
So that was worth thinking through.
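For illustration only, the ownership list above could be written down as a lookup table. The enum, the constant names, and first_contact() are all made up for this sketch; the categories where the answer depends on context are deliberately left undecided rather than guessed.

```python
from enum import Enum


class ErrorCategory(Enum):
    """The seven categories from the list above (names are hypothetical)."""
    BUG = 1
    JOB_FILE_ERROR = 2
    FAILING_TEST = 3
    TEST_DID_NOT_RUN = 4
    BOARD_HANG = 5
    INFRASTRUCTURE_FAILURE = 6
    RECOVERABLE_ERROR = 7


VALIDATION_TEAM = "validation team"
JOB_SUBMITTER = "job submitter"
TEST_SUBMITTER = "test submitter"
DEPENDS = "depends on context"

# Who should look at each category first.
FIRST_CONTACT = {
    ErrorCategory.BUG: VALIDATION_TEAM,
    ErrorCategory.JOB_FILE_ERROR: JOB_SUBMITTER,
    ErrorCategory.FAILING_TEST: JOB_SUBMITTER,
    ErrorCategory.TEST_DID_NOT_RUN: TEST_SUBMITTER,
    # Kernel team, lab, or whoever owns the thing under test.
    ErrorCategory.BOARD_HANG: DEPENDS,
    # Could be a tool bug or bad input.
    ErrorCategory.INFRASTRUCTURE_FAILURE: DEPENDS,
    ErrorCategory.RECOVERABLE_ERROR: VALIDATION_TEAM,
}


def first_contact(category):
    """Return the first port of call for an error category."""
    return FIRST_CONTACT[category]
```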
I think in general errors that should be interpreted by the job
submitter should be reported primarily in the dashboard, although I
don't really have a good idea for how to report errors in the job file.
One way of distinguishing things like l-m-c bugs from duff input data is
to look across jobs -- if all deploy steps are failing, something is
probably wrong in the lab. This obviously requires a wider view of
what's going on than the dispatcher has :-) but perhaps means we should
think about hooking the dispatcher up to statsd somehow or other.
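As a rough sketch of what hooking up to statsd might look like: statsd accepts plain-text counters of the form "name:value|c" over UDP, by default on port 8125. The helper names and the metric name in the usage note are invented for this example, and nothing here is existing dispatcher code.

```python
import socket


def counter_packet(name, value=1):
    """Format a statsd counter increment, e.g. b"some.metric:1|c"."""
    return "{}:{}|c".format(name, value).encode("ascii")


def count(name, host="localhost", port=8125):
    """Send one counter increment to statsd over UDP (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_packet(name), (host, port))
    finally:
        sock.close()
```

The dispatcher could then call something like count("lava.dispatcher.deploy.fail") whenever a deploy step fails, and a spike in that counter across jobs would point at the lab rather than at any one job's input.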
> I think some of this, perhaps, goes along with another conversation about
> parsing the serial log and splitting it up into sections.
Yeah, and submitting these sections to the dashboard I think (see
above -- I don't think people should look at the scheduler pages most of
the time).
> 2. capturing the full backtrace (we're better about this now I think, I
> have had much less frustration with this lately)
Yeah, I think I mostly squished that problem :-)