I'm working on the dispatcher again, and as usual I'm getting a bit grumpy. Partly this is because the abstractions I added a couple of weeks ago aren't really working, but that's my problem. What I want to complain about is the error handling.
I've always found the error handling in the dispatcher to be a little strange, so in this mail I'm going to try to set out the issue and maybe propose a solution.
A lot of the confusion comes from the fact that there are many things that could be called an error in the dispatcher, and some overloading in the way we try to handle and report errors.
Kinds of error that I can think of immediately:
1) Flat out bugs in the dispatcher
2) Errors in the job file (e.g. typoing an action name)
3) Failing tests
4) Tests that fail to even run
5) The board hanging at some point
6) Infrastructure failures, like l-m-c or deployment failing
7) Recoverable errors, e.g. where we want to be in the master image but might be in the linaro image
Category 1 bugs we can't really do *too* much about -- if we have incorrect code, who's to say that the error handling code will be correct? -- but in general dispatcher bugs should IMO cause the dispatcher to exit with an exit code != 0 (we talked at the connect about having the scheduler display jobs that exited like this differently). Currently because we have lots of "except:"s all over the place, I think we risk swallowing genuine bugs (indeed, we had one of these for a while, where we said "self.in_master_shell()" instead of "self.client.in_master_shell()" but the AttributeError was caught).
If we hit this sort of thing, I don't think we should try to continue to execute other actions and in particular I'm not too bothered if this kind of error does not result in a bundle being submitted to the dashboard.
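To make the failure mode concrete, here is a minimal sketch (the class and function names are made up for illustration, this is not the actual dispatcher code) of how a bare "except:" turns a typo like the in_master_shell one into just another "failed" action, versus only catching the exceptions we actually anticipate so that genuine bugs take the dispatcher down with a non-zero exit code:

    # Illustrative sketch only -- the names are invented, not the dispatcher's.
    import sys

    class CriticalError(Exception):
        """An anticipated, fatal failure (deploy failed, board lost, ...)."""

    class LavaClient(object):
        def in_master_shell(self):
            pass

    class BootAction(object):
        def __init__(self, client):
            self.client = client

        def run(self):
            # The bug from above: should be self.client.in_master_shell(),
            # so this raises AttributeError -- a bug, not a test failure.
            self.in_master_shell()

    def run_action_today(action):
        # Roughly what a lot of the current code does: the AttributeError
        # is swallowed and reported as just another failed action.
        try:
            action.run()
        except:
            print("action failed, continuing")

    def run_action_proposed(action):
        # Only anticipated failures are handled; a genuine bug propagates
        # and the dispatcher exits with code != 0.
        try:
            action.run()
        except CriticalError as exc:
            print("job failed: %s" % exc)
            sys.exit(1)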
Category 2 -- errors in the job file -- we should try to detect as early as possible. Maybe we should even create the action objects and do a simple argument check before doing anything else? I think this sort of error should also result in an exit code != 0. To the extent that it's reasonable, invalid job files should be rejected by the submit_job API call.
As with category 1 bugs, I don't think we should try to continue to execute other actions, and I'm not too bothered if this kind of error does not result in a bundle being submitted to the dashboard.
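To make the "check everything up front" idea concrete, here is a rough sketch -- the ACTIONS registry and check_parameters() method are hypothetical names, just to show the shape of it:

    # Hypothetical sketch of up-front job validation; ACTIONS and
    # check_parameters() are invented names, not existing dispatcher API.
    class JobError(Exception):
        """The job file itself is broken; reject it before running anything."""

    ACTIONS = {}  # would map action names to Action classes

    def validate_job(job):
        # Instantiate every action and sanity-check its parameters before
        # anything runs, so a typoed action name or a missing parameter
        # fails the job immediately (ideally already at submit_job time).
        actions = []
        for step in job.get("actions", []):
            name = step.get("command")
            if name not in ACTIONS:
                raise JobError("unknown action: %r" % name)
            action = ACTIONS[name]()
            action.check_parameters(step.get("parameters", {}))
            actions.append(action)
        return actions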
Category 3 -- failing tests -- is not something the dispatcher itself should care about, at all.
Category 4 -- tests that fail to run, I guess this includes things like lava-test hanging or crashing -- I think this should result in a failure in the magic 'lava' test run of the submitted bundle (and I think this is the case today). If this happens, I don't know if we should try to continue to execute other actions (apart from submit_results I guess).
Category 5 -- things like the board hanging at some point -- I guess this depends a bit on when it happens. If it's during deployment, say, that's pretty bad -- certainly we shouldn't try to execute other non-reporting actions if this happens.
Category 6 -- infrastructure failures -- seem to fall into the same kind of area as the previous 2: we should report a failure and stop executing any other actions.
Category 7 -- recoverable failures -- are a bit different. We should log that they have happened and continue.
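For the master image example, something along these lines is what I mean -- the client methods here are stand-ins for whatever the real calls are, the point is just that the failure gets logged and the job carries on:

    # Sketch of a category 7 failure: log it and recover. The client
    # methods used here are stand-ins, not the real API.
    import logging

    class RecoverableError(Exception):
        pass

    def ensure_master_image(client):
        # We want to be in the master image; if we turn out to be in the
        # linaro image instead, note it and reboot into the master image.
        try:
            client.check_in_master_shell()
        except RecoverableError as exc:
            logging.warning("not in the master image (%s); rebooting into it", exc)
            client.boot_master_image()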
After all this thinking and typing I think I've come to some conclusions:
1) There is a category of error where we should just stop. As far as I know, this isn't really handled today.
2) There is another category of error where we should report failure, but not execute any other actions. This makes me think that having 'report results' as an action is a bit strange -- perhaps the dashboard and bundle stream should really be directly in the job, rather than an action?
3) I need to write another, probably equally long, email about dependencies between actions :-)
Cheers, mwh
On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle < michael.hudson@linaro.org> wrote: ...
After all this thinking and typing I think I've come to some conclusions:
- There is a category of error where we should just stop. As far as I
know, this isn't really handled today.
- There is another category of error where we should report failure,
but not execute any other actions. This makes me think that having 'report results' as an action is a bit strange -- perhaps the dashboard and bundle stream should really be directly in the job, rather than an action?
ISTR we defined some exceptions in lava as being Fatal, or Non-Fatal - the
idea being that there would be subclasses of those to add detail. That way we don't need to decide on every single error, just how far to pass it up before someone can take action on it. The fatal ones of course would be the ones where we just can't reasonably expect to proceed and gain anything from it (ex. image fails to deploy).
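Something like this is the shape I mean -- the concrete subclasses are only examples, not necessarily what lava defines today:

    # Example only: the concrete subclass names are illustrative.
    class DispatcherError(Exception):
        """Base class for everything the dispatcher knows how to report."""

    class FatalError(DispatcherError):
        """No point carrying on; stop the job and report what we have."""

    class NonFatalError(DispatcherError):
        """Worth logging, but the remaining actions can still run."""

    class ImageDeployError(FatalError):
        """The image never deployed, so nothing after it can work."""

    class MasterImageFallbackError(NonFatalError):
        """We expected the master image but can recover by rebooting into it."""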
- I need to write another, probably equally long, email about
dependencies between actions :-)
Ah yes, we spoke a bit about that recently. I'd love to hear your ideas on it.
On Thu, Nov 10, 2011 at 4:46 PM, Paul Larson paul.larson@linaro.org wrote:
On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle < michael.hudson@linaro.org> wrote: ...
After all this thinking and typing I think I've come to some conclusions:
- There is a category of error where we should just stop. As far as I
know, this isn't really handled today.
- There is another category of error where we should report failure,
but not execute any other actions. This makes me think that having 'report results' as an action is a bit strange -- perhaps the dashboard and bundle stream should really be directly in the job, rather than an action?
ISTR we defined some exceptions in lava as being Fatal, or Non-Fatal -
the idea being that there would be subclasses of those to add detail. That way we don't need to decide on every single error, just how far to pass it up before someone can take action on it. The fatal ones of course would be the ones where we just can't reasonably expect to proceed and gain anything from it (ex. image fails to deploy).
- I need to write another, probably equally long, email about
dependencies between actions :-)
Ah yes, we spoke a bit about that recently. I'd love to hear your ideas on it.
I guess I forgot to add some things to the previous email... I'm mainly interested in 2 things when it comes to any of these errors:
1. highlighting them in a way that makes it easy for us to find out when something goes wrong. I think some of this, perhaps, goes along with another conversation about parsing the serial log and splitting it up into sections.
2. capturing the full backtrace (we're better about this now I think, I have had much less frustration with this lately -- a quick sketch of what I mean is below)
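For the backtrace point, the pattern I mean is basically just this (a sketch, not a particular spot in the code):

    # Minimal sketch: logging.exception() records the full traceback
    # along with the message wherever an error is handled.
    import logging

    def run_step(step):
        try:
            step()
        except Exception:
            logging.exception("step %r blew up", step)
            raise  # still let the caller decide whether this is fatal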
Thanks, Paul Larson
re-re-replying because I got a bounce on zymunt's[sic] email address. :)
On Thu, 10 Nov 2011 16:48:47 -0600, Paul Larson paul.larson@linaro.org wrote:
On Thu, Nov 10, 2011 at 4:46 PM, Paul Larson paul.larson@linaro.org wrote:
On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle < michael.hudson@linaro.org> wrote: ...
After all this thinking and typing I think I've come to some conclusions:
- There is a category of error where we should just stop. As far as I
know, this isn't really handled today.
- There is another category of error where we should report failure,
but not execute any other actions. This makes me think that having 'report results' as an action is a bit strange -- perhaps the dashboard and bundle stream should really be directly in the job, rather than an action?
ISTR we defined some exceptions in lava as being Fatal, or Non-Fatal - the idea being that there would be subclasses of those to add detail. That way we don't need to decide on every single error, just how far to pass it up before someone can take action on it. The fatal ones of course would be the ones where we just can't reasonably expect to proceed and gain anything from it (ex. image fails to deploy).
Ah yeah, there is a CriticalError class. Come to think of it, I think I'm mostly unsettled[1] by how _other_ errors are handled -- basically I think they should cause the dispatcher to exit, immediately, with code != 0. Instead we currently usually try to send a bundle to the dashboard, which if we're in some bizarro situation often fails and fails in a way that obscures the original problem!
[1] most of the rest is the fact that we currently don't even think about errors in the job file.
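Concretely, the top level I'd like looks roughly like this -- run_actions and submit_results are stand-ins for the real entry points, not the dispatcher's actual functions:

    # Rough sketch of the behaviour argued for above.
    import sys
    import traceback

    class CriticalError(Exception):
        """An anticipated, fatal failure (deploy failed, board lost, ...)."""

    def run_actions(job):
        """Stand-in for executing the job's actions in order."""

    def submit_results(job, error=None):
        """Stand-in for pushing whatever bundle we have to the dashboard."""

    def run_job(job):
        try:
            run_actions(job)
        except CriticalError as exc:
            # Anticipated fatal failure: report it, then exit != 0 so the
            # scheduler can show the job as failed.
            submit_results(job, error=str(exc))
            sys.exit(1)
        except Exception:
            # A dispatcher bug: don't try to submit a bundle from a bizarro
            # state (that tends to obscure the original problem); just dump
            # the traceback and exit != 0 immediately.
            traceback.print_exc()
            sys.exit(2)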
- I need to write another, probably equally long, email about
dependencies between actions :-)
Ah yes, we spoke a bit about that recently. I'd love to hear your ideas on it.
I'll get my thinking cap on then!
I guess I forgot to add some things to the previous email... I'm mainly interested in 2 things when it comes to any of these errors:
- highlighting them in a way that makes it easy for us to find out when
something goes wrong.
Yes, that's true (a good bit of user focus!). I guess it's worth thinking about who cares about a particular class of error...
category 1 -- bugs -- that'd be us, the validation team
category 2 -- errors in the job file -- whoever submitted the job
category 3 -- failing tests -- whoever submitted the job
category 4 -- tests that fail to even run -- this one is harder to call, I guess. Probably whoever submitted the tests as a first port of call, though; it's kind of similar to a failing test.
category 5 -- board hangs -- depends on when it happens, and even on what the test is testing. If it's during boot_linaro_image of a kernel CI test, then it should be the kernel team. If it's booting a supposedly known good image for toolchain testing, then it's probably a lab issue. I guess _most_ of the time it's going to be a failure of what is being tested.
category 6 -- infrastructure failure -- hard to say, again. l-m-c fails sometimes because of bugs and sometimes due to duff hwpacks.
category 7 -- recoverable errors -- the validation team might care; certainly no one else will.
So that was worth thinking through.
I think in general errors that should be interpreted by the job submitter should be reported primarily in the dashboard, although I don't really have a good idea for how to report errors in the job file.
One way of distinguishing things like l-m-c bugs from duff input data is to look across jobs -- if all deploy steps are failing, something is probably wrong in the lab. This obviously requires a wider view of what's going on than the dispatcher has :-) but perhaps means we should think about hooking the dispatcher up to statsd somehow or other.
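As a sketch of what hooking up to statsd could mean -- the metric name and the host/port here are assumptions, but the payload is just the plain statsd counter format sent over UDP:

    # Sketch: emit a statsd counter when (say) a deploy fails, so failures
    # can be aggregated across jobs. Metric name, host and port are
    # assumptions.
    import socket

    def statsd_incr(metric, host="localhost", port=8125):
        # Send one counter increment in the plain statsd UDP format.
        payload = ("%s:1|c" % metric).encode("ascii")
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(payload, (host, port))
        finally:
            sock.close()

    # e.g. from the deploy action's error path:
    # statsd_incr("lava.dispatcher.deploy.failed")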
I think some of this, perhaps, goes along with another conversation about parsing the serial log and splitting it up into sections.
Yeah, and submitting these sections to the dashboard I think (see above -- I don't think people should look at the scheduler pages most of the time).
- capturing the full backtrace (we're better about this now I think, I
have had much less frustration with this lately)
Yeah, I think I mostly squished that problem :-)
Cheers, mwh