error handling and reporting in the dispatcher - linaro-dev

10 Nov 2011


      I'm working on the dispatcher again, and as usual I'm getting a bit
grumpy.  Partly this is because the abstractions I added a couple of
weeks ago aren't really working, but that's my problem.  What I want to
complain about is the error handling.
I've always found the error handling in the dispatcher to be a little
strange, so in this mail I'm going to try to set out the issue and maybe
propose a solution.
A lot of the confusion comes from the fact that there are many things
that could be called an error in the dispatcher, and some overloading in
the way we try to handle and report errors.
Kinds of error that I can think of immediately:
1) Flat out bugs in the dispatcher
2) Errors in the job file (e.g. typoing an action name)
3) Failing tests
4) Tests that fail to even run
5) The board hanging at some point
6) Infrastructure failures, like l-m-c or deployment failing
7) Recoverable errors, e.g. where we want to be in the master image but
   might be in the linaro image
Category 1 bugs we can't really do *too* much about -- if we have
incorrect code, who's to say that the error handling code will be
correct? -- but in general dispatcher bugs should IMO cause the
dispatcher to exit with an exit code != 0 (we talked at the connect
about having the scheduler display jobs that exited like this
differently).  Currently because we have lots of "except:"s all over the
place, I think we risk swallowing genuine bugs (indeed, we had one of
these for a while, where we said "self.in_master_shell()" instead of
"self.client.in_master_shell()" but the AttributeError was caught).
If we hit this sort of thing, I don't think we should try to continue to
execute other actions and in particular I'm not too bothered if this
kind of error does not result in a bundle being submitted to the
dashboard.
Category 2 -- errors in the job file -- we should try to detect as early
as possible.  Maybe we should even create the action objects and do a
simple argument check before doing anything else?  I think this sort of
error should also result in an exit code != 0.  To the extent that its
reasonable, invalid job files should be rejected by the submit_job api
call.
As for bugs, I don't think we should try to continue to execute other
actions and I'm not too bothered if this kind of error does not result
in a bundle being submitted to the dashboard.
Category 3 -- failing tests -- is not something the dispatcher itself
should care about, at all.
Category 4 -- tests that fail to run, I guess this includes things like
lava-test hanging or crashing -- I think this should result in a failure
in the magic 'lava' test run of the submitted bundle (and I think this
is the case today).  If this happens, I don't know if we should try to
continue to execute other actions (apart from submit_results I guess).
Category 5 -- things like the board hanging at some point -- I guess
this depends a bit when this happens.  If it's during deployment say,
that's pretty bad -- certainly we shouldn't try to execute other
non-reporting actions if this happens.
Category 6 -- infrastructure failures -- seem to fall into the same kind
of area as the previous 2: we should report a failure and stop executing
any other actions.
Category 7 -- recoverable failures -- are a bit different.  We should
log that they have happened and continue.
After all this thinking and typing I think I've come to some
conclusions:
1) There is a category of error where we should just stop.  As far as I
   know, this isn't really handled today.
2) There is another category of error where we should report failure,
   but not execute any other actions.  This makes me thing that having
   'report results' as an action is a bit strange -- perhaps the
   dashboard and bundle stream should really be directly in the job,
   rather than an action?
3) I need to write another, probably equally long, email about
   dependencies between actions :-)
Cheers,
mwh