I'm working on the dispatcher again, and as usual I'm getting a bit grumpy. Partly this is because the abstractions I added a couple of weeks ago aren't really working, but that's my problem. What I want to complain about is the error handling.
I've always found the error handling in the dispatcher to be a little strange, so in this mail I'm going to try to set out the issue and maybe propose a solution.
A lot of the confusion comes from the fact that there are many things that could be called an error in the dispatcher, and some overloading in the way we try to handle and report errors.
Kinds of error that I can think of immediately:
1) Flat out bugs in the dispatcher
2) Errors in the job file (e.g. typoing an action name)
3) Failing tests
4) Tests that fail to even run
5) The board hanging at some point
6) Infrastructure failures, like l-m-c or deployment failing
7) Recoverable errors, e.g. where we want to be in the master image but might be in the linaro image
Category 1 bugs we can't really do *too* much about -- if we have incorrect code, who's to say that the error handling code will be correct? -- but in general dispatcher bugs should IMO cause the dispatcher to exit with an exit code != 0 (we talked at the connect about having the scheduler display jobs that exited like this differently). Currently because we have lots of "except:"s all over the place, I think we risk swallowing genuine bugs (indeed, we had one of these for a while, where we said "self.in_master_shell()" instead of "self.client.in_master_shell()" but the AttributeError was caught).
If we hit this sort of thing, I don't think we should try to continue to execute other actions and in particular I'm not too bothered if this kind of error does not result in a bundle being submitted to the dashboard.
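To make the swallowed-bug problem concrete, here is a minimal sketch. The class names and the "board error" return value are made up for illustration and are not the real dispatcher code; only the in_master_shell() mix-up is taken from the bug described above.

```python
class Client:
    """Hypothetical stand-in for the object that talks to the board."""

    def in_master_shell(self):
        return True


class Action:
    """Hypothetical action that holds a client, as in the bug described above."""

    def __init__(self):
        self.client = Client()

    def run_buggy(self):
        try:
            # Bug: this should be self.client.in_master_shell()
            return self.in_master_shell()
        except:  # a bare except also catches the AttributeError from the typo
            return "board error"

    def run_narrow(self):
        try:
            # Same bug, but only the expected failure type is caught
            return self.in_master_shell()
        except OSError:
            return "board error"
```

With run_buggy() the typo is reported as if the board had misbehaved; with run_narrow() the AttributeError propagates, so the dispatcher would die with a traceback and a non-zero exit code.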
Category 2 -- errors in the job file -- we should try to detect as early as possible. Maybe we should even create the action objects and do a simple argument check before doing anything else? I think this sort of error should also result in an exit code != 0. To the extent that it's reasonable, invalid job files should be rejected by the submit_job API call.
As with bugs, I don't think we should try to continue to execute other actions and I'm not too bothered if this kind of error does not result in a bundle being submitted to the dashboard.
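The "simple argument check before doing anything else" could look roughly like the sketch below. Everything in it is an assumption for illustration: the ACTIONS registry, the two action functions, the JobFileError name, and the job layout (a list of entries with "command" and "parameters" keys) are hypothetical, not the dispatcher's actual code.

```python
import inspect


class JobFileError(Exception):
    """Hypothetical exception for an invalid job file (category 2)."""


def deploy_image(hwpack, rootfs):
    """Placeholder action; a real one would talk to the board."""


def run_test(test_name, timeout=3600):
    """Placeholder action; a real one would talk to the board."""


# Hypothetical registry mapping action names to callables.
ACTIONS = {"deploy_image": deploy_image, "run_test": run_test}


def validate_job(job):
    """Check every action name and its arguments before running anything."""
    for index, entry in enumerate(job.get("actions", [])):
        command = entry.get("command")
        if command not in ACTIONS:
            raise JobFileError("action %d: unknown command %r" % (index, command))
        try:
            # bind() raises TypeError for missing or unexpected arguments
            inspect.signature(ACTIONS[command]).bind(**entry.get("parameters", {}))
        except TypeError as exc:
            raise JobFileError("action %d (%s): %s" % (index, command, exc))
```

Because the check only inspects names and signatures, it never touches a board, so the same function could in principle be called from the submit_job side as well.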
Category 3 -- failing tests -- is not something the dispatcher itself should care about, at all.
Category 4 -- tests that fail to run, which I guess includes things like lava-test hanging or crashing -- should I think result in a failure in the magic 'lava' test run of the submitted bundle (and I think this is the case today). If this happens, I don't know if we should try to continue to execute other actions (apart from submit_results I guess).
Category 5 -- things like the board hanging at some point -- I guess this depends a bit on when this happens. If it's during deployment, say, that's pretty bad -- certainly we shouldn't try to execute other non-reporting actions if this happens.
Category 6 -- infrastructure failures -- seem to fall into the same kind of area as the previous two: we should report a failure and stop executing any other actions.
Category 7 -- recoverable failures -- are a bit different. We should log that they have happened and continue.
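One way to picture the behaviour described for the different categories is a small exception hierarchy plus a run loop, sketched below. The names CriticalError, RecoverableError and run_actions are hypothetical, and treating a recoverable error as "skip to the next action" is a simplification; in practice the recovery would probably happen inside the action itself.

```python
import logging


class CriticalError(Exception):
    """Hypothetical: categories 4-6. Report a failure and stop running actions."""


class RecoverableError(Exception):
    """Hypothetical: category 7. Log it and carry on."""


def run_actions(actions, submit_results):
    """Run (name, callable) pairs in order, then report any failures.

    Any exception that is neither CriticalError nor RecoverableError is
    treated as a dispatcher bug: it propagates, nothing is submitted, and
    the process would exit with a non-zero exit code.
    """
    failures = []
    for name, action in actions:
        try:
            action()
        except RecoverableError as exc:
            logging.warning("%s: recoverable error: %s", name, exc)
        except CriticalError as exc:
            failures.append((name, str(exc)))
            break  # do not run any further non-reporting actions
    submit_results(failures)
    return failures
```

Note that reporting sits outside the loop here rather than being one of the actions, so it still happens after a critical error but not after a bug.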
After all this thinking and typing I think I've come to some conclusions:
1) There is a category of error where we should just stop. As far as I know, this isn't really handled today.
2) There is another category of error where we should report failure, but not execute any other actions. This makes me think that having 'report results' as an action is a bit strange -- perhaps the dashboard and bundle stream should really be directly in the job, rather than an action?
3) I need to write another, probably equally long, email about dependencies between actions :-)
Cheers, mwh