On Thu, 29 Mar 2012 11:07:32 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
On 29.03.2012 06:33, Michael Hudson-Doyle wrote:
On Mon, 26 Mar 2012 19:55:59 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
I've registered another blueprint. This time on LAVA Dispatcher. The goal is similar to that of the previous blueprint about the device manager: to gather feedback and to propose some small-step sub-blueprints that could be scheduled in 2012.04.
The general goal is to improve the way we run tests by making them more reliable and more featureful (richer in state) at the same time.
Please read and comment on the mailing list
https://blueprints.launchpad.net/lava-dispatcher/+spec/lava-dispatcher-de-em...
I basically like this. Let's do it.
Glad to hear that.
:)
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
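To make that musing concrete, here is a purely hypothetical fragment (written as a Python dict; neither the "runs_on" key nor these exact action names are real, they are just my invention for illustration) of what a job file that distinguishes host-side from board-side actions might look like:

    job = {
        "actions": [
            # runs on the host: fetch and deploy the image (hypothetical name)
            {"command": "deploy_image", "runs_on": "host",
             "parameters": {"image": "http://example.org/rootfs.img.gz"}},
            # runs on the board, driven by the agent (hypothetical name)
            {"command": "lava_test_run", "runs_on": "board",
             "parameters": {"test_name": "stream"}},
        ],
    }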
I'll write a more concrete proposal. I'm +1 on starting small and doing iterations, but I want to make it clear that the goal is to have something entirely different from what we have today; we'll start with the dispatcher source code but let's keep our minds open ;-)
The one part of this where we can't really take that attitude is the interface we present to our users, i.e. the format of the job file. Do we want to/need to change that? Can we provide a transition plan if we do need to change it?
I'm not amazingly attached to the current format, it's a bit procedural and redundant in some ways; a format that allows our users to express what they _mean_ a bit more would be good -- I think with the current format we're showing our guts to the world a bit :) Slightly separate discussion, I suppose.
I'd like to lay down a plan for how the implementation will evolve; at each milestone we should be able to deploy this to production with full confidence.
I think there is a second side to incremental -- we don't want to have to redo the master image for each board for each iteration (and I don't see any reason below why we might have to, just making the point).
+0.0 (current dispatcher tree)
+0.1, replace external programs with lava-serial; serial config is now constrained to a serial class (direct, networked) and constructor data
Side note: with lava-device-manager the dispatcher would get this from the outside and would be thus 100% config-free
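For illustration, a minimal sketch of what that abstraction could look like (the class and method names are my invention, not the actual lava-serial API):

    import socket
    import serial  # pyserial


    class DirectSerialConnection(object):
        """Board attached to a local serial port."""

        def __init__(self, device, baudrate=115200):
            self.port = serial.Serial(device, baudrate, timeout=1)

        def read(self, size=1):
            return self.port.read(size)

        def write(self, data):
            self.port.write(data)


    class NetworkedSerialConnection(object):
        """Board reachable through a console server (e.g. ser2net)."""

        def __init__(self, host, port):
            self.sock = socket.create_connection((host, port))

        def read(self, size=1):
            return self.sock.recv(size)

        def write(self, data):
            self.sock.sendall(data)

The dispatcher (or, later, lava-device-manager) would then supply just the class choice and the constructor data.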
+0.2, add mini master agent to master rootfs, make it accept shell commands over IP, master image is scripted with one RPC method similar to subprocess.Popen(). Dispatcher learns of the board IP over serial.
+0.3, add improved master image agent, extra specialized methods for deployment, no shell during image deployment (download and copy to partition driven from python)
Yeah, so with the current reasons for health job failures, this is the bit I really want to see asap :)
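To sketch what the +0.2/+0.3 agent might look like (the method names and the choice of SimpleXMLRPCServer are assumptions on my part, not a settled design):

    import subprocess
    import urllib2
    from SimpleXMLRPCServer import SimpleXMLRPCServer


    class MasterAgent(object):

        def run(self, args):
            # The one RPC method shaped like subprocess.Popen, as in +0.2.
            proc = subprocess.Popen(
                args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            stdout, stderr = proc.communicate()
            return {"returncode": proc.returncode,
                    "stdout": stdout, "stderr": stderr}

        def deploy_image(self, url, partition):
            # +0.3: download and copy to a partition from python, no shell.
            response = urllib2.urlopen(url)
            target = open(partition, "wb")
            try:
                chunk = response.read(1 << 20)
                while chunk:
                    target.write(chunk)
                    chunk = response.read(1 << 20)
            finally:
                target.close()
            return True


    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_instance(MasterAgent())
    server.serve_forever()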
+0.4, add mini test agent to test image before reboot, mounts testrootfs and unpacks a tarball from master image (so that agent code is synchronized to master image version), test agent supplements current serial scripted code with simple methods (IP discovery, maybe shell execution as in +0.2)
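Roughly, the bootstrap step in +0.4 could look like this, run on the master image before rebooting into the test image (the partition label, mount point and tarball path are all made up):

    import subprocess
    import tarfile

    TESTROOTFS = "/dev/disk/by-label/testrootfs"   # assumed label
    MOUNT_POINT = "/mnt/testrootfs"
    AGENT_TARBALL = "/opt/lava/test-agent.tar.gz"  # shipped with the master image


    def install_test_agent():
        # Mount the test rootfs so we can drop the agent into it.
        subprocess.check_call(["mount", TESTROOTFS, MOUNT_POINT])
        try:
            # Unpack the agent that matches this master image version.
            tarball = tarfile.open(AGENT_TARBALL)
            try:
                tarball.extractall(MOUNT_POINT)
            finally:
                tarball.close()
        finally:
            subprocess.check_call(["umount", MOUNT_POINT])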
+0.5, test agent drives the whole test process, dispatcher job copied by the master agent, data saved to testrootfs partition (TODO: maybe we should pick a better location?)
I think the test agent should store the test results on its own rootfs -- what else could it do?
+0.6, master agent takes over the part of sending the data back to the dispatcher, no hacky/racy webservers, clean code on both ends
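One possible shape for +0.6 (again just an assumption, not a decided design) is to let the dispatcher pull the result bundle through the master agent's existing RPC channel instead of running a throwaway web server of its own:

    import xmlrpclib


    def collect_results(agent_url, bundle_path):
        # read_file() would be a new agent method returning file contents
        # (hypothetical, wrapped in xmlrpclib.Binary for binary payloads).
        agent = xmlrpclib.ServerProxy(agent_url)
        return agent.read_file(bundle_path)

    # bundle = collect_results("http://192.168.1.50:8000/",
    #                          "/mnt/testrootfs/lava/results/bundle.json")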
How does this sound? I just wrote it off the top of my head, no deeper thoughts yet.
I think it's basically fine.
The only thing I'd change is that I don't really see a reason *not* to spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events is based on.
Thinking about it, yeah, we may just spam the serial line for now. Ideally I'd like to be able to get perfect separation of sources without losing correlated time. Imagine a scheduler page that has filters for kernel messages (without relying on flaky pattern matching) and can highlight them perfectly in the application run log.
Oh yes, that would be totally awesome. But we shouldn't have a less useful page in the meantime for no good reason.
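For the record, the kind of log record that would enable that filtering might be as simple as this (purely illustrative, no such format exists yet):

    import json
    import time


    def log_record(source, message):
        # A shared timestamp keeps the sources correlated; the tag lets
        # the UI split kernel, console and test output apart without
        # pattern matching. The "source" values are invented examples.
        return json.dumps({
            "ts": time.time(),
            "source": source,   # e.g. "kernel", "console", "lava-test"
            "message": message,
        })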
Cheers, mwh