On 30.03.2012 00:00, Michael Hudson-Doyle wrote:
On Thu, 29 Mar 2012 11:07:32 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
On 29.03.2012 06:33, Michael Hudson-Doyle wrote:
On Mon, 26 Mar 2012 19:55:59 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
I've registered another blueprint, this time on LAVA Dispatcher. The goal is similar to the previous blueprint about the device manager: to gather feedback and to propose some small sub-blueprints that could be scheduled in 2012.04.
The general goal is to improve the way we run tests by making them more reliable and more featureful (richer in state) at the same time.
Please read and comment on the mailing list
https://blueprints.launchpad.net/lava-dispatcher/+spec/lava-dispatcher-de-em...
I basically like this. Let's do it.
Glad to hear that.
:)
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
I'll write a more concrete proposal. I'm +1 on starting small and doing iterations, but I want to make it clear that the goal is to have something entirely different from what we have today; we'll start with the dispatcher source code, but let's keep our minds open ;-)
The one part of this that we can't really take this attitude on is the interface we present to our users, i.e. the format of the job file. Do we want to/need to change that? Can we provide a transition plan if we do need to change it?
I don't think we need to change that yet. In fact, I don't want to allow changing that until we deal with this set of problems. Everything so far should be backwards compatible.
I'm not amazingly attached to the current format; it's a bit procedural and redundant in some ways. A format that allows our users to express what they _mean_ a bit more would be good -- I think with the current format we're showing our guts to the world a bit :) Slightly separate discussion, I suppose.
I agree with your points and I don't mind discussing that now in a separate thread. I think the biggest issue currently is that people expect to express themselves with a GUI, documentation, or examples. We kind of lack all three, beyond "look at the existing jobs".
I'd like to lay down a plan for how the implementation will evolve; at each milestone we should be able to deploy this to production with full confidence.
I think there is a second side to incremental -- we don't want to have to redo the master image for each board for each iteration (and I don't see any reason below why we might have to, just making the point).
Yeah, I knew you'd bring that up. For the moment I don't see a better way. I think we should rotate master images every month regardless. We may cheat this by just running an upgrade script on the master if that's a time saver. Right now I'd rather be correct than convenient.
+0.0 (current dispatcher tree)
+0.1, replace external programs with lava-serial; serial config reduces to a connection class (direct, networked) plus constructor data
Side note: with lava-device-manager the dispatcher would get this from the outside and would be thus 100% config-free
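To make the "class plus constructor data" idea concrete, here is a rough sketch of what the connection layer could look like. All class and function names here are hypothetical, not an existing lava-serial API, and a real direct connection would wrap pyserial rather than just storing fields:

```python
class SerialConnection:
    """Something the dispatcher can read/write a board console through."""

    def write(self, data):
        raise NotImplementedError

    def read(self, size):
        raise NotImplementedError


class DirectSerialConnection(SerialConnection):
    """Locally attached serial port; a real one would wrap pyserial."""

    def __init__(self, device, baudrate=115200):
        self.device = device
        self.baudrate = baudrate


class NetworkedSerialConnection(SerialConnection):
    """Console exported over TCP, e.g. by a terminal server or ser2net."""

    def __init__(self, host, port):
        self.host = host
        self.port = port


def make_connection(config):
    """Build a connection from a class name plus constructor data, so the
    dispatcher itself needs no per-connection special cases -- and the
    config dict could later come from the outside (lava-device-manager)."""
    classes = {
        "direct": DirectSerialConnection,
        "networked": NetworkedSerialConnection,
    }
    return classes[config["class"]](**config["args"])


conn = make_connection({"class": "networked",
                        "args": {"host": "lab-ts1", "port": 7001}})
```

The point of the factory is exactly the side note above: once config is just data, it can be handed to the dispatcher from outside and the dispatcher becomes config-free.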
+0.2, add a mini master agent to the master rootfs; make it accept shell commands over IP, with the master image driven through one RPC method similar to subprocess.Popen(). The dispatcher learns the board's IP over serial.
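As a sketch of that +0.2 idea -- using XML-RPC purely for illustration, since the actual protocol is undecided -- the single Popen-like method could look like this (the demo runs agent and dispatcher in one process on localhost):

```python
import subprocess
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def run(command):
    """The one Popen-like method: run a shell command on the master image
    and return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out.decode(), err.decode()


# Master agent side: listen on an IP port (ephemeral here for the demo).
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(run)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Dispatcher side: having learned the agent's address over serial, drive
# the master image over IP instead of scripting its shell prompt.
agent = ServerProxy("http://127.0.0.1:%d/" % port)
code, out, err = agent.run("echo hello-from-master")
server.shutdown()
```

The win over the current approach is that command results come back as structured values (return code, stdout, stderr) rather than text scraped from a serial console.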
+0.3, add an improved master image agent with extra specialized methods for deployment; no shell during image deployment (download and copy to partition driven from Python)
Yeah, so with the current reasons for health job failures, this is the bit I really want to see asap :)
Anyone interested in fast models?
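A minimal sketch of the shell-free deployment in +0.3. Here the "partition" is just a file path and the image is a local file; a real version would stream the image over HTTP, decompress it, and open a block device like /dev/mmcblk0p2. deploy_image and its checksum return value are assumptions for illustration, not existing dispatcher code:

```python
import hashlib
import tempfile


def deploy_image(source, partition, chunk_size=1 << 20):
    """Stream an image onto a partition device and return a checksum the
    dispatcher can verify -- no dd, no shell, just Python file I/O."""
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(partition, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


# Demo with temporary files standing in for the downloaded image and the
# target partition.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"fake-root-image")
src.close()
dst = tempfile.NamedTemporaryFile(delete=False)
dst.close()
checksum = deploy_image(src.name, dst.name)
```

Driving the copy from Python is what makes the checksum verification possible; that is precisely the class of health-job failure that pattern-matched shell output cannot catch reliably.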
+0.4, add a mini test agent to the test image before reboot; it mounts the test rootfs and unpacks a tarball from the master image (so the agent code stays synchronized with the master image version). The test agent supplements the current serial-scripted code with simple methods (IP discovery, maybe shell execution as in +0.2).
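The tarball synchronization in +0.4 could work roughly like this (pack_agent/unpack_agent are hypothetical names, and the payload here is a stand-in for the real agent code):

```python
import io
import os
import tarfile
import tempfile


def pack_agent(payload):
    """Master image side: bundle the agent code (path -> bytes) so the
    test agent version always matches the master image that shipped it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in payload.items():
            info = tarfile.TarInfo(path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_agent(blob, dest):
    """Test image side: unpack the agent tarball onto the mounted rootfs."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        tar.extractall(dest)


blob = pack_agent({"lava-test-agent/VERSION": b"2012.03\n"})
dest = tempfile.mkdtemp()
unpack_agent(blob, dest)
```

Shipping the agent from the master image, rather than baking it into every test image, means rotating the master image is the only upgrade path we need to worry about.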
+0.5, the test agent drives the whole test process; the dispatcher job is copied over by the master agent and data is saved to the testrootfs partition (TODO: maybe we should pick a better location?)
I think the test agent should store the test results on its own rootfs -- what else could it do?
A dedicated partition -- we still have two free in the MBR extended table. It would have nice properties; we could even store data in looped mode. I don't want to add too many problems, but using the rootfs for that feels unsafe.
BTW: I'm not terribly up-to-date on Android tests, do they need to store log files on the target? If not we can just reuse the sdcard partition.
+0.6, the master agent takes over sending the data back to the dispatcher; no hacky/racy web servers, clean code on both ends
How does this sound? I just wrote it off the top of my head, no deeper thoughts yet.
I think it's basically fine.
Then let's start cranking out work items and blueprints for the dedicated projects. Some of the work will happen in the master image scripts; the rest can be in the dispatcher. I'm happy to keep the master scripts simple and just pip install something from the dispatcher (hint: dispatcher[master-agent] maybe?)
The only thing I'd change is that I don't really see a reason to *not* spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events is based on.
Thinking about it, yeah, we may just spam the serial line for now. Ideally I'd like to get perfect separation of sources without losing correlated time. Imagine a scheduler page that has filters for kernel messages (without relying on flaky pattern matching) and can highlight them perfectly in the application run log.
Oh yes, that would be totally awesome. But we shouldn't have a less useful page in the meantime for no good reason.
You're right, we'll keep the logs for now.
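For the record, the "separate sources with correlated time" idea could be sketched as structured log records rather than a raw byte stream. MultiplexedLog is a hypothetical name, not a planned API:

```python
import time


class MultiplexedLog:
    """Keep lines from several sources (kernel, serial console, test
    runner) as structured records: filtering by source needs no pattern
    matching, and a shared timestamp keeps the streams correlated."""

    def __init__(self):
        self.records = []

    def emit(self, source, line, ts=None):
        self.records.append({
            "ts": time.time() if ts is None else ts,
            "source": source,
            "line": line,
        })

    def filtered(self, source):
        """A scheduler-page filter: just one source, no regexes."""
        return [r for r in self.records if r["source"] == source]

    def interleaved(self):
        """The full correlated view, ordered by the shared clock."""
        return sorted(self.records, key=lambda r: r["ts"])


log = MultiplexedLog()
log.emit("kernel", "mmc0: new high speed SDHC card", ts=1.0)
log.emit("test", "test-case-1: pass", ts=2.0)
log.emit("kernel", "usb 1-1: device descriptor read error", ts=3.0)
```

Since every record carries both a source tag and a timestamp, the scheduler page could show only kernel messages or highlight them inline in the application run log, with no flaky pattern matching either way.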
Thanks for the feedback ZK