On 30.03.2012 00:00, Michael Hudson-Doyle wrote:
On Thu, 29 Mar 2012 11:07:32 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
On 29.03.2012 06:33, Michael Hudson-Doyle wrote:
On Mon, 26 Mar 2012 19:55:59 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
I've registered another blueprint, this time on LAVA Dispatcher. The goal is similar to the previous blueprint about the device manager: to gather feedback and to propose some small sub-blueprints that could be scheduled in 2012.04.
The general goal is to improve the way we run tests by making them more reliable and more featureful (richer in state) at the same time.
Please read and comment on the mailing list
https://blueprints.launchpad.net/lava-dispatcher/+spec/lava-dispatcher-de-em...
I basically like this. Let's do it.
Glad to hear that.
:)
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
I'll write a more concrete proposal. I'm +1 on starting small and doing iterations, but I want to make it clear that the goal is to have something entirely different from what we have today; we'll start with the dispatcher source code, but let's keep our minds open ;-)
The one part of this that we can't really take this attitude on is the interface we present to our users, i.e. the format of the job file. Do we want to/need to change that? Can we provide a transition plan if we do need to change it?
I don't think we need to change that yet. In fact, I don't want to allow changing that until we deal with this set of problems. Everything so far should be backwards compatible.
I'm not amazingly attached to the current format; it's a bit procedural and redundant in some ways. A format that allows our users to express what they _mean_ a bit more would be good -- I think with the current format we're showing our guts to the world a bit :) Slightly separate discussion, I suppose.
I agree with your points and I don't mind discussing that now in a separate thread. I think the biggest issue currently is that people expect to express themselves with a GUI, documentation, or examples. We kind of lack all three, beyond "look at the existing jobs".
I'd like to lay down a plan for how the implementation will evolve; at each milestone we should be able to deploy this to production with full confidence.
I think there is a second side to incremental -- we don't want to have to redo the master image for each board for each iteration (and I don't see any reason below why we might have to, just making the point).
Yeah, I knew you'd bring that up. For the moment I don't see a better way. I think we should rotate master images every month regardless. We may cheat this by just running an upgrade script on the master if that's a time saver. Right now I'd rather be correct than convenient.
+0.0 (current dispatcher tree)
+0.1, replace external programs with lava-serial; serial config reduces to a connection class (direct, networked) plus constructor data
Side note: with lava-device-manager the dispatcher would get this from the outside and would be thus 100% config-free
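To make the "class plus constructor data" idea concrete, here is a rough sketch of what the connection layer could look like. All class and function names here are hypothetical, not an existing lava-serial API, and a real direct connection would wrap pyserial rather than just storing fields:

```python
class SerialConnection:
    """Something the dispatcher can read/write a board console through."""

    def write(self, data):
        raise NotImplementedError

    def read(self, size):
        raise NotImplementedError


class DirectSerialConnection(SerialConnection):
    """Locally attached serial port; a real one would wrap pyserial."""

    def __init__(self, device, baudrate=115200):
        self.device = device
        self.baudrate = baudrate


class NetworkedSerialConnection(SerialConnection):
    """Console exported over TCP, e.g. by a terminal server or ser2net."""

    def __init__(self, host, port):
        self.host = host
        self.port = port


def make_connection(config):
    """Build a connection from a class name plus constructor data, so the
    dispatcher itself needs no per-connection special cases -- and the
    config dict could later come from the outside (lava-device-manager)."""
    classes = {
        "direct": DirectSerialConnection,
        "networked": NetworkedSerialConnection,
    }
    return classes[config["class"]](**config["args"])


conn = make_connection({"class": "networked",
                        "args": {"host": "lab-ts1", "port": 7001}})
```

The point of the factory is exactly the side note above: once config is just data, it can be handed to the dispatcher from outside and the dispatcher becomes config-free.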
+0.2, add a mini master agent to the master rootfs; make it accept shell commands over IP, with the master image driven through one RPC method similar to subprocess.Popen(). The dispatcher learns the board's IP over serial.
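As a sketch of that +0.2 idea -- using XML-RPC purely for illustration, since the actual protocol is undecided -- the single Popen-like method could look like this (the demo runs agent and dispatcher in one process on localhost):

```python
import subprocess
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def run(command):
    """The one Popen-like method: run a shell command on the master image
    and return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out.decode(), err.decode()


# Master agent side: listen on an IP port (ephemeral here for the demo).
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(run)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Dispatcher side: having learned the agent's address over serial, drive
# the master image over IP instead of scripting its shell prompt.
agent = ServerProxy("http://127.0.0.1:%d/" % port)
code, out, err = agent.run("echo hello-from-master")
server.shutdown()
```

The win over the current approach is that command results come back as structured values (return code, stdout, stderr) rather than text scraped from a serial console.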
+0.3, add an improved master image agent with extra specialized methods for deployment; no shell during image deployment (download and copy to partition driven from Python)
Yeah, so with the current reasons for health job failures, this is the bit I really want to see asap :)
Anyone interested in fast models?
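A minimal sketch of the shell-free deployment in +0.3. Here the "partition" is just a file path and the image is a local file; a real version would stream the image over HTTP, decompress it, and open a block device like /dev/mmcblk0p2. deploy_image and its checksum return value are assumptions for illustration, not existing dispatcher code:

```python
import hashlib
import tempfile


def deploy_image(source, partition, chunk_size=1 << 20):
    """Stream an image onto a partition device and return a checksum the
    dispatcher can verify -- no dd, no shell, just Python file I/O."""
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(partition, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


# Demo with temporary files standing in for the downloaded image and the
# target partition.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"fake-root-image")
src.close()
dst = tempfile.NamedTemporaryFile(delete=False)
dst.close()
checksum = deploy_image(src.name, dst.name)
```

Driving the copy from Python is what makes the checksum verification possible; that is precisely the class of health-job failure that pattern-matched shell output cannot catch reliably.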
+0.4, add a mini test agent to the test image before reboot; it mounts the test rootfs and unpacks a tarball from the master image (so the agent code stays synchronized with the master image version). The test agent supplements the current serial-scripted code with simple methods (IP discovery, maybe shell execution as in +0.2).
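The tarball synchronization in +0.4 could work roughly like this (pack_agent/unpack_agent are hypothetical names, and the payload here is a stand-in for the real agent code):

```python
import io
import os
import tarfile
import tempfile


def pack_agent(payload):
    """Master image side: bundle the agent code (path -> bytes) so the
    test agent version always matches the master image that shipped it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, data in payload.items():
            info = tarfile.TarInfo(path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_agent(blob, dest):
    """Test image side: unpack the agent tarball onto the mounted rootfs."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        tar.extractall(dest)


blob = pack_agent({"lava-test-agent/VERSION": b"2012.03\n"})
dest = tempfile.mkdtemp()
unpack_agent(blob, dest)
```

Shipping the agent from the master image, rather than baking it into every test image, means rotating the master image is the only upgrade path we need to worry about.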
+0.5, the test agent drives the whole test process; the dispatcher job is copied over by the master agent and data is saved to the testrootfs partition (TODO: maybe we should pick a better location?)
I think the test agent should store the test results on its own rootfs -- what else could it do?
A dedicated partition -- we still have two free in the MBR extended table. It would have nice properties; we could even store data in looped mode. I don't want to add too many problems, but using the rootfs for that feels unsafe.
BTW: I'm not terribly up-to-date on Android tests, do they need to store log files on the target? If not we can just reuse the sdcard partition.
+0.6, the master agent takes over sending the data back to the dispatcher; no hacky/racy web servers, clean code on both ends
How does this sound? I just wrote it off the top of my head, no deeper thoughts yet.
I think it's basically fine.
Then let's start cranking out work items and blueprints for the dedicated projects. Some of the work will happen in the master image scripts; the rest can be in the dispatcher. I'm happy to keep the master scripts simple and just pip install something from the dispatcher (hint: dispatcher[master-agent] maybe?)
The only thing I'd change is that I don't really see a reason to *not* spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events is based on.
Thinking about it, yeah, we may just spam the serial line for now. Ideally I'd like to get perfect separation of sources without losing correlated time. Imagine a scheduler page that has filters for kernel messages (without relying on flaky pattern matching) and can highlight them perfectly in the application run log.
Oh yes, that would be totally awesome. But we shouldn't have a less useful page in the meantime for no good reason.
You're right, we'll keep the logs for now.
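For the record, the "separate sources with correlated time" idea could be sketched as structured log records rather than a raw byte stream. MultiplexedLog is a hypothetical name, not a planned API:

```python
import time


class MultiplexedLog:
    """Keep lines from several sources (kernel, serial console, test
    runner) as structured records: filtering by source needs no pattern
    matching, and a shared timestamp keeps the streams correlated."""

    def __init__(self):
        self.records = []

    def emit(self, source, line, ts=None):
        self.records.append({
            "ts": time.time() if ts is None else ts,
            "source": source,
            "line": line,
        })

    def filtered(self, source):
        """A scheduler-page filter: just one source, no regexes."""
        return [r for r in self.records if r["source"] == source]

    def interleaved(self):
        """The full correlated view, ordered by the shared clock."""
        return sorted(self.records, key=lambda r: r["ts"])


log = MultiplexedLog()
log.emit("kernel", "mmc0: new high speed SDHC card", ts=1.0)
log.emit("test", "test-case-1: pass", ts=2.0)
log.emit("kernel", "usb 1-1: device descriptor read error", ts=3.0)
```

Since every record carries both a source tag and a timestamp, the scheduler page could show only kernel messages or highlight them inline in the application run log, with no flaky pattern matching either way.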
Thanks for the feedback ZK