On 08.12.2011 02:22, Michael Hudson-Doyle wrote:
On Wed, 7 Dec 2011 17:01:25 +0100, Zygmunt Krynicki <zygmunt.krynicki@linaro.org> wrote:
Hi, sorry for the topic, I wanted to catch your attention.
This is a quick brain dump based on my own observations/battle with master images last week.
- Running code via serial on the master image is a mess. It is very
fragile.
Is it really? It's a bit of a pain, but it seems this part actually works ok for us. It also has the advantage that all the logs are in one place.
<<CONMUX DISCONNECTED>>
@#!@$ !@ serial output, without any sensible way to break it down (and I don't count matching "# echo LAVA DISPATCHER: now doing foo" as sensible).
A few random reasons for not using serial the way we do it today:
1) Random console messages break our system of tracking state and invoking commands.
2) We could put pppd on the serial line to get early networking for our agent, and assume we can download stuff in the master image without ethernet (not that it would be very useful at that speed). We could use TCP to have a networked API on the master image.
3) The serial console slows things down a LOT. Check how fast you can boot without a serial console (hint: much faster). We can still keep all the logs around by other means.
Also, I don't see anything in your proposals that would get us away from having to talk to the bootloader over the serial line. Also getting the boot log for a failed boot seems somehow essential and I don't know how we can do that without a serial connection (this is different from running commands, though).
I'm not saying "we should not use the serial line". I'm saying "we should not use the serial line for everything in the most crude form possible".
For the time being (until I patch u-boot to talk to LAVA) the boot loader will stay as is. For the vast amount of wall clock time spent after the boot loader we can do smarter things without waiting for the sun to eclipse and origen networking to work.
You can send a series of commands to a device and get return codes back, without parsing, reliably. You can do structured logging (where the device keeps logs for each command it receives), and it will never be confused by funny output patterns. We can ask the device to reboot while other tasks are hanging. We can download stuff without putting wget on the board and piping it to tar, for crying out loud.
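A minimal sketch of the "return codes without parsing" idea, assuming the agent runs each command as a subprocess and keeps one structured record per command. The function names and record fields are hypothetical, not an existing LAVA API.

```python
import json
import subprocess
import sys


def run_command(argv):
    """Run one command and return a structured record, not raw console text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "argv": argv,
        "returncode": proc.returncode,  # exit status, no output scraping needed
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def run_series(commands):
    """Run commands in order, keeping a separate log record for each one."""
    return [run_command(argv) for argv in commands]


if __name__ == "__main__":
    records = run_series([
        [sys.executable, "-c", "print('hello')"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    ])
    # The records serialize cleanly, so they could be shipped over any transport.
    print(json.dumps(records, indent=2))
```

Because each record carries its own output and exit status, stray console messages cannot be mistaken for a command result.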
We need an agent on the board instead of a random master image+serial shell. The agent will expose board identity, capabilities and standard APIs to LAVA (notably the dispatcher). The same API, if done sensibly, will work for software emulators and hardware boards. Agent API for a software emulator can do different things. Dispatcher should be based on agent API instead of ramming the serial line.
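To make the "same API for emulators and hardware" point concrete, here is a rough sketch. The class and method names (Board, identity, capabilities, reboot, reset_board) are assumptions made for illustration; nothing like this is defined in the thread.

```python
from abc import ABC, abstractmethod


class Board(ABC):
    """Hypothetical board API; method names are illustrative assumptions."""

    @abstractmethod
    def identity(self):
        """Return the board's unique identity string."""

    @abstractmethod
    def capabilities(self):
        """Return a set of capability names."""

    @abstractmethod
    def reboot(self):
        """Restart the board."""


class EmulatedBoard(Board):
    """Software emulator: 'reboot' only counts restarts in this sketch."""

    def __init__(self, name):
        self._name = name
        self.boot_count = 1

    def identity(self):
        return self._name

    def capabilities(self):
        return {"reboot"}

    def reboot(self):
        self.boot_count += 1


class HardwareBoard(Board):
    """Physical board: 'reboot' delegates to a power-control callable."""

    def __init__(self, identity, power_cycle):
        self._identity = identity
        self._power_cycle = power_cycle

    def identity(self):
        return self._identity

    def capabilities(self):
        return {"reboot", "serial"}

    def reboot(self):
        self._power_cycle()


def reset_board(board):
    """Dispatcher-side code that relies only on the shared API."""
    if "reboot" not in board.capabilities():
        raise RuntimeError("board %s cannot reboot" % board.identity())
    board.reboot()
    return board.identity()
```

The dispatcher-side function never needs to know which kind of board it was handed; only the implementation behind the API differs.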
Well, I just rewrote chunks of the dispatcher to work for software emulators, albeit taking a different approach. Not sure the approach you propose is really any different, although perhaps it would be easier to distribute to different machines.
I don't want to belittle your work. What I'm doing here (apart from hand waving and shouting) is discussing how it should work to be more reliable and future proof. I'm sure that implementing this will take a lot of time in practice and that dispatcher maintenance is as relevant as it was yesterday. I need to dig deeper into the current dispatcher code to be able to judge this. Still, I think that the dispatcher is orthogonal. You can build the dispatcher on top of what it currently does or on top of a board API object. Both code variants can coexist for a long while.
- The master image, as we know it today, should be booting remotely.
The boot loader can stay on the board until we can push it over USB. The only thing that absolutely has to stay on the card is the LAVA board identity file, which would be generated from the web UI. There is no reason to keep rootfs/kernel/initrd there. This means that a single small card can fit all tests as well. It also means we can reset the master image (as currently it is writeable by the board and can be corrupted) before booting to ensure consistent behaviour. I did some work on that and I managed to boot panda over NFS. Ideally I want to boot over nbd (network block device), which is much faster, and with a proper "master image" init script we can expose a single read-only network block device to _all_ the boards.
This sounds good.
- With an agent on each board and an identity file on the SD card, LAVA will
know if cloning happened. We could do dynamic board detection (unplug the board -> it goes away, plug it back -> it shows up). We could move a board from system to system and have 0config transitions.
I'm not sure about this though. How do you tell the difference between the agent going away because booting into the test image failed and it being unplugged at a particular time?
Good point. The state of a device is a little bit more complicated than I presented. I wanted to point out that we could do discovery in a reliable way, something that we currently cannot do (and this prevents us from having foolproof provisioning of additional (or very first) devices).
For actual state we'd still have a few "in flux" moments like when doing a power cycle, transitioning from boot loader to kernel+userspace context etc.
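One way to picture the distinction being discussed is a small classification function. This is only a sketch: the state names and the three inputs are assumptions, and a real implementation would need timeouts around the "in flux" windows.

```python
from enum import Enum


class DeviceState(Enum):
    ONLINE = "online"
    IN_FLUX = "in flux"            # power cycle, boot loader -> kernel, etc.
    NOT_RESPONDING = "not responding"
    UNPLUGGED = "unplugged"


def classify(agent_responding, link_present, transition_expected):
    """Guess the device state from three hypothetical observations.

    agent_responding:    the on-board agent answered recently
    link_present:        the physical link (e.g. USB) is still detected
    transition_expected: we asked for a reboot or power cycle ourselves
    """
    if not link_present:
        # No physical link: the device really went away.
        return DeviceState.UNPLUGGED
    if agent_responding:
        return DeviceState.ONLINE
    if transition_expected:
        # Silence is normal while we are switching contexts.
        return DeviceState.IN_FLUX
    # Link is there, nobody asked for a transition, yet the agent is silent:
    # most likely a failed boot or a hung system.
    return DeviceState.NOT_RESPONDING
```

The point is that a missing agent alone is ambiguous; combined with the link state and what we asked the board to do, it becomes much less so.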
As for totally unplugging devices: if you require a USB connection then you know your device went away ;-) That's what most people will do (one device + laptop) and that's what we'll eventually have to do (no dedicated serial / ethernet on devices, everything muxed through USB). Snowball is just a very simple example of that.
- Dispatcher should drop all configuration files. Sure it made sense
12 months ago when the idea was to run it standalone. Now all of that configuration should be in the database and should be provided by the scheduler to the dispatcher as a big serialized argument (or a file descriptor or a temporary file on disk). Setting up the dispatcher for a new instance is a pain and unless you can copy stuff from the validation server and ask everyone around for help it's very hard to get right.
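As a rough illustration of handing configuration to the dispatcher as one serialized blob, the sketch below uses JSON in a temporary file. The key names (device, device_type, serial, power) are made up for the example and do not reflect the actual dispatcher configuration.

```python
import json
import tempfile


def write_job_config(config):
    """Scheduler side: serialize the config and return the temp file path."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(config, f)
        return f.name


def load_job_config(path):
    """Dispatcher side: read back everything it needs from a single file."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # Hypothetical values that would come from the database.
    config = {
        "device": "panda01",
        "device_type": "panda",
        "serial": {"host": "localhost", "port": 2001},
        "power": {"service": "lava-power"},
    }
    path = write_job_config(config)
    print(load_job_config(path)["device_type"])
```

With this shape the dispatcher needs no configuration files of its own; whatever the scheduler passes in is the whole truth for that job.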
If you're using a type of board that has support 'upstream' it's actually pretty easy, you basically just need to create a file per device that indicates which type it is.
That's good.
Apart from the fact that it's all a bit all over the place, I don't see how setting up things in the django admin interface is actually easier than setting it up in the filesystem.
It is not easier, except that you can do the UI in Django and then touching the filesystem directly is not an option. I want to get to a point where I can click through some wizards to get my panda working without having to open a console. With a few extra services the system will even _tell_ you that you've got a panda plugged in that needs provisioning.
Having said all of that, I agree with this goal :)
If master images could be constructed programmatically, with an agent on each "master image", LAVA would just get that configuration for free.
- We should drop conmux. As we already have TCP/IP sockets in the lab
for the serial lines, we could just provide my example serial->tcp script as a lava-serial service that people with directly attached boards would use. We could get a similar lava-power service if that would make sense. The lava-serial service could be started as an instance for all USB/SERIAL adapters plugged in if we really wanted (hello upstart!). The lava-power service would be custom and would require some config, but it is very rare. Only the lab and I have something like that. Again, it should be instance based IMHO so I can say: 'start lava-power CONF=/etc/lava-power/magic-hack.conf' and see LAVA know about a power service. One could then say that a particular board uses a particular serial and power service.
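For readers who have not seen such a bridge, the device-to-network half can be sketched as below. This is not the example script mentioned above, just an illustration under simple assumptions: the serial device is already open as a file descriptor, a client socket is already connected, and the function name pump is hypothetical.

```python
import os
import socket


def pump(src_fd, dst_sock, chunk_size=4096):
    """Copy bytes from a file descriptor to a socket until EOF.

    Returns the number of bytes forwarded. A real service would also
    forward in the other direction and accept clients on a listening port.
    """
    total = 0
    while True:
        data = os.read(src_fd, chunk_size)
        if not data:
            # Empty read means the source was closed.
            return total
        dst_sock.sendall(data)
        total += len(data)


if __name__ == "__main__":
    # Stand-ins for a serial device and a TCP client: a pipe and a socket pair.
    read_fd, write_fd = os.pipe()
    near, far = socket.socketpair()
    os.write(write_fd, b"console output\n")
    os.close(write_fd)
    pump(read_fd, near)
    print(far.recv(64))
```

Since the bridge only moves bytes, it carries no state that can wedge the way a multiplexing daemon can.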
I agree here. conmux is useful, but we don't need the 'mux' part at all, and I find myself restarting the daemon all the damn time just to get it working again.
I had the same experience during my (very brief) contact with this.
Thanks ZK