Master images are a mess

7 Dec 2011


Hi, sorry for the topic, I wanted to catch your attention.
This is a quick brain dump based on my own observations of (and battle
with) master images last week.
1) Unless we use external USB/ETH adapters, cloning a master image
clones the MAC address as well. This has serious consequences and I'm
100% sure that's why lava-test had to be switched to the random UUID
mode. This problem applies to the master image mode. In the test image
the software can do anything, so we may run with a random MAC or with
the MAC that the master image's boot loader set (we should check that).
Since making master images is a mess, unless it becomes automated I
will not be convinced that people actually know how to make them
properly and are not simply copying from someone. There is no
reproducible master image creation process that ensures two people
with the same board can run a single test in a reproducible way!
(different starting rootfs/bootloader/package selection/random
mistakes)
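
On the "random MAC" option from point 1: a minimal sketch of how a
test image could pick one, assuming nothing about what lava-test or the
boot loader actually do today. The function name random_local_mac is
made up for illustration. The only hard fact used is that a MAC with
bit 0x02 set and bit 0x01 clear in the first octet is a locally
administered unicast address, so it cannot collide with a vendor one.

```python
import random


def random_local_mac(rng=None):
    """Return a random unicast, locally administered MAC address."""
    rng = rng or random.SystemRandom()
    octets = [rng.randrange(256) for _ in range(6)]
    # Set the "locally administered" bit (0x02) and clear the
    # multicast bit (0x01) in the first octet.
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join("%02x" % octet for octet in octets)
```

Two cloned boards calling this independently would still collide with
a tiny probability, so this only papers over the cloning problem; it
does not replace a real board identity.
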
2) Running code via serial on the master image is a mess. It is very
fragile. We need an agent on the board instead of a random master
image + serial shell. The agent will expose board identity,
capabilities and standard APIs to LAVA (notably the dispatcher).
The same API, if done sensibly, will work for software emulators and
hardware boards. The agent for a software emulator can implement that
API differently under the hood. The dispatcher should be built on the
agent API instead of ramming the serial line.
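
To make point 2 a bit more concrete, here is a rough sketch of what
such an agent API could look like. Every name in it (BoardAgent,
HardwareAgent, EmulatorAgent, run_test, the "run" capability) is
hypothetical; none of this is existing LAVA code. It only illustrates
the idea that the dispatcher talks to one interface while hardware and
emulator agents implement it differently.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class BoardAgent(ABC):
    """Hypothetical agent interface the dispatcher would talk to."""

    @abstractmethod
    def identity(self):
        """Return the unique board identity string."""

    @abstractmethod
    def capabilities(self):
        """Return a set of capability names, e.g. {"run"}."""

    @abstractmethod
    def run(self, argv):
        """Run a command and return (exit_code, stdout)."""


class HardwareAgent(BoardAgent):
    """Runs on a real board and executes commands locally."""

    def __init__(self, board_id):
        self._board_id = board_id

    def identity(self):
        return self._board_id

    def capabilities(self):
        return {"run"}

    def run(self, argv):
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              universal_newlines=True)
        return proc.returncode, proc.stdout


class EmulatorAgent(BoardAgent):
    """Same API, different internals: records commands, fakes success."""

    def __init__(self, board_id):
        self._board_id = board_id
        self.history = []

    def identity(self):
        return self._board_id

    def capabilities(self):
        return {"run"}

    def run(self, argv):
        self.history.append(list(argv))
        return 0, ""


def run_test(agent, argv):
    """Dispatcher-side code: sees only the agent API, no serial line."""
    if "run" not in agent.capabilities():
        raise RuntimeError("%s cannot run commands" % agent.identity())
    return agent.run(argv)


if __name__ == "__main__":
    # The same dispatcher call works against either kind of agent.
    print(run_test(EmulatorAgent("emu-01"), ["uname", "-a"]))
    print(run_test(HardwareAgent("panda-01"),
                   [sys.executable, "-c", "print('hello')"]))
```

How the agent would be reached (TCP, USB, something else) is left open
here on purpose; the sketch only covers the shape of the API.
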
3) The master image, as we know it today, should be booted remotely.
The boot loader can stay on the board until we can push it over USB.
The only thing that absolutely has to stay on the card is the LAVA
board identity file, which would be generated from the web UI. There
is no reason to keep rootfs/kernel/initrd there. This means that a
single small card can fit all tests as well. It also means we can
reset the master image (currently it is writable by the board and can
be corrupted) before booting to ensure consistent behaviour. I did
some work on that and I managed to boot panda over NFS. Ideally I want
to boot over NBD (network block device), which is much faster, and
with a proper "master image" init script we can expose a single
read-only network block device to _all_ the boards.
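
For the NFS variant of point 3, the kernel side mostly comes down to
the boot arguments. Below is a small sketch that builds them using the
standard Linux NFS-root parameters (root=/dev/nfs, nfsroot=, ip=) plus
ro for a read-only root. The helper name nfs_bootargs, the server
address and the export path are made up for illustration, and this is
not necessarily the exact command line used for the panda experiment
mentioned above.

```python
def nfs_bootargs(server_ip, export_path, console=None):
    """Build a kernel command line for a read-only NFS root."""
    args = [
        "root=/dev/nfs",
        "nfsroot=%s:%s" % (server_ip, export_path),
        "ip=dhcp",  # let the kernel configure the network itself
        "ro",       # mount the shared master image read-only
    ]
    if console:
        args.insert(0, "console=%s" % console)
    return " ".join(args)


if __name__ == "__main__":
    # Example values only; adjust to the local lab network.
    print(nfs_bootargs("192.168.1.10", "/srv/lava/master-image"))
```

An NBD root would need an initrd that attaches the network block
device before switching root, so it is not just a matter of changing
these arguments.
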
4) With an agent on each board and an identity file on the SD card,
LAVA will know if cloning happened. We could do dynamic board
detection (unplug the board -> it goes away, plug it back -> it shows
up). We could move a board from system to system and have zero-config
transitions.
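
A minimal sketch of the bookkeeping behind point 4, assuming the
identity file simply holds a random UUID and that each agent reports
that identity together with some address it connects from. The names
new_identity, BoardRegistry and CloneDetected are hypothetical. The
idea: one identity showing up from two places at the same time means
the card was cloned, while unplug followed by plug elsewhere is a
legitimate move.

```python
import uuid


class CloneDetected(Exception):
    """Raised when one identity is online from two places at once."""


def new_identity():
    """Generate the contents of a board identity file (a random UUID)."""
    return str(uuid.uuid4())


class BoardRegistry:
    """Tracks which board identities are currently plugged in."""

    def __init__(self):
        self._online = {}  # identity -> address of the agent using it

    def plug(self, identity, agent_address):
        holder = self._online.get(identity)
        if holder is not None and holder != agent_address:
            raise CloneDetected(
                "identity %s is already online at %s" % (identity, holder))
        self._online[identity] = agent_address

    def unplug(self, identity):
        self._online.pop(identity, None)

    def online(self):
        return sorted(self._online)
```

This only catches clones that are online simultaneously; detecting a
clone that replaced the original would need an extra hardware
fingerprint stored next to the identity.
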
5) The dispatcher should drop all configuration files. Sure, it made
sense 12 months ago when the idea was to run it standalone. Now all of
that configuration should be in the database and should be provided by
the scheduler to the dispatcher as a big serialized argument (or a
file descriptor or a temporary file on disk). Setting up the
dispatcher for a new instance is a pain, and unless you can copy stuff
from the validation server and ask everyone around for help it's very
hard to get right. If master images could be constructed
programmatically, with an agent on each "master image", LAVA would
just get that configuration for free.
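
The hand-off described in point 5 could be sketched roughly like
this, assuming JSON as the serialization and a file-like object as the
transport (stdin, a file descriptor or a temporary file all fit). The
function names and every field name (hostname, serial, test) are
invented for illustration and do not match the current dispatcher
config format.

```python
import io
import json


def build_dispatcher_config(board_row, job):
    """Scheduler side: turn database rows into one serialized blob."""
    return json.dumps({"board": board_row, "job": job}, sort_keys=True)


def dispatcher_main(stream):
    """Dispatcher side: read everything it needs from the stream.

    No config files on disk are consulted.
    """
    config = json.load(stream)
    board = config["board"]
    return "running %s on %s via %s" % (
        config["job"]["test"], board["hostname"], board["serial"])


if __name__ == "__main__":
    blob = build_dispatcher_config(
        {"hostname": "panda01", "serial": "tcp://localhost:2001"},
        {"test": "stream"})
    # In real use the scheduler would hand the blob over on stdin or
    # in a temporary file; StringIO stands in for that here.
    print(dispatcher_main(io.StringIO(blob)))
```
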
6) We should drop conmux. Since the lab already has TCP/IP sockets
for the serial lines, we could just provide my example serial->tcp
script as a lava-serial service that people with directly attached
boards would use. We could have a similar lava-power service if that
would make sense. The lava-serial service could be started as an
instance for each USB/serial adapter plugged in if we really wanted
(hello upstart!). The lava-power service would be custom and would
require some config, but such setups are very rare. Only the lab and I
have something like that. Again, it should be instance based IMHO, so
I can say: 'start lava-power CONF=/etc/lava-power/magic-hack.conf' and
have LAVA know about a power service. One could then say that a
particular board uses a particular serial service and a particular
power service.
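
For readers who have not seen a serial->tcp bridge, here is a
stdlib-only sketch of the general technique. It is not the example
script referred to in point 6, and the names bridge and serve are made
up. It assumes a Unix system, and it does not configure the serial
port itself (baud rate and friends would have to be set beforehand,
for example with stty).

```python
import os
import select
import socket


def bridge(fd, conn):
    """Relay bytes both ways between a device fd and a TCP connection.

    Returns True if the client hung up, False if the device did.
    """
    while True:
        readable, _, _ = select.select([fd, conn], [], [])
        if fd in readable:
            data = os.read(fd, 4096)
            if not data:
                return False
            conn.sendall(data)
        if conn in readable:
            data = conn.recv(4096)
            if not data:
                return True
            while data:  # os.write may accept only part of the buffer
                data = data[os.write(fd, data):]


def serve(device_path, port, host="127.0.0.1"):
    """Expose one serial device on a TCP port, one client at a time."""
    fd = os.open(device_path, os.O_RDWR | os.O_NOCTTY)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    try:
        device_alive = True
        while device_alive:
            conn, _ = server.accept()
            with conn:
                device_alive = bridge(fd, conn)
    finally:
        server.close()
        os.close(fd)
```

One instance of serve per adapter would match the "instance for each
USB/serial adapter" idea; the device path and port would come from the
service instance's parameters.
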
That's it.
Best regards
ZK
