On Tue, Mar 20, 2012 at 12:52:48PM +0100, Zygmunt Krynicki wrote:
On 20.03.2012 11:48, Alexander Sack wrote:
On Tue, Mar 20, 2012 at 11:41:37AM +0100, Zygmunt Krynicki wrote:
Hi
Experimenting with the dispatcher made me realize that forced reboots (on timeouts, for example) are an excellent way to damage the master image. At the very best we are forced to re-check the master image. At the very worst we may damage the superblock and generally hose the master.
Do you think it is feasible to mount the master read-only and only do r/w work on the test partitions?
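A minimal sketch of what that could look like on the board, assuming the dispatcher simply shells out to mount; the test partition device and mount point are made-up placeholders, not existing dispatcher code:

import os
import subprocess

def mount_master_readonly(test_dev="/dev/mmcblk0p5"):
    # Remount the already-mounted master root read-only, so a forced
    # reboot or power cut cannot dirty its filesystem.
    subprocess.check_call(["mount", "-o", "remount,ro", "/"])
    # Keep all read-write work on the dedicated test partition
    # (the device name above is just an example).
    os.makedirs("/mnt/test", exist_ok=True)
    subprocess.check_call(["mount", "-o", "rw", test_dev, "/mnt/test"])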
I like this idea... Combined with always-poweroff-on-reboot, it feels like a good way to compensate for potential issues...
Just curious: why would we always power off on reboot? Do you mean actual power being cut, or the equivalent of poweroff(8)?
That has a different motivation. The key requirement for the automation infrastructure is to ensure that each individual test runs in a controlled environment, with as close to 100% reproducibility of the state as possible.
Soft-rebooting the unit is not guaranteed to bring you back to a known base state. Hence the requirement to always hard reboot, with proper time unpowered in between.
Take the approach known from live CDs into account, such as aufs/unionfs, and things should work well... Maybe the master image doesn't even need a partition anymore, but can be just a .img file on the FAT boot partition, just like how the Ubuntu live CD works...
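One way that union mount could look, assuming aufs is built into the master kernel; the image path and mount points are illustrative only:

import os
import subprocess

def union_mount_master(image="/boot/master.img"):
    for d in ("/mnt/master-ro", "/mnt/master-rw", "/mnt/master"):
        os.makedirs(d, exist_ok=True)
    # Loop-mount the master image read-only, straight from the boot partition.
    subprocess.check_call(["mount", "-o", "loop,ro", image, "/mnt/master-ro"])
    # All writes go to a tmpfs branch that vanishes on power-off.
    subprocess.check_call(["mount", "-t", "tmpfs", "tmpfs", "/mnt/master-rw"])
    # Stack the writable branch over the read-only master.
    subprocess.check_call([
        "mount", "-t", "aufs",
        "-o", "br=/mnt/master-rw=rw:/mnt/master-ro=ro",
        "none", "/mnt/master",
    ])

The tmpfs branch is what drives the memory question below: anything the master writes stays in RAM until the next power cycle.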
I wonder what the complexity of this approach is. I would also like to consider the memory requirements. As an alternative, we could try to mount the master image from NBD: the NBD server already supports "reverting to snapshot" and keeping a delta for each connected client in a temporary file (see the sketch below).
Considering that the master image is nano and boots to the console only, I don't think the memory requirements would exceed what we target the LAVA lab at.
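A rough sketch of the server side, assuming nbd-server's copy-on-write mode is what provides the per-client delta described above; the export path and config location are made up for illustration:

import subprocess
import textwrap

# Hypothetical export of the master image with per-client copy-on-write:
# each client's writes land in a temporary diff file that nbd-server
# discards when that client disconnects.
NBD_CONF = textwrap.dedent("""\
    [generic]
    [master]
        exportname = /srv/lava/master.img
        copyonwrite = true
""")

def serve_master(conf_path="/tmp/lava-nbd.conf"):
    with open(conf_path, "w") as f:
        f.write(NBD_CONF)
    # Point nbd-server at the generated config.
    subprocess.check_call(["nbd-server", "-C", conf_path])

Boards would then attach with nbd-client and mount the exported device, which is exactly the kind of extra moving part the replication concern below is about.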
What I don't like about NBD is that it makes the LAVA infrastructure more complex and harder to replicate. Every time we add a new server/service that isn't the image/board itself, we diverge a bit further from something that can be validated and released efficiently/effectively.
Can anyone think of a reason not to put that into the backlog?
If the LAVA team decides to investigate that path, please check with the DevPlatform team on how they can help...
I think we should seriously consider it as a milestone towards LAVA reliability and automation of master image construction.
Do we have a few empirical examples from the gathered list of LAVA incidents that allow us to identify changes to the master image (not talking about reproducibility here) as a recurring source of unreliability?