On 30.03.2012 03:26, Le.chi Thu wrote:
Hi
definition: DUT = device under test
I do not agree with changing the dispatcher. The suggested solution will not solve the root problem LAVA is facing today (an unstable system), and it also puts an unnecessary constraint on LAVA by allowing only DUT-based test scenarios (I mean tests that run isolated on the DUT).
There are many tests that require host/DUT communication during test execution. In those, the test scenario actually lives on the host side and sends commands to the target to perform actions on the DUT.
In all of our current tests we don't really need to send anything; we just do it because that's how we started.
For example, if you test WLAN roaming, the test scenario is on the host, controlling both the WLAN simulator and the DUT.
When we cross that bridge we can think about it. I'm not convinced that it can't be done without talking to the machine that controls the DUTs. Remember that I only want to eradicate the absolute abuse of the serial line, not every means of communication. In a specialized test where you really, absolutely have to talk to the test controller, we could have a way of doing that. That still does not invalidate the generic pattern of copying the scenario to the test image and running it there via an agent.
Now our issues are:
1) Serial lines lose data.
2) In our current architecture that means losing the job (if unlucky).
3) We have very poor code running on the DUT (stuff like re-trying HTTP would otherwise be easy to perform) because it's basically limited to whatever we have in busybox/coreutils.
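To illustrate point 3: with a real interpreter on the DUT, re-trying a flaky download takes only a few lines. Below is a minimal sketch in Python, not existing LAVA code. The names `retry` and `fetch`, the attempt count and the delays are all assumptions made up for this example.

```python
import time
import urllib.request


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep,
          retry_on=(OSError,)):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    The last failure is re-raised so the caller can report it.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def fetch(url, timeout=30):
    """Download url and return the body as bytes.

    urllib raises URLError on network failures, which is a subclass of
    OSError, so retry() catches it with its default settings.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

An agent on the DUT could then call something like `retry(lambda: fetch(image_url))`, where `image_url` is whatever the job description supplies. Doing the same in busybox shell sent over the serial line is possible but much more fragile.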
Another example is multi-DUT tests.
No, that's perfectly possible. If the network works all devices are free to talk to one another. I just don't want them to need to talk to the dispatcher.
The serial port problem is a side effect of the LAVA server being overloaded. The same goes for the 'wget image' problem.
Even if serial worked 100% reliably today I'd like to get rid of it as the architecture is flaky. Sending shell across the wire and hoping for the best is not the right way to do it.
The solution is to not overload the LAVA server. Possible solutions are:
- Make the scheduler more intelligent so that it schedules jobs out evenly
(it makes no sense to start more jobs than the LAVA server can handle)
Since we don't know how much "too much" is, this will never solve anything.
- Distribute heavy task to cloud instances
This will happen anyway, we need to scale to other machines.
- Update lava-dispatcher to retry if some operations fail.
In the current implementation you cannot sensibly retry stuff over shell. You need at least some semblance of an API to even attempt that. It's like trying to build a reliable protocol on top of UDP messages without getting any ack from the other side. If we lose a byte in the middle of a command (or a whole block and a ton of logging along with it) we just cannot assume it's safe to try again.
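To make the ack point concrete, here is a minimal sketch in Python of what "safe to try again" needs: every command carries an id, the receiving side remembers which ids it has already executed, and the sender repeats the command until it sees an ack for that id. Everything here (`Agent`, `LossyLink`, `send_reliably`) is hypothetical and only simulates a lossy link in memory. It is not how lava-dispatcher works today.

```python
class Agent:
    """Hypothetical on-device agent: runs each command id at most once."""

    def __init__(self):
        self.results = {}    # command_id -> result, kept for repeated requests
        self.executed = []   # commands that were really run, in order

    def handle(self, command_id, command):
        if command_id not in self.results:
            self.executed.append(command)
            self.results[command_id] = "done: " + command
        # A repeated id gets the stored result back, without re-running.
        return command_id, self.results[command_id]


class LossyLink:
    """Simulated link that drops messages by a scripted pattern (True = drop)."""

    def __init__(self, agent, drops):
        self.agent = agent
        self.drops = list(drops)

    def _dropped(self):
        return self.drops.pop(0) if self.drops else False

    def send(self, command_id, command):
        if self._dropped():          # request lost on the way out
            return None
        reply = self.agent.handle(command_id, command)
        if self._dropped():          # ack lost on the way back
            return None
        return reply


def send_reliably(link, command_id, command, attempts=5):
    """Repeat the command until an ack with a matching id comes back."""
    for _ in range(attempts):
        reply = link.send(command_id, command)
        if reply is not None and reply[0] == command_id:
            return reply[1]
    raise TimeoutError("no ack for command %r" % (command_id,))
```

With raw shell over serial there is neither the id nor the ack, so the sender cannot tell a lost request from a lost reply, and repeating the command may run it twice.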
Thanks
ZK