I've registered another blueprint, this time on LAVA Dispatcher. The goal is similar to the previous blueprint about the device manager: to gather feedback and to propose some small-step sub-blueprints that could be scheduled in 2012.04.
The general goal is to improve the way we run tests by making them more reliable and more featureful (richer in state) at the same time.
Please read and comment on the mailing list
https://blueprints.launchpad.net/lava-dispatcher/+spec/lava-dispatcher-de-emphasize-serial
Thanks ZK
I like this in general, but some things to think about:
- Raw serial log is *hugely* important. Without it, some kernel bugs will simply not be visible reliably.
- I'm starting to be convinced that we should have to depend on working ethernet - this is already the case with android.
- Speaking of android... how does this affect testing on android? It sounds as if it may be geared more towards ubuntu image testing.
- If we require ethernet, what happens when we want to do an ethernet enablement test (this should be coming soon) that wants to bring up/down ethernet interfaces and connect/disconnect them?
Thanks, Paul Larson
On 27.03.2012 06:46, Paul Larson wrote:
I like this in general, but some things to think about:
- Raw serial log is *hugely* important. Without it, some kernel bugs
will simply not be visible reliably.
And we'll keep grabbing them as we do today. It is just the case that:
1) They will not contain test output, so kernel bumps will stand out
2) If we capture them unreliably, as we do today, then nothing breaks
- I'm starting to be convinced that we should have to depend on working
ethernet - this is already the case with android
This proposal does not depend on working ethernet on the test image
- Speaking of android... how does this affect testing on android? It
sounds as if it may be geared more towards ubuntu image testing
I need to check this but I suspect we can implement an imperative agent (shell, or even Java if needed) that helps us run Android early initialization. AFAIK most of the work is already performed via adb.
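To make that concrete, a minimal, hypothetical sketch of driving such early initialization over adb from Python; the helper names and the exact command sequence are illustrative, not a decided interface:

    import subprocess

    def adb(serial, *args):
        """Run one adb command against the given board and return its output."""
        # Assumes adb is on PATH and the board shows up in "adb devices".
        return subprocess.check_output(["adb", "-s", serial] + list(args))

    def android_early_init(serial):
        adb(serial, "wait-for-device")  # block until the board enumerates
        # Illustrative step only; the real early-initialization sequence is TBD.
        adb(serial, "shell", "getprop", "ro.build.version.release")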
- If we require ethernet, what happens when we want to do an ethernet
enablement test (this should be coming soon) that wants to bring up/down ethernet interfaces and connect/disconnect them?
As above, we don't require ethernet on the test image. If the test starts messing with ethernet then we simply lose the real-time log streaming. In either case the log is preserved and is recovered after reboot from the master image.
Thanks ZK
I am a bit concerned that we add more architectural complexity to LAVA by moving to ethernet and then implementing fallbacks and special cases for the times when ethernet doesn't work...
If you only had serial, what can be done? What problems need to be solved?
Are there other options?
On 27.03.2012 13:13, Alexander Sack wrote:
I am a bit concerned that we add more architectural complexity to LAVA by moving to ethernet and then implementing fallbacks and special cases for the times when ethernet doesn't work...
LAVA is unreliable today; this will fix it.
The base case assumes nothing: everything is local. As a special exception, if Ethernet works then you also get _live_ log streaming. Normally the complete set of data would be recovered from the master image after the test is done and sent, reliably, to the system over Ethernet.
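In code the idea could look roughly like this (a sketch only -- the class name, host/port and file path are assumptions): every line is persisted locally, and streaming is strictly best-effort on top.

    import socket

    class LogSink(object):
        """Keep the authoritative log on the board; stream it as a bonus."""

        def __init__(self, path, host="dispatcher", port=5555):
            self.log = open(path, "ab")  # authoritative copy, always written
            try:
                self.sock = socket.create_connection((host, port), timeout=5)
            except socket.error:
                self.sock = None  # no working Ethernet: silently degrade

        def write(self, line):
            self.log.write(line)  # never lost; pulled back via the master image
            self.log.flush()
            if self.sock is not None:
                try:
                    self.sock.sendall(line)  # live view only; loss is acceptable
                except socket.error:
                    self.sock = None  # e.g. the test brought eth0 down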
If you only had serial, what can be done? What problems need to be solved?
Serial is unreliable for us today. We see consistent data loss and we lose a lot of jobs as a consequence. This is because almost any serial corruption is fatal in our current environment.
Still, this is irrelevant. Even without serial the improvement is clear. The test execution is less interactive with my proposal (only the bootloader has to be scripted). Currently _everything_ is scripted on the root-logged-in serial line. That adds a lot of flakiness and a single missed byte can confuse the system.
Are there other options?
I think this one is very good. The alleged complexity is really not there if you look at how complex our current setup is.
Thanks ZK
On Mon, 26 Mar 2012 19:55:59 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
I've registered another blueprint, this time on LAVA Dispatcher. The goal is similar to the previous blueprint about the device manager: to gather feedback and to propose some small-step sub-blueprints that could be scheduled in 2012.04.
The general goal is to improve the way we run tests by making them more reliable and more featureful (richer in state) at the same time.
Please read and comment on the mailing list
https://blueprints.launchpad.net/lava-dispatcher/+spec/lava-dispatcher-de-emphasize-serial
I basically like this. Let's do it.
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
The only thing I'd change is that I don't really see a reason to *not* spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events are based on.
Cheers, mwh
On 29.03.2012 06:33, Michael Hudson-Doyle wrote:
I basically like this. Let's do it.
Glad to hear that.
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
I'll write a more concrete proposal. I'm +1 on starting small and doing iterations but I want to make it clear that the goal is to have something entirely different than what we have today, we'll start with the dispatcher source code but let's keep our heads open ;-)
I'd like to lay down a plan on how the implementation will evolve, at each milestone we should be good to deploy this to production with full confidence.
+0.0 (current dispatcher tree)
+0.1, replace external programs with lava-serial; serial config is now constrained to a serial class (direct, networked) plus constructor data
Side note: with lava-device-manager the dispatcher would get this from the outside and would be thus 100% config-free
+0.2, add a mini master agent to the master rootfs and make it accept shell commands over IP; the master image is scripted with one RPC method similar to subprocess.Popen() (a rough sketch follows below). The dispatcher learns the board's IP over serial.
+0.3, add an improved master image agent with extra specialized methods for deployment; no shell during image deployment (download and copy-to-partition driven from Python)
+0.4, add mini test agent to test image before reboot, mounts testrootfs and unpacks a tarball from master image (so that agent code is synchronized to master image version), test agent supplements current serial scripted code with simple methods (IP discovery, maybe shell execution as in +0.2)
+0.5, test agent drives the whole test process, dispatcher job copied by the master agent, data saved to testrootfs partition (TODO: maybe we should pick a better location?)
+0.6, master agent takes over the part of sending the data back to the dispatcher, no hacky/racy webservers, clean code on both ends
How does this sound? I just wrote it off the top of my head, no deeper thoughts yet.
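To illustrate the +0.2 step mentioned above, a minimal sketch of what the mini master agent could be; XML-RPC, the port number and the method name are all placeholder choices, not part of the plan:

    import subprocess
    from SimpleXMLRPCServer import SimpleXMLRPCServer  # xmlrpc.server on Python 3

    def run(args):
        """Execute one command on the master image, subprocess.Popen style."""
        proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        stdout, stderr = proc.communicate()
        # Decode so the result marshals cleanly over XML-RPC.
        return (proc.returncode,
                stdout.decode("utf-8", "replace"),
                stderr.decode("utf-8", "replace"))

    server = SimpleXMLRPCServer(("0.0.0.0", 8021))  # port is an arbitrary pick
    server.register_function(run)
    server.serve_forever()

The dispatcher side would then be a plain xmlrpclib.ServerProxy call against the IP learned over serial.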
The only thing I'd change is that I don't really see a reason to *not* spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events are based on.
Thinking about it, yeah, we may just spam the serial line for now. Ideally I'd like to be able to get perfect separation of sources without losing correlated time. Imagine a scheduler page that has filters for kernel messages (without relying on flaky pattern matching) and can highlight them perfectly in the application run log.
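One possible shape for that (an assumption, not a decided format): tag every captured line with a timestamp and a source, e.g. as JSON lines, so the UI can filter without pattern matching.

    import json
    import time

    def emit(stream, source, text):
        # One record per captured line; "source" might be "kernel" or "test".
        stream.write(json.dumps({"ts": time.time(),
                                 "source": source,
                                 "text": text}) + "\n")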
ZK
On Thu, 29 Mar 2012 11:07:32 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
I basically like this. Let's do it.
Glad to hear that.
:)
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
I'll write a more concrete proposal. I'm +1 on starting small and doing iterations but I want to make it clear that the goal is to have something entirely different than what we have today, we'll start with the dispatcher source code but let's keep our heads open ;-)
The one part of this that we can't really take this attitude on is the interface we present to our users, i.e. the format of the job file. Do we want to/need to change that? Can we provide a transition plan if we do need to change it?
I'm not amazingly attached to the current format, it's a bit procedural and redundant in some ways; a format that allows our users to express what they _mean_ a bit more would be good -- I think with the current format we're showing our guts to the world a bit :) Slightly separate discussion, I suppose.
I'd like to lay down a plan on how the implementation will evolve, at each milestone we should be good to deploy this to production with full confidence.
I think there is a second side to incremental -- we don't want to have to redo the master image for each board for each iteration (and I don't see any reason below why we might have to, just making the point).
+0.0 (current dispatcher tree)
+0.1, replace external programs with lava-serial; serial config is now constrained to a serial class (direct, networked) plus constructor data
Side note: with lava-device-manager the dispatcher would get this from the outside and would be thus 100% config-free
+0.2, add a mini master agent to the master rootfs and make it accept shell commands over IP; the master image is scripted with one RPC method similar to subprocess.Popen(). The dispatcher learns the board's IP over serial.
+0.3, add an improved master image agent with extra specialized methods for deployment; no shell during image deployment (download and copy-to-partition driven from Python)
Yeah, so with the current reasons for health job failures, this is the bit I really want to see asap :)
+0.4, add mini test agent to test image before reboot, mounts testrootfs and unpacks a tarball from master image (so that agent code is synchronized to master image version), test agent supplements current serial scripted code with simple methods (IP discovery, maybe shell execution as in +0.2)
+0.5, test agent drives the whole test process, dispatcher job copied by the master agent, data saved to testrootfs partition (TODO: maybe we should pick a better location?)
I think the test agent should store the test results on its own rootfs -- what else could it do?
+0.6, master agent takes over the part of sending the data back to the dispatcher, no hacky/racy webservers, clean code on both ends
How does this sound? I just wrote it off the top of my head, no deeper thoughts yet.
I think it's basically fine.
The only thing I'd change is that I don't really see a reason to *not* spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events are based on.
Thinking about it, yeah, we may just spam the serial line for now. Ideally I'd like to be able to get perfect separation of sources without losing correlated time. Imagine a scheduler page that has filters for kernel messages (without relying on flaky pattern matching) and can highlight them perfectly in the application run log.
Oh yes, that would be totally awesome. But we shouldn't have a less useful page in the mean time for no good reason.
Cheers, mwh
On 30.03.2012 00:00, Michael Hudson-Doyle wrote:
I think we can implement this incrementally by making a LavaAgentUsingClient or something in the dispatcher, although we'll have to make changes to the dispatcher too -- for example, the lava_test_run actions become a little... different (I guess the job file becomes a little less imperative? Or maybe there is a distinction between actions that execute on the host and those that execute on the board? Or something else?). But nothing impossible.
I'll write a more concrete proposal. I'm +1 on starting small and doing iterations but I want to make it clear that the goal is to have something entirely different than what we have today, we'll start with the dispatcher source code but let's keep our heads open ;-)
The one part of this that we can't really take this attitude on is the interface we present to our users, i.e. the format of the job file. Do we want to/need to change that? Can we provide a transition plan if we do need to change it?
I don't think we need to change that yet. In fact, I don't want to allow changing that until we deal with this set of problems. Everything so far should be backwards compatible.
I'm not amazingly attached to the current format, it's a bit procedural and redundant in some ways; a format that allows our users to express what they _mean_ a bit more would be good -- I think with the current format we're showing our guts to the world a bit :) Slightly separate discussion, I suppose.
I agree with your points and I don't even mind discussing that now in a separate thread. I think the biggest issue currently is that people expect to express themselves with a GUI, documentation or examples. We kind of lack all three, beyond "look at the existing jobs".
I'd like to lay down a plan on how the implementation will evolve, at each milestone we should be good to deploy this to production with full confidence.
I think there is a second side to incremental -- we don't want to have to redo the master image for each board for each iteration (and I don't see any reason below why we might have to, just making the point).
Yeah, I knew you'd bring that up. For the moment I don't see a better way. I think we should rotate master images every month regardless. We may cheat this by just running an upgrade script on the master if that's a time saver. Right now I'd rather be correct than convenient.
+0.0 (current dispatcher tree)
+0.1, replace external programs with lava-serial; serial config is now constrained to a serial class (direct, networked) plus constructor data
Side note: with lava-device-manager the dispatcher would get this from the outside and would be thus 100% config-free
+0.2, add a mini master agent to the master rootfs and make it accept shell commands over IP; the master image is scripted with one RPC method similar to subprocess.Popen(). The dispatcher learns the board's IP over serial.
+0.3, add an improved master image agent with extra specialized methods for deployment; no shell during image deployment (download and copy-to-partition driven from Python)
Yeah, so with the current reasons for health job failures, this is the bit I really want to see asap :)
Anyone interested in fast models?
+0.4, add mini test agent to test image before reboot, mounts testrootfs and unpacks a tarball from master image (so that agent code is synchronized to master image version), test agent supplements current serial scripted code with simple methods (IP discovery, maybe shell execution as in +0.2)
+0.5, test agent drives the whole test process, dispatcher job copied by the master agent, data saved to testrootfs partition (TODO: maybe we should pick a better location?)
I think the test agent should store the test results on its own rootfs -- what else could it do?
A dedicated partition -- we still have two spare in the MBR extended table -- would have nice properties; it could even store data in looped mode. I don't want to add too many problems, but using the rootfs for that feels unsafe.
BTW: I'm not terribly up-to-date on Android tests, do they need to store log files on the target? If not we can just reuse the sdcard partition.
+0.6, master agent takes over the part of sending the data back to the dispatcher, no hacky/racy webservers, clean code on both ends
How does this sound? I just wrote it off the top of my head, no deeper thoughts yet.
I think it's basically fine.
Then let's start cranking out work items and blueprints for dedicated projects. Some of the work will happen in master image scripts, the rest can be in the dispatcher. I'm happy with having the master scripts simply pip install something from the dispatcher (hint: dispatcher[master-agent] maybe?)
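If we go the extras route, the packaging side is tiny -- a sketch of what the dispatcher's setup.py could grow, with the extra's name and its dependency list being guesses:

    from setuptools import setup

    setup(
        name="lava-dispatcher",
        # ... existing metadata ...
        extras_require={
            # installed with: pip install "lava-dispatcher[master-agent]"
            "master-agent": [
                # whatever the on-board agent actually needs, e.g. "lockfile"
            ],
        },
    )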
The only thing I'd change is that I don't really see a reason to *not* spam the test output over the serial line and show said spam in the scheduler web UI. We should also store it in neat files on the test image and make sure that those files are what the dashboard's view of events are based on.
Thinking about it, yeah, we may just spam the serial line for now. Ideally I'd like to be able to get perfect separation of sources without losing correlated time. Imagine a scheduler page that has filters for kernel messages (without relying on flaky pattern matching) and can highlight them perfectly in the application run log.
Oh yes, that would be totally awesome. But we shouldn't have a less useful page in the mean time for no good reason.
You're right, we'll keep the logs for now.
Thanks for the feedback ZK
Hi
definition: DUT = device under test
I don't agree with changing the dispatcher. The suggested solution will not solve the root problem LAVA is facing today (an unstable system) and also puts an unnecessary constraint on LAVA by only allowing DUT-based test scenarios (I mean tests that run isolated on the DUT).
There are many tests which require host/DUT communication during test execution. In those, the test scenario lives on the host side and sends commands to the target to perform actions on the DUT.
For example, if you test WLAN roaming, the test scenario is on the host, controlling both the WLAN simulator and the DUT.
Another example is multi-DUT tests.
The serial port problem is a side effect of the LAVA server being overloaded. Same with the "wget image" problem.
The solution is to not overload the LAVA server. Possible solutions are (see the sketch after this list):
- Make the scheduler more intelligent and schedule jobs out evenly (it makes no sense to start more jobs than the LAVA server can handle)
- Distribute heavy tasks to cloud instances
- Update lava-dispatcher to retry when some operations fail
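To make the first suggestion concrete, a minimal sketch of capping concurrency with a semaphore; the limit of 4 and the helper names are made up, and the real capacity would need measuring:

    import threading

    HEAVY_SLOTS = threading.BoundedSemaphore(4)  # assumed capacity

    def run_heavy(task):
        # Queue instead of overloading the server: at most 4 heavy jobs at once.
        with HEAVY_SLOTS:
            task()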
/Chi Thu
On 30.03.2012 03:26, Le.chi Thu wrote:
Hi
definition: DUT = device under test
I don't agree with changing the dispatcher. The suggested solution will not solve the root problem LAVA is facing today (an unstable system) and also puts an unnecessary constraint on LAVA by only allowing DUT-based test scenarios (I mean tests that run isolated on the DUT).
There are many tests which require host/DUT communication during test execution. In those, the test scenario lives on the host side and sends commands to the target to perform actions on the DUT.
In all of our current tests we don't really need to send anything, we just do because that's how we started.
For example, if you test WLAN roaming, the test scenario is on the host, controlling both the WLAN simulator and the DUT.
When we cross that bridge we can think about it. I'm not convinced that it cannot be done without talking to the machine that controls the DUTs. Remember that I only want to eradicate the absolute abuse of the serial line, not every means of communication. In a specialized test where you really, absolutely have to talk to the test controller, we could have a way of doing that. It still does not invalidate the generic pattern of copying the scenario to the test image and running it there via an agent.
Now our issues are:
1) Serial lines lose data.
2) In our current architecture that can mean losing the whole job (if unlucky).
3) We have very poor code running on the DUT (stuff like retrying HTTP, which would otherwise be easy -- see the sketch below) because it's basically limited to whatever we have in busybox/coreutils.
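For point 3, a small illustration of what a Python agent on the DUT buys us -- a retrying download is a few lines (urllib2 is the Python 2 stdlib module; the function name and retry counts are arbitrary):

    import time
    import urllib2  # urllib.request on Python 3

    def fetch(url, attempts=5, delay=10):
        """Download with retries -- trivial here, painful in bare busybox."""
        for attempt in range(attempts):
            try:
                return urllib2.urlopen(url, timeout=60).read()
            except IOError:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)  # transient overload: back off and try again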
Another example is multi-DUT tests.
No, that's perfectly possible. If the network works all devices are free to talk to one another. I just don't want them to need to talk to the dispatcher.
The serial port problem is a side effect of the LAVA server being overloaded. Same with the "wget image" problem.
Even if serial worked 100% reliably today I'd like to get rid of it as the architecture is flaky. Sending shell across the wire and hoping for the best is not the right way to do it.
The solution is to not overload the LAVA server. Possible solutions are:
- Make the scheduler more intelligent and schedule jobs out evenly (it makes no sense to start more jobs than the LAVA server can handle)
Since we don't know how much "too much" is, this will never solve anything.
- Distribute heavy tasks to cloud instances
This will happen anyway, we need to scale to other machines.
- Update lava-dispatcher to retry when some operations fail
In the current implementation you cannot sensibly retry stuff over shell. You need some glimpse of an API to even attempt that. It's like trying to run a reliable protocol over UDP messages without getting any ack from the other side. If we lose a byte in the middle of a command (or a whole block, and a ton of logging along with it) we just cannot assume it's safe to try again.
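To spell out why an API makes retry safe where a raw shell line cannot be: give every request an identity the other side acknowledges. A hypothetical sketch (the wire format and transport are illustrative only):

    import json
    import socket
    import uuid

    def call(sock, method, params, attempts=3, timeout=30):
        """Retry is safe only because each resend carries the same request id."""
        request_id = str(uuid.uuid4())
        wire = json.dumps({"id": request_id, "method": method,
                           "params": params}) + "\n"
        stream = sock.makefile("rw")
        sock.settimeout(timeout)
        for _ in range(attempts):
            stream.write(wire)
            stream.flush()
            try:
                reply = json.loads(stream.readline())
            except (socket.timeout, ValueError):
                continue  # no ack (or a mangled one): resending the same id is harmless
            if reply.get("id") == request_id:
                return reply.get("result")
        raise IOError("no acknowledgement after %d attempts" % attempts)

Over a serial shell there is no id and no ack, so a lost byte leaves us unable to tell whether the command ever ran.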
Thanks ZK
On Fri, 30 Mar 2012 03:26:56 +0200, "Le.chi Thu" le.chi.thu@linaro.org wrote:
The serial port problem is a side effect when LAVA server is overloaded. Same with the 'wget image' problem.
Are you sure about that? It sounds reasonable, but I don't see that it's proven in any sense.
Cheers, mwh
Hi
I have no proof, but the wget image problem appeared when we introduced health check jobs that deploy an image file instead of creating the image from hwpack + rootfs with l-m-c.
l-m-c serialized the previous version of the health check jobs, but the new health checks start 20+ parallel jobs, each unzipping image files, and that overloads the lava-server.
/Chi Thu