[be wary of the cross post when replying!]
Hi all,
Seeing as LAVA isn't going to support multi-node tests or highbank in the super near future, I spent a while today hacking up a script to run some tests automatically on the calxeda nodes in the lab. You can see it in all its gory detail here:
http://bazaar.launchpad.net/~mwhudson/+junk/highbank-bench-scripts/view/head...
(probably best read from the bottom up, come to think of it)
To do things like power cycle and prepare instances in parallel, it's written in an asynchronous style using the Twisted event-driven framework. This was a bit of an experiment and I'm not sure what I think of the result -- it's /reasonably/ clear and it works, but perhaps just using one thread per node being tested and writing blocking code (and using semaphores or whatever to synchronize) would have been clearer. So I guess before I do any more hacking like this, it would be good to hear what you guys (especially Ard I suppose!) think of this style.
In general, how to express a job that consists of a number of steps, some of which can be executed in parallel and some of which have dependencies on others, is an interesting question. I suppose my eventy version is more on the side of one-step-at-a-time by default with explicit parallelization, and threads + locks would be more on the side of parallel by default with explicit serialization. This has implications for how we write the job descriptions we give to a hypothetical multi-node-supporting LAVA -- has anyone thought about this yet? I think I prefer the explicit parallelism style myself (makes me think of Cilk and Grand Central Dispatch and CSP...).
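For a sense of what I mean by the explicit parallelism style, the shape of the Twisted version is roughly this (all the helpers here are made-up stand-ins, not the actual functions from the branch):

    from twisted.internet import defer, reactor, task

    def power_cycle(node):
        # stand-in for the real power-cycle/ssh helpers; just waits a second
        return task.deferLater(reactor, 1.0, lambda: None)

    wait_for_ssh = run_tests = power_cycle  # more stand-ins

    @defer.inlineCallbacks
    def test_one_node(node):
        # one step at a time *within* a node...
        yield power_cycle(node)
        yield wait_for_ssh(node)
        yield run_tests(node)

    def test_all_nodes(nodes):
        # ...with the parallelism *across* nodes made explicit
        d = defer.gatherResults([test_one_node(n) for n in nodes])
        d.addCallback(lambda _: reactor.stop())
        return d

    if __name__ == '__main__':
        reactor.callWhenRunning(test_all_nodes, ['node%d' % i for i in range(4)])
        reactor.run()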
Cheers, mwh
On Wed, Feb 20, 2013 at 07:56:26PM +1300, Michael Hudson-Doyle wrote:
[be wary of the cross post when replying!]
Hi all,
Seeing as LAVA isn't going to support multi-node tests or highbank in the super near future, I spent a while today hacking up a script to run some tests automatically on the calxeda nodes in the lab. You can see it in all its gory detail here:
http://bazaar.launchpad.net/~mwhudson/+junk/highbank-bench-scripts/view/head...
(probably best read from the bottom up, come to think of it)
To do things like power cycle and prepare instances in parallel, it's written in an asynchronous style using the Twisted event-driven framework. This was a bit of an experiment and I'm not sure what I think of the result -- it's /reasonably/ clear and it works, but perhaps just using one thread per node being tested and writing blocking code (and using semaphores or whatever to synchronize) would have been clearer. So I guess before I do any more hacking like this, it would be good to hear what you guys (especially Ard I suppose!) think of this style.
In general, how to express a job that consists of a number of steps, some of which can be executed in parallel and some of which have dependencies is an interesting one. I suppose my eventy one is more on the side of one-step-at-a-time by default with explicit parallelization and threads + locks would be more on the side of parallel by default and explicit serialization. This has implications for how we write the job descriptions we give to a hypothetical multi-node test supporting LAVA -- has anyone thought about this yet? I think I prefer the explicit parallelism style myself (makes me think of cilk and grand central dispatch and csp...).
My thoughts, from a LAVA standpoint.
This parallelism style is indeed very elegant, but I couldn't think of how we could take advantage of that in the existing LAVA infrastructure.
Maybe we could make the dispatcher spawn child dispatchers (one for each node involved in the test) and wait for all of them to finish.
Inside each child dispatcher invocation, there should be a primitive that says "wait until all my test buddies are ready" so that after flashing and booting, each one can perform its setup steps (i.e. the stuff we do before actually running tests) and wait for the others before executing its part in the distributed job. This communication might be coordinated by the "parent" dispatcher through signals. I'm not sure whether this primitive would be a new dispatcher action (and thus declared in the job description file), or a binary inside the target (and thus able to be invoked from inside lava-test-shell test runs), or both.
To describe the tests, I thought of adding a new alternative attribute to the job description called "device_group", mutually exclusive with "device_type". This description would include a list of device specifications, including their type and any tags indicating expected special capabilities. We can then tag all nodes inside a single calxeda box with the same tag (say "calxeda-box-1") and use that to request N devices in the same box for tests that require low-latency/high-bandwidth networking between the participants.
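Just to make that concrete, the job description could look something like the following (syntax entirely invented here, including the "count" field; the real attribute names would need discussion):

    {
        "job_name": "multinode-iperf",
        "device_group": [
            {
                "device_type": "highbank",
                "count": 3,
                "tags": ["calxeda-box-1"]
            }
        ],
        "actions": ["..."]
    }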
Are my thoughts too abstract for non-LAVA people?
Antonio Terceiro <antonio.terceiro@linaro.org> writes:
My thoughts, from a LAVA standpoint.
This parallelism style is indeed very elegant, but I couldn't think of how we could take advantage of that in the existing LAVA infrastructure.
Yeah, I guess the LAVA trend has been towards being more device-controlled (lava-test-shell and all that) and that doesn't really fit with the explicit parallelism style. Oh well. I'll get over it :-)
Maybe we could make the dispatcher spawn child dispatchers (one for each node involved in the test) and wait for all of them to finish.
I think on some level this model makes sense (whether it's subprocesses or threads or the dispatcher does some async stuff doesn't really matter for the mental model IMHO).
Inside each child dispatcher invocation, there should be a primitive that says "wait until all my test buddies are ready" so that after flashing and booting, each one can perform its setup steps (i.e. the stuff we do before actually running tests) and wait for the others before executing its part in the distributed job. This communication might be coordinated by the "parent" dispatcher through signals. I'm not sure whether this primitive would be a new dispatcher action (and thus declared in the job description file), or a binary inside the target (and thus able to be invoked from inside lava-test-shell test runs), or both.
I think ... perhaps both? It seems to me that the difference is around rebooting: (currently, anyway) a lava_test_shell action implies a reboot, and one thing a lava_test_shell-invoked script _cannot_ do (well, easily, there are probably hacks) is reboot. And I can just about imagine tests that might want to do some configuration that requires a reboot to take effect.
I think we should probably try to write some tests like my simple iperf test and see what API we would like.
Here's a fun problem: devices will need to know the IP addresses of the other devices in the test. I suppose we could delay starting the lava-test-shell processes on any device until they have all booted and acquired an IP address? Or we could run some service on the host running the dispatcher that can be queried and informed of IP addresses or something.
To describe the tests, I thought of adding a new alternative attribute to the job description called "device_group", mutually exclusive with "device_type". This description would include a list of device specifications, including their type and any tags indicating expected special capabilities. We can then tag all nodes inside a single calxeda box with the same tag (say "calxeda-box-1") and use that to request N devices in the same box for tests that require low-latency/high-bandwidth networking between the participants.
This part makes a lot of sense to me.
Are my thoughts too abstract for non-LAVA people?
I can't answer this question :-)
Cheers, mwh
I just find this thread excellent; please keep it up!
We can then wrap up with more feedback from members at LCA13. As Michael will be missing it, the best thing is to continue here as much as possible until Friday!!
/Andrea
--
Andrea Gallo
Director, Linaro Enterprise Group
email: andrea.gallo@linaro.org
mobile: +39 338 4075993
IRC: agallo@#linaro on irc.linaro.org
Skype: agallo70
On Mon, Feb 25, 2013 at 12:37:24PM +1300, Michael Hudson-Doyle wrote:
I think ... perhaps both? It seems to me that the difference is around rebooting: (currently, anyway) a lava_test_shell action implies a reboot, and one thing a lava_test_shell-invoked script _cannot_ do (well, easily, there are probably hacks) is reboot. And I can just about imagine tests that might want to do some configuration that requires a reboot to take effect.
A job that requires a reboot could declare the following:
- deploy image
- boot image
- lava-test-shell <- setup.yaml
- boot image
- lava-test-shell <- run.yaml
The run.yaml lava-test-shell definition could then just call a binary that implements "wait for buddies".
(This way we don't need a "wait for buddies" dispatcher action, just a binary that can be called by the test suite.)
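So run.yaml might be something along these lines (the lava-wait-for-buddies name and the $BUDDY_SERVER_IP variable are made up, and I'm writing the test definition format from memory, so don't hold me to the details):

    metadata:
        format: Lava-Test Test Definition 1.0
        name: multinode-iperf-run
    install:
        deps:
            - iperf
    run:
        steps:
            - lava-wait-for-buddies   # hypothetical binary provided by the dispatcher
            - iperf -c $BUDDY_SERVER_IP -t 60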
I think we should probably try to write some tests like my simple iperf test and see what API we would like.
Yep.
Here's a fun problem: devices will need to know the IP addresses of the other devices in the test. I suppose we could delay starting the lava-test-shell processes on any device until they have all booted and acquired an IP address? Or we could run some service on the host running the dispatcher that can be queried and informed of IP addresses or something.
the "wait for buddies" action could inform the node's IP to the dispatcher via a signal. When the dispatcher receives those from all nodes, then it sends a list of all IP's to each node. After receiving that list, the node can then write a /etc/hosts-like file with the IP's of the group that can be read by the tests scripts being run.
Antonio Terceiro <antonio.terceiro@linaro.org> writes:
A job that requires a reboot could declare the following:
- deploy image
- boot image
- lava-test-shell <- setup.yaml
- boot image
- lava-test-shell <- run.yaml
Yeah, that would work. It's kind of crummy in that there is a dependency between the structure of the job file and the repository with the yaml files in it, but as it's a bit of a special case...
The run.yaml lava-test-shell definition could then just call a binary that implements "wait for buddies".
(This way we don't need a "wait for buddies" dispatcher action, just a binary that can be called by the test suite.)
I think we should probably try to write some tests like my simple iperf test and see what API we would like.
Yep.
Here's a fun problem: devices will need to know the IP addresses of the other devices in the test. I suppose we could delay starting the lava-test-shell processes on any device until they have all booted and acquired an IP address? Or we could run some service on the host running the dispatcher that can be queried and informed of IP addresses or something.
the "wait for buddies" action could inform the node's IP to the dispatcher via a signal. When the dispatcher receives those from all nodes, then it sends a list of all IP's to each node. After receiving that list, the node can then write a /etc/hosts-like file with the IP's of the group that can be read by the tests scripts being run.
Ah yeah. Putting it in /etc/hosts would be a neat trick -- I'd been thinking anyway that the job file should give names to the nodes it requests (origin-server-1, origin-server-2, proxy-node, load-gen-1, load-gen-2...).
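i.e. the generated hosts file on each node could just be something like (addresses made up, names taken from the job file):

    192.168.1.11    origin-server-1
    192.168.1.12    origin-server-2
    192.168.1.13    proxy-node
    192.168.1.14    load-gen-1
    192.168.1.15    load-gen-2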
Cheers, mwh