Hi,
I've been looking at moving my ghetto multinode stuff over to proper LAVA multinode on and off for a while now, and have something that I'm still not sure how best to handle: result aggregation.
The motivating case here is having load generation distributed across various machines: to compute the req/s the server is actually able to manage I want to add up the number of requests each load generator made.
I can sort of see how to do this myself, basically something like this:
1. store the data on each node
2. arbitrarily pick one node to be the one that does the aggregation
3. do tar | nc style things to get the data onto that node
4. analyze it there and store the results using lava-test-case
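Concretely, for steps 3 and 4 I'm imagining something roughly like this (an untested sketch; the port, paths and $AGG_IP are placeholders, and nc option spellings differ between netcat variants):

    # on the aggregation node, receive one load generator's results:
    nc -l -p 9000 | tar -x -C /tmp/results/node01

    # on each load generator, send its results directory:
    tar -c -C /var/lib/loadgen results | nc $AGG_IP 9000

    # back on the aggregation node, sum the per-node counts and record the total:
    TOTAL=$(cat /tmp/results/*/results/req-count | awk '{s += $1} END {print s}')
    lava-test-case total-requests --result pass --measurement "$TOTAL" --units requests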
but I was wondering if the LAVA team have any advice here. In particular, steps 2. and 3. seem like something it would be reasonable for LAVA to provide helpers to do.
Cheers, mwh
On Wed, 27 Nov 2013 13:56:42 +1300 Michael Hudson-Doyle michael.hudson@linaro.org wrote:
I've been looking at moving my ghetto multinode stuff over to proper LAVA multinode on and off for a while now, and have something that I'm still not sure how best to handle: result aggregation.
MultiNode result bundle aggregation combines completed results after all test cases have run (specifically, during the submit_results action), at which point no further actions will be executed. Aggregation itself happens off device, not even on the dispatcher, it happens on the server. This allows each node to send their result bundle as normal (via the dispatcher over XMLRPC) and it is only the subid-zero job which needs to hang around waiting for other nodes to submit their individual results.
My question is: exactly what analysis are you needing to do *on the device under test* and can that be done via filters and image reports on the server?
If the analysis involves executing binaries compiled on the device, then that would be a reason to copy the binaries between nodes using TCP/IP (or even cache the binaries somewhere and run a second test to do the analysis) but otherwise, it's likely that the server will provide more competent analysis than the device under test. It's a question of getting the output into a suitable format.
Once a MultiNode job is complete, there is a single result bundle which can contain all of the test result data from all of the nodes, including measurements. There is scope for a custom script to optimise the parser to make the data in the result bundle easier to analyse in an image report.
This is the way that MultiNode is designed to work - each test definition massages the test result output into whatever structure is most amenable to being compared and graphed using Image Reports on the server, not on a device under test.
Using the server also means that further data mining is easy by extracting and processing the aggregated result bundle at any time including many months after the original test completed or comparing tests several weeks apart.
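For example, each load generator can simply record its own count as a measurement and leave the summing to the server - a rough sketch, where the log path and the test case name are only placeholders:

    # hypothetical per-node count, pulled from the generator's own log
    REQS=$(grep -c '^GET ' /var/log/loadgen.log)
    lava-test-case loadgen-requests --result pass --measurement "$REQS" --units requests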
The motivating case here is having load generation distributed across various machines: to compute the req/s the server is actually able to manage I want to add up the number of requests each load generator made.
I can sort of see how to do this myself, basically something like this:
- store the data on each node
- arbitrarily pick one node to be the one that does the aggregation
LAVA does this arbitrarily as well - the bundles are aggregated by the job with subid zero, so 1234.0 aggregates for 1234.1 and 1234.2 etc.
- do tar | nc style things to get the data onto that node
- analyze it there and store the results using lava-test-case
Results inside a test case mean that if the analysis needs to be improved, old data cannot be re-processed.
but I was wondering if the LAVA team have any advice here. In particular, steps 2. and 3. seem like something it would be reasonable for LAVA to provide helpers to do.
The LAVA support for this would be to use filters and Image Reports on the server, not during the test when repeating the analysis means repeating the entire test (at which point the data changes under your feet).
Neil Williams codehelp@debian.org writes:
On Wed, 27 Nov 2013 13:56:42 +1300 Michael Hudson-Doyle michael.hudson@linaro.org wrote:
I've been looking at moving my ghetto multinode stuff over to proper LAVA multinode on and off for a while now, and have something that I'm still not sure how best to handle: result aggregation.
MultiNode result bundle aggregation combines completed results after all test cases have run (specifically, during the submit_results action), at which point no further actions will be executed. Aggregation itself happens off device, not even on the dispatcher, it happens on the server. This allows each node to send their result bundle as normal (via the dispatcher over XMLRPC) and it is only the subid-zero job which needs to hang around waiting for other nodes to submit their individual results.
Right. And the "aggregation" that happens at this level is really just that the test runs produced by each node are put in a list? There's no possibility for me to interfere at this stage AIUI (which I think is probably fine and sensible :-p)
My question is: exactly what analysis are you needing to do *on the device under test*
It doesn't have to be on the/a device under test really... but the prototypical example would be the one I gave in my mail, summing the req/s reported by each loadgen node to arrive at a total req/s for the system as a whole.
and can that be done via filters and image reports on the server?
I don't know. Can filters and image reports sum the measurements across a bunch of separate test cases?
If the analysis involves executing binaries compiled on the device, then that would be a reason to copy the binaries between nodes using TCP/IP (or even cache the binaries somewhere and run a second test to do the analysis) but otherwise, it's likely that the server will provide more competent analysis than the device under test. It's a question of getting the output into a suitable format.
Once a MultiNode job is complete, there is a single result bundle which can contain all of the test result data from all of the nodes, including measurements. There is scope for a custom script to optimise the parser to make the data in the result bundle easier to analyse in an image report.
Yeah, I think this is what I was sort of asking for.
This is the way that MultiNode is designed to work - each test definition massages the test result output into whatever structure is most amenable to being compared and graphed using Image Reports on the server, not on a device under test.
Using the server also means that further data mining is easy by extracting and processing the aggregated result bundle at any time including many months after the original test completed or comparing tests several weeks apart.
Well sure, I think it's a bad idea to throw the information that you are aggregating away. But it's nice to have the aggregate req/s in the measurement field so you can get a quick idea of performance changes.
The motivating case here is having load generation distributed across various machines: to compute the req/s the server is actually able to manage I want to add up the number of requests each load generator made.
I can sort of see how to do this myself, basically something like this:
- store the data on each node
- arbitrarily pick one node to be the one that does the aggregation
LAVA does this arbitrarily as well - the bundles are aggregated by the job with subid zero, so 1234.0 aggregates for 1234.1 and 1234.2 etc.
Is there a way for the node to tell if it is running the job with subid 0?
- do tar | nc style things to get the data onto that node
- analyze it there and store the results using lava-test-case
Results inside a test case mean that if the analysis needs to be improved, old data cannot be re-processed.
Not necessarily -- for my tests I also save the entire httperf output as attachments and have scripts that analyze these to produce fancy graphs as well as putting the aggregate req/s in the measurement field. I guess what this means is that the aggregation is only a convenience really -- but probably a fairly important one.
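Roughly, the reporting end of that looks like this (a sketch; the httperf output parsing is simplified and the paths are placeholders):

    # pull the request rate out of each node's saved httperf output and sum it
    TOTAL=$(awk '/^Request rate:/ {sum += $3} END {print sum}' /tmp/results/*/httperf.out)
    lava-test-case aggregate-req-per-sec --result pass --measurement "$TOTAL" --units req/s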
but I was wondering if the LAVA team have any advice here. In particular, steps 2. and 3. seem like something it would be reasonable for LAVA to provide helpers to do.
The LAVA support for this would be to use filters and Image Reports on the server, not during the test when repeating the analysis means repeating the entire test (at which point the data changes under your feet).
Cheers, mwh
On Thu, 28 Nov 2013 14:35:21 +1300 Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Neil Williams codehelp@debian.org writes:
MultiNode result bundle aggregation combines completed results after all test cases have run (specifically, during the submit_results action), at which point no further actions will be executed. Aggregation itself happens off device, not even on the dispatcher, it happens on the server. This allows each node to send their result bundle as normal (via the dispatcher over XMLRPC) and it is only the subid-zero job which needs to hang around waiting for other nodes to submit their individual results.
Right. And the "aggregation" that happens at this level is really just that the test runs produced by each node are put in a list? There's no possibility for me to interfere at this stage AIUI (which I think is probably fine and sensible :-p)
Yes, processing of the aggregated data can be done via filters and image reports or by downloading the bundle and running custom scripts.
My question is: exactly what analysis are you needing to do *on the device under test*
It doesn't have to be on the/a device under test really... but the prototypical example would be the one I gave in my mail, summing the req/s reported by each loadgen node to arrive at a total req/s for the system as a whole.
and can that be done via filters and image reports on the server?
I don't know. Can filters and image reports sum the measurements across a bunch of separate test cases?
Stevan? Does Image Reports 2.0 have this support?
If the analysis involves executing binaries compiled on the device, then that would be a reason to copy the binaries between nodes using TCP/IP (or even cache the binaries somewhere and run a second test to do the analysis) but otherwise, it's likely that the server will provide more competent analysis than the device under test. It's a question of getting the output into a suitable format.
Once a MultiNode job is complete, there is a single result bundle which can contain all of the test result data from all of the nodes, including measurements. There is scope for a custom script to optimise the parser to make the data in the result bundle easier to analyse in an image report.
Yeah, I think this is what I was sort of asking for.
:-) By custom script, I was thinking of a script on each node, written by each test writer, which prepares the output of a test routine for easier parsing in filters and image reports. How much work this needs to do depends on how the work on Image Reports 2.0 develops.
This is the way that MultiNode is designed to work - each test definition massages the test result output into whatever structure is most amenable to being compared and graphed using Image Reports on the server, not on a device under test.
Using the server also means that further data mining is easy by extracting and processing the aggregated result bundle at any time including many months after the original test completed or comparing tests several weeks apart.
Well sure, I think it's a bad idea to throw the information that you are aggregating away. But it's nice to have the aggregate req/s in the measurement field so you can get a quick idea of performance changes.
Agreed. As far as measurement changes over time are concerned, that is absolutely the role of filters and image reports.
The motivating case here is having load generation distributed across various machines: to compute the req/s the server is actually able to manage I want to add up the number of requests each load generator made.
So you would have nodes with different roles in the MultiNode job: generators and servers (or generators and surveyors). Wouldn't it also be possible for the surveyor node(s) to record the measurements during the test? This would be much like Antonio's initial suggestion of a "watching" KVM which does live monitoring & collection rather than aggregation after the event.
I can sort of see how to do this myself, basically something like this:
- store the data on each node
- arbitrarily pick one node to be the one that does the aggregation
LAVA does this arbitrarily as well - the bundles are aggregated by the job with subid zero, so 1234.0 aggregates for 1234.1 and 1234.2 etc.
Is there a way for the node to tell if it is running the job with subid 0?
It shouldn't need to know - whichever node does your load generation calculations will feed its results into the bundles which will be aggregated by LAVA, and the whole set becomes available to filters and image reports.
- do tar | nc style things to get the data onto that node
- analyze it there and store the results using lava-test-case
Results inside a test case mean that if the analysis needs to be improved, old data cannot be re-processed.
Not necessarily -- for my tests I also save the entire httperf output as attachments and have scripts that analyze these to produce fancy graphs as well as putting the aggregate req/s in the measurement field. I guess what this means is that the aggregation is only a convenience really -- but probably a fairly important one.
There are two important advantages to doing this inside Image Reports:
1. Image Reports and filters are not reliant on TCP/IP connections, which some devices simply don't support during test runs.
2. Image Reports can easily work retrospectively across all existing MultiNode and singlenode jobs whereas any change inside single test jobs would not be able to pull data from older tests.
netcat and tar are the quick solution to this specific problem, but there remains a wider issue: LAVA should support calculations across test cases in the image reports.
On Wed, Nov 27, 2013 at 01:56:42PM +1300, Michael Hudson-Doyle wrote:
Hi,
I've been looking at moving my ghetto multinode stuff over to proper LAVA multinode on and off for a while now, and have something that I'm still not sure how best to handle: result aggregation.
The motivating case here is having load generation distributed across various machines: to compute the req/s the server is actually able to manage I want to add up the number of requests each load generator made.
I can sort of see how to do this myself, basically something like this:
- store the data on each node
- arbitrarily pick one node to be the one that does the aggregation
- do tar | nc style things to get the data onto that node
- analyze it there and store the results using lava-test-case
but I was wondering if the LAVA team have any advice here. In particular, steps 2. and 3. seem like something it would be reasonable for LAVA to provide helpers to do.
For 2. I would use a specific device (such as a kvm) with a specific role of "data analysis node" and run my analysis code there. I can't see how LAVA could provide something useful for that (besides documenting this "Use a Separate Data Analysis Node" pattern).
For 3. I think it would make sense to have an API call that you could use from your data analysis node that would retrieve a given directory from the other nodes. Something like
lava-collect PATH DEST
Collects the contents of PATH on all other devices that are part of the multinode job and stores them at DEST locally. For example, the call `lava-collect /var/lib/foobar /var/tmp` would result in:

    /var/tmp
      node01/
        var/lib/foobar
          (stuff)
      node02/
        var/lib/foobar
          (stuff)
      (...)
On Wed, 27 Nov 2013 13:01:36 -0300 Antonio Terceiro antonio.terceiro@linaro.org wrote:
For 3. I think it would make sense to have an API call that you could use from your data analysis node that would retrieve a given directory from the other nodes. Something like
lava-collect PATH DEST
Collects the contents of PATH on all other devices that are part of the multinode job and stores them at DEST locally. For example, the call `lava-collect /var/lib/foobar /var/tmp` would result in:

    /var/tmp
      node01/
        var/lib/foobar
          (stuff)
      node02/
        var/lib/foobar
          (stuff)
      (...)
How does the receiving node authenticate with each other node to get read access?
Data cannot go over the existing API connection, it has to be configured separately over something like TCP/IP and root on node01 does not necessarily have access to anything on node02 without node02 being explicitly configured in advance to either serve files anonymously or allow login.
I'm not sure this is appropriate as a helper across all LAVA deployments and if restricted to the same deployments as lava-network, it would require particular services to be installed and running on all nodes which lava-network does not enforce.
On Wed, Nov 27, 2013 at 04:13:14PM +0000, Neil Williams wrote:
On Wed, 27 Nov 2013 13:01:36 -0300 Antonio Terceiro antonio.terceiro@linaro.org wrote:
For 3. I think it would make sense to have an API call that you could use from your data analysis node that would retrieve a given directory from the other nodes. Something like
lava-collect PATH DEST
Collects the contents of PATH on all other devices that are part of the multinode job and stores them at DEST locally. For example, the call `lava-collect /var/lib/foobar /var/tmp` would result in:

    /var/tmp
      node01/
        var/lib/foobar
          (stuff)
      node02/
        var/lib/foobar
          (stuff)
      (...)
How does the receiving node authenticate with each other node to get read access?
I don't think we need to worry that much about security on disposable test systems.
Data cannot go over the existing API connection, it has to be configured separately over something like TCP/IP and root on node01 does not necessarily have access to anything on node02 without node02 being explicitly configured in advance to either serve files anonymously or allow login.
Good point. My idea was that such a helper would be using lava-network under the hood, but I neglected the sending side. We probably also need a matching call on the sending side (`lava-serve PATH`)?
so the data analysis node does
    lava-sync processing-done
    lava-collect RESULTS-PATH DEST
    # calculate
    ...
and the others
    # run
    ...
    lava-sync processing-done
    lava-serve RESULTS-PATH
(sure we could find better names than lava-collect and lava-serve)
I'm not sure this is appropriate as a helper across all LAVA deployments and if restricted to the same deployments as lava-network, it would require particular services to be installed and running on all nodes which lava-network does not enforce.
With the matching call on the sending side, `tar | nc` ought to be good enough there, and `nc | tar` should do it on the receiving side.
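i.e. something along these lines (untested; the port and paths are placeholders, and nc option spellings differ between netcat variants):

    # sending side, serving its results directory once:
    tar -c -C RESULTS-PATH . | nc -l -p 8000

    # receiving side, repeated once per node:
    nc node01 8000 | tar -x -C DEST/node01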
On Wed, 27 Nov 2013 13:29:16 -0300 Antonio Terceiro antonio.terceiro@linaro.org wrote:
On Wed, Nov 27, 2013 at 04:13:14PM +0000, Neil Williams wrote:
Data cannot go over the existing API connection, it has to be configured separately over something like TCP/IP and root on node01 does not necessarily have access to anything on node02 without node02 being explicitly configured in advance to either serve files anonymously or allow login.
Good point. My idea was that such a helper would be using lava-network under the hood, but I neglected the sending side. We probably also need a matching call on the sending side (`lava-serve PATH`)?
so the data analysis node does
    lava-sync processing-done
    lava-collect RESULTS-PATH DEST
    # calculate
    ...
and the others
    # run
    ...
    lava-sync processing-done
    lava-serve RESULTS-PATH
hmmm.... an extra sync would be required as the listener on each node must open the port before the transmission can start and it is the listener which is doing the initial tar operation, so that will take time.
(sure we could find better names than lava-collect and lava-serve)
lava-nc-listen - offers files on a known port, opens the port
lava-nc-connect - makes connections to each node using the port.

    lava-sync processing-done
    lava-sync connection-ready
    lava-nc-connect RESULTS-PATH DEST DEST DEST
    # calculate
    ...

    # run
    ...
    lava-sync processing-done
    lava-nc-listen RESULTS-PATH
    lava-sync connection-ready
Each call to lava-nc-listen would have to put the call to nc into the background on each node and blindly trust nc to open the port on each node and close when it is done. The only indication that anything happened would then be when the connecting end completes the cycle of connecting to each node.
If tests want error checking, checksums, progress indication, resume support, incremental downloads or any other "typical" support, netcat isn't going to be suitable.
I'm not sure we get much from the helpers that wouldn't be a lot easier to debug by using calls to tar and nc directly. It seems very flaky to me.
SSH or HTTPServer&wget are a lot more work to set up in a test but much more usable. Therefore, I'd be much happier with test writers doing their own connection setup using more reliable tools, rather than possibly giving a false sense of security just because there is a helper available.
After all, HTTPServer&wget is exactly how LAVA sends data between the dispatcher and the board during deployment. It means installing Python (or some other HTTP daemon) on each node and wget on the receiver.
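i.e. something like this (a sketch; the port and paths are placeholders, and it assumes Python 2's SimpleHTTPServer module):

    # on each serving node:
    cd RESULTS-PATH && python -m SimpleHTTPServer 8000 &

    # on the collecting node, once per node:
    wget -r -np -nH -P DEST/node01 http://node01:8000/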
I'm not sure this is appropriate as a helper across all LAVA deployments and if restricted to the same deployments as lava-network, it would require particular services to be installed and running on all nodes which lava-network does not enforce.
With the matching call on the sending side, `tar | nc` ought to be good enough there, and `nc | tar` should do it on the receiving side.
It's still worth investigating whether the whole analysis can be done without needing to transmit data between nodes during the test as this data will be lost at the end of the test and any chance of further analysis of the data is thrown away.
I'm not sure it's helpful to provide tools which encourage test writers to throw data away at the end of a test.
Neil Williams codehelp@debian.org writes:
(sure we could find better names than lava-collect and lava-serve)
lava-nc-listen - offers files on a known port, opens the port
lava-nc-connect - makes connections to each node using the port.

    lava-sync processing-done
    lava-sync connection-ready
    lava-nc-connect RESULTS-PATH DEST DEST DEST
    # calculate
    ...

    # run
    ...
    lava-sync processing-done
    lava-nc-listen RESULTS-PATH
    lava-sync connection-ready
Each call to lava-nc-listen would have to put the call to nc into the background on each node and blindly trust nc to open the port on each node and close when it is done. The only indication that anything happened would then be when the connecting end completes the cycle of connecting to each node.
If tests want error checking, checksums, progress indication, resume support, incremental downloads or any other "typical" support, netcat isn't going to be suitable.
I've mostly had good results with netcat, but that's on highbank where the network does tend to work well; mileage on other devices would vary, I bet.
I'm not sure we get much from the helpers that wouldn't be a lot easier to debug by using calls to tar and nc directly. It seems very flaky to me.
I think it's more about the test code being able to express its intent clearly.
SSH or HTTPServer&wget are a lot more work to set up in a test but much more usable. Therefore, I'd be much happier with test writers doing their own connection setup using more reliable tools, rather than possibly giving a false sense of security just because there is a helper available.
After all, HTTPServer&wget is exactly how LAVA sends data between the dispatcher and the board during deployment. It means installing Python (or some other HTTP daemon) on each node and wget on the receiver.
busybox implements both a simple httpd and wget :-)
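e.g. (a sketch; the port and paths are placeholders):

    # serving node (results tarred up first, since busybox wget has no recursion):
    tar -c -C RESULTS-PATH . > /srv/results.tar
    busybox httpd -p 8000 -h /srv

    # collecting node, once per node:
    busybox wget -O DEST/node01-results.tar http://node01:8000/results.tar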
Cheers, mwh
Antonio Terceiro antonio.terceiro@linaro.org writes:
On Wed, Nov 27, 2013 at 01:56:42PM +1300, Michael Hudson-Doyle wrote:
Hi,
I've been looking at moving my ghetto multinode stuff over to proper LAVA multinode on and off for a while now, and have something that I'm still not sure how best to handle: result aggregation.
The motivating case here is having load generation distributed across various machines: to compute the req/s the server is actually able to manage I want to add up the number of requests each load generator made.
I can sort of see how to do this myself, basically something like this:
- store the data on each node
- arbitrarily pick one node to be the one that does the aggregation
- do tar | nc style things to get the data onto that node
- analyze it there and store the results using lava-test-case
but I was wondering if the LAVA team have any advice here. In particular, steps 2. and 3. seem like something it would be reasonable for LAVA to provide helpers to do.
For 2. I would use a specific device (such as a kvm) with a specific role of "data analysis node" and run my analysis code there. I can't see how LAVA could provide something useful for that (besides documenting this "Use a Separate Data Analysis Node" pattern).
Yeah, this had occurred to me and makes sense. Especially as an extrapolated version of my request might be to generate graphs out of the data, which would require installing packages such as matplotlib....
For 3. I think it would make sense to have an API call that you could use from your data analysis node that would retrieve a given directory from the other nodes. Something like
lava-collect PATH DEST
Collects the contents of PATH on all other devices that are part of the multinode job and stores them at DEST locally. For example, the call `lava-collect /var/lib/foobar /var/tmp` would result in:

    /var/tmp
      node01/
        var/lib/foobar
          (stuff)
      node02/
        var/lib/foobar
          (stuff)
      (...)
Yeah, that's the sort of thing I was thinking of. I'll have a play at implementing it soon, I think; I'll let you know how it goes.
Cheers, mwh
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Antonio Terceiro antonio.terceiro@linaro.org writes:
For 2. I would use a specific device (such as a kvm) with a specific role of "data analysis node" and run my analysis code there. I can't see how LAVA could provide something useful for that (besides documenting this "Use a Separate Data Analysis Node" pattern).
Yeah, this had occurred to me and makes sense. Especially as an extrapolated version of my request might be to generate graphs out of the data, which would require installing packages such as matplotlib....
So I tried this and found something annoying: having this extra device around means that lava-sync is useless in the other bits of the test, because obviously this node that does not do anything until all the other nodes are done does not call lava-sync with the same arguments as the other nodes! I can work around this in a couple of ways for my use case, but it seems a bit of a blocker for a general "Use a Separate Data Analysis Node" pattern.
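One of the workarounds is simply to have the analysis node mirror the barriers the other nodes use, even though it sits idle in between - roughly, with arbitrary message names:

    # on the analysis node, matching the workers' lava-sync calls:
    lava-sync load-started
    lava-sync load-finished
    # ...now collect the data and analyze it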
Cheers, mwh
On Mon, Dec 02, 2013 at 11:53:56AM +1300, Michael Hudson-Doyle wrote:
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Antonio Terceiro antonio.terceiro@linaro.org writes:
For 2. I would use a specific device (such as a kvm) with a specific role of "data analysis node" and run my analysis code there. I can't see how LAVA could provide something useful for that (besides documenting this "Use a Separate Data Analysis Node" pattern).
Yeah, this had occurred to me and makes sense. Especially as an extrapolated version of my request might be to generate graphs out of the data, which would require installing packages such as matplotlib....
So I tried this and found something annoying: having this extra device around means that lava-sync is useless in the other bits of the test, because obviously this node that does not do anything until all the other nodes are done does not call lava-sync with the same arguments as the other nodes! I can work around this in a couple of ways for my use case, but it seems a bit of a blocker for a general "Use a Separate Data Analysis Node" pattern.
That's true! :-/
We probably want to use an existing node for data analysis, then, and we are back to your point of how to select a node to do that.
One guideline would be to use the one node that is different from the others in the test, e.g. in a "1 server to N clients" scenario, it would make sense to do the calculations on the server after the actual load has finished. In less trivial scenarios it's not so obvious where to do the calculations, though.
Antonio Terceiro antonio.terceiro@linaro.org writes:
On Mon, Dec 02, 2013 at 11:53:56AM +1300, Michael Hudson-Doyle wrote:
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Antonio Terceiro antonio.terceiro@linaro.org writes:
For 2. I would use a specific device (such as a kvm) with a specific role of "data analysis node" and run my analysis code there. I can't see how LAVA could provide something useful for that (besides documenting this "Use a Separate Data Analysis Node" pattern).
Yeah, this had occurred to me and makes sense. Especially as an extrapolated version of my request might be to generate graphs out of the data, which would require installing packages such as matplotlib....
So I tried this and found something annoying: having this extra device around means that lava-sync is useless in the other bits of the test, because obviously this node that does not do anything until all the other nodes are done does not call lava-sync with the same arguments as the other nodes! I can work around this in a couple of ways for my use case, but it seems a bit of a blocker for a general "Use a Separate Data Analysis Node" pattern.
That's true! :-/
We probably want to use an existing node for data analysis, then, and we are back to your point of how to select a node to do that.
One guideline would be to use the one node that is different from the others in the test, e.g. in a "1 server to N clients" scenario, it would make sense to do the calculations on the server after the actual load has finished.
Yeah, that's exactly what I am doing.
In less trivial scenarios it's not so obvious where to do the calculations, though.
Maybe the first node reported by lava-group | sort ...
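Something like this, maybe (a sketch; it assumes lava-group lists the device names in the first column and that lava-self prints this node's own name):

    AGGREGATOR=$(lava-group | awk '{print $1}' | sort | head -n 1)
    if [ "$(lava-self)" = "$AGGREGATOR" ]; then
        echo "this node does the collection and aggregation"
    fi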
Cheers, mwh