Hi Mirsad, I'm looking at the recent edits to https://wiki.linaro.org/Platform/Validation/Specs/ValidationScheduler and wanted to start a thread to discuss. Would love to hear thoughts from others as well.
We could probably use some more in the way of implementation details, but this is starting to take shape pretty well, good work. I have a few comments below:
Admin users can also cancel any scheduled jobs.
Job submitters should be allowed to cancel their own jobs too, right?
I think in general, the user stories need tweaking. Many of them center around automatic scheduling of jobs based on some event (adding a machine, adding a test, etc). Based on the updated design, this kind of logic would be in the piece we were referring to as the driver. The scheduler shouldn't be making those decisions on its own, but it should provide an interface for humans to schedule jobs (web, cli) as well as an API for machines (the driver) to do this.
should we avoid scheduling image tests twice because a hwpack is coming in after images or vv.
Is this a question? Again, I don't think that's the scheduler's call. The scheduler isn't deciding what tests to run, or what to run them on. In this case, assuming we have the resources to pull it off, running the new image with both the old and the new hwpack would be good to do.
Test job definition
Is this different from the job definition used by the dispatcher? Please tell me if I'm missing something here, but I think to schedule something, you only really need two blobs of information:
1a. specific host to run on -OR- 1b. (any/every system matching given criteria). This one is tricky, and though it sounds really useful, my personal feeling is that it is of questionable value. In theory, it lets you make more efficient use of your hardware when you have multiple identical machines. In practice, what I've seen on similar systems is that humans typically know exactly which machine they want to run something on. Where it might really come into play is later, when we have a driver automatically scheduling jobs for us.
2. job file - this is the piece that the job dispatcher consumes. It could be handwritten, machine generated, or created based on a web form where the user selects what they want.
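To make that concrete, a submission really only needs to carry those two pieces; something like this (a rough sketch only - the field names here are invented, not anything the dispatcher defines today):

    # Hypothetical scheduler submission: the "where" plus the opaque job
    # file that the dispatcher consumes unchanged.
    submission_specific = {
        "target": "beagleboard01",        # 1a. run on exactly this machine
        "job_file": "/path/to/jobfile",   # 2. blob handed to the dispatcher
    }

    submission_by_criteria = {
        "device_type": "beagleboard",     # 1b. any machine matching the criteria
        "job_file": "/path/to/jobfile",
    }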
Test job status
One distinction I want to make here is job status vs. test result. A failed test can certainly have a "complete" job status. Incomplete, as a job status, just means that the dispatcher was unable to finish all the steps in the job. For instance, say we had a test that required an image to be deployed, booted, and a test run on it. If we tried to deploy the image and hit a kernel panic on reboot, that is an incomplete job because it never made it far enough to run the specified test.
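To illustrate the distinction with a throwaway sketch (none of these names are proposed, they are just for the example):

    # Job status: did the dispatcher get through all the steps of the job?
    # Test result: what did the test itself report? The two are orthogonal.
    JOB_STATUSES = ("submitted", "running", "complete", "incomplete", "canceled")

    # A test that ran and failed: the job is still "complete".
    job_a = {"status": "complete", "test_result": "fail"}

    # Deploy hit a kernel panic on reboot, so the test never ran at all:
    # the job is "incomplete" and there is no test result to report.
    job_b = {"status": "incomplete", "test_result": None}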
Link to test results in launch-control
If we tie this closely enough with launch-control, it seems we could just communicate the job id to the dispatcher so that it gets rolled up with the bundle. That way the dashboard would have a backlink to the job, and could create the link to the bundle once it is deserialized. Just a different option if it's easier. I don't see an obvious advantage to either approach.
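Either way the plumbing is small; roughly this (the attribute name and URL are made up for illustration, not the actual bundle format):

    # If the scheduler passes its job id to the dispatcher, the dispatcher can
    # stamp it on the result bundle it submits to launch-control, and the
    # dashboard can build the backlink from that.
    SCHEDULER_URL = "http://validation.example.org/scheduler"  # placeholder

    def stamp_job_id(bundle, job_id):
        # attach the scheduler job id as a bundle attribute (illustrative only)
        bundle.setdefault("attributes", {})["scheduler_job_id"] = str(job_id)
        return bundle

    def backlink(bundle):
        # what the dashboard could render next to the deserialized bundle
        job_id = bundle.get("attributes", {}).get("scheduler_job_id")
        return "%s/job/%s" % (SCHEDULER_URL, job_id) if job_id else None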
Thanks, Paul Larson
On 4 February 2011 21:53, Paul Larson paul.larson@linaro.org wrote:
Hi Mirsad, I'm looking at the recent edits to https://wiki.linaro.org/Platform/Validation/Specs/ValidationScheduler and wanted to start a thread to discuss. Would love to hear thoughts from others as well.
We could probably use some more in the way of implementation details, but this is starting to take shape pretty well, good work. I have a few comments below:
Admin users can also cancel any scheduled jobs.
Job submitters should be allowed to cancel their own jobs too, right?
I think in general, the user stories need tweaking. Many of them center around automatic scheduling of jobs based on some event (adding a machine, adding a test, etc). Based on the updated design, this kind of logic would be in the piece we were referring to as the driver. The scheduler shouldn't be making those decisions on its own, but it should provide an interface for humans to schedule jobs (web, cli) as well as an API for machines (the driver) to do this.
I'd like to add as user stories: Dave wants to rerun a test on a particular machine to see if a failure is machine specific. Dave wants to run the same test on a set of machines to compare the results.
I'd also like there to be a history available for each machine of what has run on it; e.g. knowing that a machine has just been reinstalled or updated might help you understand a failure.
Dave
On 7 February 2011 02:05, David Gilbert david.gilbert@linaro.org wrote:
I'd like to add as user stories: Dave wants to rerun a test on a particular machine to see if a failure is machine specific.
An initial idea we had was to run jobs based on machine type, e.g. BeagleBoard, not on a particular machine, e.g. BeagleBoard_ID001. The dispatcher would choose which particular machine to run on, depending on availability. I understand your point that running on a particular machine is sometimes desirable, but maybe this feature should be enabled for admins trying to track down deviating hardware? Or maybe this is a user story for the dashboard, to have a feature comparing and presenting results from all machines of the same type, or even more broadly for chosen/all machine types we support?
Dave wants to run the same test on a set of machines to compare the results.
This is almost the same as the first. Maybe the better solution, as I wrote above, is to go to the dashboard and compare all the existing results there instead? This assumes of course that there are results already reported for the wanted hardware, which I think would be the case if looking at weekly execution intervals, but probably not daily. What do you think, is this reasonable enough or am I missing something important?
I'd also like there to be a history available for each machine of what has run on it; e.g. knowing that a machine has just been reinstalled or updated might help you understand a failure.
Exactly, I agree. I think this will be solved by the dispatcher when reporting test results to the dashboard. The results in the dashboard should include that information, and even keep history, so I guess it is only a matter of presenting the information in the desired format.
Thanks for your comments Dave!
On 10 February 2011 12:19, Mirsad Vojnikovic mirsad.vojnikovic@linaro.org wrote: <snip>
That I wrote:
I'd like to add as user stories: Dave wants to rerun a test on a particular machine to see if a failure is machine specific.
An initial idea we had was to run jobs based on machine type, e.g. BeagleBoard, not on a particular machine, e.g. BeagleBoard_ID001. The dispatcher would choose which particular machine to run on, depending on availability. I understand your point that running on a particular machine is sometimes desirable, but maybe this feature should be enabled for admins trying to track down deviating hardware? Or maybe this is a user story for the dashboard, to have a feature comparing and presenting results from all machines of the same type, or even more broadly for chosen/all machine types we support?
I'm talking here of the case where the user has run a set of tests and one is showing up as bad and they are trying to work out why; let's say they run the test again and it works on a different machine; they might reasonably want to see if the original machine still fails. Then the second subcase is that we've identified that a particular machine always fails a particular test but no one can explain why; you've been given the job of debugging the test and figuring out why it always fails on that machine. This might not be a hardware/admin issue - it might be something really subtle.
Dave wants to run the same test on a set of machines to compare the results.
This is almost the same as the first. Maybe the better solution, as I wrote above, is to go to the dashboard and compare all the existing results there instead? This assumes of course that there are results already reported for the wanted hardware, which I think would be the case if looking at weekly execution intervals, but probably not daily. What do you think, is this reasonable enough or am I missing something important?
OK, there were a few cases I was thinking of here:
1) A batch of new machines arrives in the data centre; they are apparently identical - you want to run a benchmark on them all and make sure the variance between them is within the expected range.
2) Some upgrade has happened to a set of machines (e.g. new kernel/new Linaro release) rolled out to them all - do they still all behave as expected?
3) You've got a test whose results seem to vary wildly from run to run - is it consistent across machines in the farm?
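For case 1 the check itself is trivial once the numbers are all in one place - something like the following, with made-up numbers, just to show the shape of it:

    # Benchmark scores per supposedly-identical machine (numbers invented).
    scores = {"panda01": 102.1, "panda02": 101.7, "panda03": 97.4}

    mean = sum(scores.values()) / len(scores)
    # relative spread: worst deviation from the mean, as a fraction of the mean
    spread = max(abs(s - mean) for s in scores.values()) / mean
    print("mean=%.1f spread=%.1f%%" % (mean, spread * 100))
    assert spread < 0.05, "variance between 'identical' machines looks suspicious"

The hard part is getting the same benchmark run on all of them in the first place, which is the scheduling bit.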
Note that this set of requirements comes from using a similar testing farm.
Dave
On 10 February 2011 04:30, David Gilbert david.gilbert@linaro.org wrote:
On 10 February 2011 12:19, Mirsad Vojnikovic mirsad.vojnikovic@linaro.org wrote: <snip>
That I wrote:
I'd like to add as user stories: Dave wants to rerun a test on a particular machine to see if a failure is machine specific.
An initial idea we had was to run jobs based on machine type, e.g. BeagleBoard, not on a particular machine, e.g. BeagleBoard_ID001. The dispatcher would choose which particular machine to run on, depending on availability. I understand your point that running on a particular machine is sometimes desirable, but maybe this feature should be enabled for admins trying to track down deviating hardware? Or maybe this is a user story for the dashboard, to have a feature comparing and presenting results from all machines of the same type, or even more broadly for chosen/all machine types we support?
I'm talking here of the case where the user has run a set of tests and one is showing up as bad and they are trying to work out why; let's say they run the test again and it works on a different machine; they might reasonably want to see if the original machine still fails. Then the second subcase is that we've identified that a particular machine always fails a particular test but no one can explain why; you've been given the job of debugging the test and figuring out why it always fails on that machine. This might not be a hardware/admin issue - it might be something really subtle.
I understand what you are aiming at. The question is then whether or not to allow users to submit jobs to particular machine(s). I have no particular problem with allowing it; we can include it in our solution. We can have both choices: run on particular machine(s), or let the system choose one or more from given machine type(s). Anyone else, comments on this?
Dave wants to run the same test on a set of machines to compare the results.
This is almost the same as the first. Maybe the better solution, as I wrote above, is to go to the dashboard and compare all the existing results there instead? This assumes of course that there are results already reported for the wanted hardware, which I think would be the case if looking at weekly execution intervals, but probably not daily. What do you think, is this reasonable enough or am I missing something important?
OK, there were a few cases I was thinking of here:
1) A batch of new machines arrives in the data centre; they are apparently identical - you want to run a benchmark on them all and make sure the variance between them is within the expected range.
2) Some upgrade has happened to a set of machines (e.g. new kernel/new Linaro release) rolled out to them all - do they still all behave as expected?
3) You've got a test whose results seem to vary wildly from run to run - is it consistent across machines in the farm?
OK, I understand better now. For me this is still at the test result level, i.e. the dashboard (launch-control) should produce that kind of report. I cannot see where this fits at the scheduler level. Once we provide the possibility to run jobs on specific boards, it should be easy to retrieve all the needed test reports from the dashboard.
On 10 February 2011 13:14, Mirsad Vojnikovic mirsad.vojnikovic@linaro.org wrote:
On 10 February 2011 04:30, David Gilbert david.gilbert@linaro.org wrote:
OK, there were a few cases I was thinking of here:
1) A batch of new machines arrives in the data centre; they are apparently identical - you want to run a benchmark on them all and make sure the variance between them is within the expected range.
2) Some upgrade has happened to a set of machines (e.g. new kernel/new Linaro release) rolled out to them all - do they still all behave as expected?
3) You've got a test whose results seem to vary wildly from run to run - is it consistent across machines in the farm?
OK, I understand better now. For me this is still at the test result level, i.e. the dashboard (launch-control) should produce that kind of report. I cannot see where this fits at the scheduler level. Once we provide the possibility to run jobs on specific boards, it should be easy to retrieve all the needed test reports from the dashboard.
My only reason for thinking it was a scheduling issue is that you need a way to cause the same test to happen on all the machines in a group; not necessarily at the same time.
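In other words, all it really needs from the scheduler is a loop along these lines (a sketch only - submit_job stands in for whatever submission call the scheduler ends up exposing, the name is invented here):

    # Fan the same job out to every machine in a named group; the runs do not
    # have to happen at the same time, just eventually on each machine.
    def run_on_group(submit_job, machines, job_file):
        return [submit_job(target=machine, job_file=job_file)
                for machine in machines]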
Dave
On 10 February 2011 05:41, David Gilbert david.gilbert@linaro.org wrote:
On 10 February 2011 13:14, Mirsad Vojnikovic mirsad.vojnikovic@linaro.org wrote:
On 10 February 2011 04:30, David Gilbert david.gilbert@linaro.org wrote:
OK, there were a few cases I was thinking of here:
1) A batch of new machines arrives in the data centre; they are apparently identical - you want to run a benchmark on them all and make sure the variance between them is within the expected range.
2) Some upgrade has happened to a set of machines (e.g. new kernel/new Linaro release) rolled out to them all - do they still all behave as expected?
3) You've got a test whose results seem to vary wildly from run to run - is it consistent across machines in the farm?
OK, I understand better now. For me this is still at the test result level, i.e. the dashboard (launch-control) should produce that kind of report. I cannot see where this fits at the scheduler level. Once we provide the possibility to run jobs on specific boards, it should be easy to retrieve all the needed test reports from the dashboard.
My only reason for thinking it was a scheduling issue is that you need a way to cause the same test to happen on all the machines in a group; not necessarily at the same time.
Ah, you are correct, that's excellent - a user story in the scheduler should then look something like this: "Dave wants to rerun a previous test job from test job history". Comments?
Ah, you are correct, that's excellent - a user story in the scheduler should then look something like this: "Dave wants to rerun a previous test job from test job history". Comments?
This would basically be a "resubmit job" feature that would replicate the same parameters as the original job.
As for some of the other things mentioned here, yes, it has always been our intent that you should be able to request a job to run on a specific machine. This is, in fact, the simplest thing to do. As a step up from that, we should provide a way to say "Run this on any single panda board", and the scheduler should be able to pick the one from the list that has the fewest (hopefully 0) jobs queued.
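The selection logic for that can stay very simple - something along these lines, purely as a sketch (queue_depth and submit_job are invented stand-ins, not real scheduler calls):

    # Pick whichever board of the requested type has the fewest jobs queued.
    def pick_device(devices, device_type, queue_depth):
        candidates = [d for d in devices if d["type"] == device_type]
        if not candidates:
            raise ValueError("no %s boards registered" % device_type)
        return min(candidates, key=lambda d: queue_depth(d["name"]))

    # "Resubmit job" is then just replaying the stored parameters of an old
    # job as a fresh submission.
    def resubmit(old_job, submit_job):
        return submit_job(target=old_job.get("target"),
                          device_type=old_job.get("device_type"),
                          job_file=old_job["job_file"])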
Thanks, Paul Larson
I have now updated the spec with your comments, thanks!
https://wiki.linaro.org/Platform/Validation/Specs/ValidationScheduler
On 4 February 2011 13:53, Paul Larson paul.larson@linaro.org wrote:
Hi Mirsad, I'm looking at the recent edits to https://wiki.linaro.org/Platform/Validation/Specs/ValidationScheduler and wanted to start a thread to discuss. Would love to hear thoughts from others as well.
We could probably use some more in the way of implementation details, but this is starting to take shape pretty well, good work. I have a few comments below:
Admin users can also cancel any scheduled jobs.
Job submitters should be allowed to cancel their own jobs too, right?
Correct, that's described for 'normal users': "Normal users will be able to define a test job, submit it for execution, cancel an ongoing job...". I will clarify this more explicitly in the spec.
I think in general, the user stories need tweaking. Many of them center around automatic scheduling of jobs based on some event (adding a machine, adding a test, etc). Based on the updated design, this kind of logic would be in the piece we were referring to as the driver. The scheduler shouldn't be making those decisions on its own, but it should provide an interface for humans to schedule jobs (web, cli) as well as an API for machines (the driver) to do this.
Agree, we had some discussion about the driver part which didn't end with any specific conclusion, so I just kept the driver user stories in the scheduler spec. I will remove the driver-specific user stories and develop the scheduler API definition and usage in more detail.
should we avoid scheduling image tests twice because a hwpack is coming in after images or vv.
Is this a question? Again, I don't think that's the scheduler's call. The scheduler isn't deciding what tests to run, or what to run them on. In this case, assuming we have the resources to pull it off, running the new image with both the old and the new hwpack would be good to do.
Agree, will remove this.
Test job definition
Is this different from the job definition used by the dispatcher? Please tell me if I'm missing something here, but I think to schedule something, you only really need two blobs of information:
1a. specific host to run on -OR- 1b. (any/every system matching given criteria). This one is tricky, and though it sounds really useful, my personal feeling is that it is of questionable value. In theory, it lets you make more efficient use of your hardware when you have multiple identical machines. In practice, what I've seen on similar systems is that humans typically know exactly which machine they want to run something on. Where it might really come into play is later, when we have a driver automatically scheduling jobs for us.
2. job file - this is the piece that the job dispatcher consumes. It could be handwritten, machine generated, or created based on a web form where the user selects what they want.
This is the same test job definition that the dispatcher will use. The idea here is that the end-users define a test job, which is then pushed to the dispatcher in some way. The scheduler will provide the web form you mention under 2. Is there something I'm missing here?
Test job status
One distinction I want to make here is job status vs. test result. A failed test can certainly have a "complete" job status. Incomplete, as a job status, just means that the dispatcher was unable to finish all the steps in the job. For instance, say we had a test that required an image to be deployed, booted, and a test run on it. If we tried to deploy the image and hit a kernel panic on reboot, that is an incomplete job because it never made it far enough to run the specified test.
Exactly, that was my idea from the beginning. I will try to expand the spec on this issue. One question here is whether we want to collect logs from a failed test job and make them visible from the scheduler? I guess we don't want these logs/results pushed to the dashboard.
Link to test results in launch-control
If we tie this closely enough with launch-control, it seems we could just communicate the job id to the dispatcher so that it gets rolled up with the bundle. That way the dashboard would have a backlink to the job, and could create the link to the bundle once it is deserialized. Just a different option if it's easier. I don't see an obvious advantage to either approach.
I like the backlink idea and solution, and will put it in the spec (the dashboard pointing to the test job in the scheduler). And the test job ID is maybe all the scheduler needs to produce a link to test results in the dashboard. Zygmunt, do you have any comments on this?
Thanks, Paul Larson
Thanks for the comments, Paul!