I noticed in the reports view, several jobs which have been stuck for a while:
http://validation.linaro.org/lava-server/scheduler/job/33203
origen02
A health check running for 4 days. Nothing in the log. I cancelled it, but it was stuck in cancelling. So I went into admin, put it offline, and then online to run a health check again. The job itself is still showing as not finished. How do I track it down on control so that we can kill it properly?
http://validation.linaro.org/lava-server/scheduler/job/33382
origen04
A regular job that failed, pushed its result bundle and then never quite stopped running. Same deal as 33203, but I can't get it to run its health check. Any clues?
http://validation.linaro.org/lava-server/scheduler/job/33372
panda09
Same as origen04 - can't get health check to run.
Dave
Dave Pigott, Validation Engineer, Linaro.org
On 09/27/2012 07:08 AM, Dave Pigott wrote:
> I noticed in the reports view, several jobs which have been stuck for a while:
> http://validation.linaro.org/lava-server/scheduler/job/33203
> origen02
> A health check running for 4 days. Nothing in the log. I cancelled it, but it was stuck in cancelling. So I went into admin, put it offline, and then online to run a health check again. The job itself is still showing as not finished. How do I track it down on control so that we can kill it properly?
> http://validation.linaro.org/lava-server/scheduler/job/33382
> origen04
> A regular job that failed, pushed its result bundle and then never quite stopped running. Same deal as 33203, but I can't get it to run its health check. Any clues?
> http://validation.linaro.org/lava-server/scheduler/job/33372
> panda09
> Same as origen04 - can't get health check to run.
I don't have the best answer for this, but I'll share what I do.
1) run some "ps -ef| grep" type commands to see if a scheduler or dispatcher process is still running for that board. I then kill those.
2) usually the job and board get left a bit out of sync. So I run my "cancel-job.py" script on control:/home/doanac/lava-scripts. It looks like:
    #!/srv/lava/instances/production/bin/py
    import sys

    import lava_scheduler_app.models as models

    # Force each job ID given on the command line into the CANCELED state.
    for jid in sys.argv[1:]:
        jid = int(jid)
        print("canceling: %d" % jid)
        job = models.TestJob.objects.get(pk=jid)
        job.status = job.CANCELED
        job.save()
I suspect when mwhudson logs in he may have a better answer.
Andy Doan <andy.doan@linaro.org> writes:
> On 09/27/2012 07:08 AM, Dave Pigott wrote:
>> I noticed in the reports view, several jobs which have been stuck for a while:
>> http://validation.linaro.org/lava-server/scheduler/job/33203
>> origen02
>> A health check running for 4 days. Nothing in the log. I cancelled it, but it was stuck in cancelling. So I went into admin, put it offline, and then online to run a health check again. The job itself is still showing as not finished. How do I track it down on control so that we can kill it properly?
This looks like a job that failed to start properly. There should be stuff in the scheduler log about this...
>> http://validation.linaro.org/lava-server/scheduler/job/33382
>> origen04
>> A regular job that failed, pushed its result bundle and then never quite stopped running. Same deal as 33203, but I can't get it to run its health check. Any clues?
This is https://bugs.launchpad.net/lava-scheduler/+bug/1043059. I've gradually been adding more log output to home in on the cause but it's pretty mysterious. For this case though, the clean up is dead easy: find the scheduler monitor process (ps aux | grep origen04) and send SIGINT to it. This seems to poke the monitor into noticing that the dispatcher has exited.
It seems someone has cleaned this one up in a more aggressive manner, so we'll need to fix up the status in the admin panel.
I don't know why the health job isn't being run.
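For anyone who would rather script the SIGINT clean-up than type it by hand, here is a rough Python 3 sketch of "ps aux | grep origen04, then send SIGINT". The function names are made up for illustration, and it assumes the usual 11-column `ps aux` layout with the command in the last column. It matches any process whose command line mentions the board name, not only the scheduler monitor, so check the list before signalling anything.

```python
import os
import signal
import subprocess


def find_pids(ps_output, needle):
    """Return the PIDs of `ps aux` rows whose command mentions `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the 11th (the command) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or something unexpected
        pid, command = int(fields[1]), fields[10]
        # Skip this script itself, in case the board name is on its command line.
        if needle in command and pid != os.getpid():
            pids.append(pid)
    return pids


def interrupt_board_processes(board):
    """Send SIGINT to every process whose command line mentions `board`."""
    ps_output = subprocess.check_output(["ps", "aux"]).decode("utf-8", "replace")
    pids = find_pids(ps_output, board)
    for pid in pids:
        os.kill(pid, signal.SIGINT)
    return pids
```

Calling `find_pids(...)` on its own and printing the result first is a cheap dry run before calling `interrupt_board_processes("origen04")`.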
>> http://validation.linaro.org/lava-server/scheduler/job/33372
>> panda09
>> Same as origen04 - can't get health check to run.
Same same.
> I don't have the best answer for this, but I'll share what I do.
> - run some "ps -ef| grep" type commands to see if a scheduler or
>   dispatcher process is still running for that board. I then kill those.
Please try to kill them with SIGINT before the more violent signals. It really seems to help for some reason.
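A minimal Python 3 sketch of the "SIGINT first, harsher signals only if needed" idea follows. The helper names, the 10-second grace period and the SIGINT, SIGTERM, SIGKILL order are all assumptions for illustration; nothing in LAVA prescribes them. One known limitation: `process_exists` reports a zombie (exited but not yet reaped by its parent) as still existing.

```python
import os
import signal
import time


def process_exists(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def stop_gently(pid, grace=10.0, poll=0.2, send=os.kill, is_alive=process_exists,
                signals=(signal.SIGINT, signal.SIGTERM, signal.SIGKILL)):
    """Send each signal in turn, waiting up to `grace` seconds after each one.

    Returns True once the process is gone, False if it outlived every signal.
    """
    for sig in signals:
        if not is_alive(pid):
            return True
        try:
            send(pid, sig)
        except ProcessLookupError:
            return True  # it exited between the check and the signal
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            if not is_alive(pid):
                return True
            time.sleep(poll)
    return not is_alive(pid)
```

The `send` and `is_alive` parameters exist only so the escalation logic can be exercised without real processes; in normal use `stop_gently(pid)` is enough.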
> - usually the job and board get left a bit out of sync. So I run my
>   "cancel-job.py" script on control:/home/doanac/lava-scripts. It looks like:
>
>     #!/srv/lava/instances/production/bin/py
>     import sys
>
>     import lava_scheduler_app.models as models
>
>     # Force each job ID given on the command line into the CANCELED state.
>     for jid in sys.argv[1:]:
>         jid = int(jid)
>         print("canceling: %d" % jid)
>         job = models.TestJob.objects.get(pk=jid)
>         job.status = job.CANCELED
>         job.save()
You should also set the status of the device the job was running on to IDLE.
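Putting the two fixes together might look roughly like the sketch below. It is an illustration, not a tested LAVA script: `release_stuck_job` is a made-up name, and the `Device` model, its `IDLE` constant and the `hostname` lookup in the usage comments are assumptions about `lava_scheduler_app.models`. Only `TestJob` and `CANCELED` appear in the script quoted above.

```python
def release_stuck_job(job, device):
    """Mark `job` as cancelled and put `device` back to idle.

    Both arguments are expected to behave like Django model instances:
    they carry their status constants as attributes and have a save() method.
    """
    job.status = job.CANCELED
    job.save()
    # Assumption: the device model exposes an IDLE constant, as the advice
    # to "set the status of the device ... to IDLE" suggests.
    device.status = device.IDLE
    device.save()


# Possible usage inside the instance's Python environment (lookups are assumptions):
#   import lava_scheduler_app.models as models
#   job = models.TestJob.objects.get(pk=33382)
#   device = models.Device.objects.get(hostname="origen04")
#   release_stuck_job(job, device)
```

Passing the objects in, rather than looking them up inside the function, keeps the sketch independent of how the device is actually found.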
> I suspect when mwhudson logs in he may have a better answer.
HTH, a bit.
Cheers, mwh
linaro-validation@lists.linaro.org