On 09/27/2012 07:08 AM, Dave Pigott wrote:
I noticed in the reports view, several jobs which have been stuck for a while:
http://validation.linaro.org/lava-server/scheduler/job/33203
origen02
A health check running for 4 days. Nothing in the log. I cancelled it, but it was stuck in cancelling. So I went into admin, put it offline, and then online to run a health check again. The job itself is still showing as not finished. How do I track it down on control so that we can kill it properly?
http://validation.linaro.org/lava-server/scheduler/job/33382
origen04
A regular job that failed, pushed its result bundle and then never quite stopped running. Same deal as 33203, but I can't get it to run its health check. Any clues?
http://validation.linaro.org/lava-server/scheduler/job/33372
panda09
Same as origen04 - can't get health check to run.
I don't have the best answer for this, but I'll share what I do.
1) Run some "ps -ef | grep" type commands to see whether a scheduler or dispatcher process is still running for that board, and kill any that are.
2) Usually the job and board are left a bit out of sync, so I run my "cancel-job.py" script, which lives in control:/home/doanac/lava-scripts. It looks like:
#!/srv/lava/instances/production/bin/py
import sys

import lava_scheduler_app.models as models

for jid in sys.argv[1:]:
    jid = int(jid)
    print("canceling: %d" % jid)
    job = models.TestJob.objects.get(pk=jid)
    job.status = job.CANCELED
    job.save()
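For step 1, the "ps -ef | grep" pass could be scripted along the following lines. This is only a rough sketch, not something I run on control: the function name find_board_processes and the script name in the usage message are made up for illustration, and it assumes the usual eight-column "ps -ef" layout (UID PID PPID C STIME TTY TIME CMD) with the board name appearing somewhere in the command line. It only lists candidates; killing them is still a manual decision.

```python
import os
import subprocess
import sys


def find_board_processes(ps_output, board, skip_pid=None):
    """Return (pid, command) pairs for `ps -ef` rows whose command mentions board.

    Hypothetical helper: assumes the standard eight-column `ps -ef` layout.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # the 8th field is the full command line
        if len(fields) < 8:
            continue
        pid = int(fields[1])
        command = fields[7]
        if board in command and pid != skip_pid:
            matches.append((pid, command))
    return matches


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: find-board-procs.py BOARD")  # hypothetical script name
    output = subprocess.check_output(["ps", "-ef"], universal_newlines=True)
    # Skip this script's own process, whose command line also names the board.
    for pid, command in find_board_processes(output, sys.argv[1], skip_pid=os.getpid()):
        print("%d\t%s" % (pid, command))
```

Running it with a board name such as origen02 prints one PID and command line per match, which you can inspect before deciding what to kill.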
I suspect when mwhudson logs in he may have a better answer.