Andy Doan writes:
On 09/27/2012 07:08 AM, Dave Pigott wrote:
I noticed in the reports view, several jobs which have been stuck for a while:
A health check running for 4 days. Nothing in the log. I cancelled it, but it was stuck in cancelling. So I went into admin, put it offline, and then online to run a health check again. The job itself is still showing as not finished. How do I track it down on control so that we can kill it properly?
This looks like a job that failed to start properly. There should be stuff in the scheduler log about this...
A regular job that failed, pushed its result bundle and then never quite stopped running. Same deal as 33203, but I can't get it to run its health check. Any clues?
This is a known problem; I've gradually been adding more log output to home in on the cause, but it's pretty mysterious. For this case though, the clean-up is dead easy: find the scheduler monitor process (ps aux | grep origen04) and send SIGINT to it. This seems to poke the monitor into noticing that the dispatcher has exited.
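For anyone who wants to script the SIGINT step, something like the following might do. It is only a sketch, not anything from the LAVA tree: the names stop_gently and process_exists are made up for illustration, and the pid still has to be picked out of the ps output by hand. It sends SIGINT first and only moves on to SIGTERM and SIGKILL if the process is still around after a wait.

```python
import errno
import os
import signal
import time


def process_exists(pid):
    """Return True if a process with this pid is still in the process table.

    Note: an exited child that its parent has not yet reaped (a zombie)
    still counts as existing.
    """
    try:
        os.kill(pid, 0)  # signal 0 only checks, nothing is delivered
    except OSError as exc:
        # EPERM means the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def stop_gently(pid, wait=10.0,
                signals=(signal.SIGINT, signal.SIGTERM, signal.SIGKILL)):
    """Send each signal in turn, gentlest first.

    Returns True as soon as the process is gone, or False if it is
    still there after every signal has been tried.
    """
    for sig in signals:
        try:
            os.kill(pid, sig)
        except OSError as exc:
            if exc.errno == errno.ESRCH:
                return True  # no such process: it has already exited
            raise
        deadline = time.time() + wait
        while time.time() < deadline:
            if not process_exists(pid):
                return True
            time.sleep(0.2)
    return False
```

Calling stop_gently(12345), with 12345 standing in for whatever pid ps reported, gives the monitor ten seconds to react to SIGINT before anything harsher is sent. Sending to a process owned by another user needs the matching privileges.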
It seems someone has cleaned this one up in a more aggressive manner, so we'll need to fix up the status in the admin panel.
I don't know why the health job isn't being run.
Same as origen04 - can't get health check to run.
Same same.
I don't have the best answer for this, but I'll share what I do.
- Run some "ps -ef | grep" type commands to see if a scheduler or
  dispatcher process is still running for that board. I then kill those.
Please try to kill them with SIGINT before the more violent signals. It really seems to help for some reason.
- usually the job and board get left a bit out of sync. So I run my
"" script on control:/home/doanac/lava-scripts. It looks like:
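    import sys
    import lava_scheduler_app.models as models

    # Mark each job id given on the command line as canceled.
    for jid in sys.argv[1:]:
        jid = int(jid)
        print "canceling: %d" % jid
        job = models.TestJob.objects.get(pk=jid)
        job.status = job.CANCELED
        job.save()  # without save() the change never reaches the database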
You should also set the status of the device the job was running on to IDLE.
I suspect when mwhudson logs in he may have a better answer.
HTH, a bit.
Cheers, mwh