On Mon, May 19, 2014 at 07:47:09PM +0100, Milosz Wasilewski wrote:
Hi,
I'm trying to submit a job for TC2 now and I'm stuck in a long queue. There seem to be a few multinode Android jobs that run on dummy-ssh and vexpress-tc2 (workload automation). We only have one dummy-ssh device, so there is no way that more than one TC2 will be used with dummy-ssh at the same time. On top of that we have vexpress-tc2-benchmark, which can also run multinode jobs with dummy-ssh.
For some reason, if a couple of multinode jobs are requested for dummy-ssh + vexpress-tc2, the TC2 boards get reserved and there is no way to submit any other jobs to them. While I understand that one board might be in the reserved state, there is no point in reserving all three (there is only one dummy-ssh). IMHO this is a bug in multinode.
This is a known issue. The only way we found to keep multinode jobs from starving (waiting for devices forever) is to reserve their devices as each one becomes available, instead of waiting for a moment when all of the requested devices are free simultaneously.
We did not find a way to also prevent multinode jobs from deadlocking that wouldn't involve a far more complicated mechanism.
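To make the tradeoff concrete, here is a minimal sketch (not LAVA's actual scheduler code; the pool sizes match the lab described above) of the incremental reservation strategy: each job grabs whatever requested devices are currently free and only starts once it holds the whole set. With one dummy-ssh and three vexpress-tc2 boards, a second multinode job ends up pinning a TC2 board it cannot yet use.

```python
from collections import Counter

# Hypothetical device pool matching the lab: 1 dummy-ssh, 3 vexpress-tc2.
POOL = Counter({"dummy-ssh": 1, "vexpress-tc2": 3})

def reserve_incrementally(queue, pool):
    """Reserve each requested device as soon as it is free.

    A job only becomes runnable once it holds its whole set, so a
    partially-satisfied multinode job pins devices that no other job
    (multinode or single-node) can use in the meantime.
    """
    results = []
    for job, wanted in queue:
        held = Counter()
        for dev, n in wanted.items():
            take = min(n, pool[dev])
            pool[dev] -= take
            if take:
                held[dev] = take
        runnable = held == Counter(wanted)
        results.append((job, held, runnable))
    return results

free = POOL.copy()
# Two multinode jobs, each wanting 1 dummy-ssh + 1 vexpress-tc2.
queue = [("job1", {"dummy-ssh": 1, "vexpress-tc2": 1}),
         ("job2", {"dummy-ssh": 1, "vexpress-tc2": 1})]
result = reserve_incrementally(queue, free)
for job, held, runnable in result:
    print(job, dict(held), "runnable" if runnable else "waiting")
# job1 starts with both devices; job2 holds a vexpress-tc2 board while
# waiting for dummy-ssh, so that board is lost to everyone else.
```

The alternative (all-or-nothing allocation, reserving only when every requested device is free at once) avoids this pinning but lets a steady stream of single-node jobs starve the multinode job indefinitely, which is the tradeoff described above.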
Current status is:
dummy-ssh: 7 jobs in the queue
vexpress-tc2: 3 reserved + 3 jobs in the queue
I know that the proper solution would be moving WA to dynamically allocated VMs, but unfortunately licensing is in the way.
Actually, I am working right now on a patch to allow multiple dummy-ssh devices on the same host, which might solve this specific problem (assuming the WA licensing allows multiple simultaneous uses within the same host).