Re: [Lava-users] Deadlock in scheduler

23 Apr 2018


      On 23 April 2018 at 11:21, Quentin Schulz quentin.schulz@bootlin.com
wrote:
...
Hi Neil,
Thanks for your prompt answer.
On Fri, Apr 20, 2018 at 07:56:29AM +0100, Neil Williams wrote:
...
On 19 April 2018 at 20:11, Quentin Schulz quentin.schulz@bootlin.com
wrote:
...
Hi all,
I've encountered a deadlock in my LAVA server with the following
scheme.
...
...
I have an at91rm9200ek in my lab that got submitted a lot of multi-node
jobs requesting another "board" (a laptop of type dummy-ssh).
All of my other boards in the lab have received the same multi-node jobs
requesting the same and only laptop.
That is the source of the resource starvation - multiple requirements of a
single device. The scheduler needs to be greedy and grab whatever suitable
devices it can as soon as it can to be able to run MultiNode. The primary
ordering of scheduling is the Test Job ID which is determined at
submission.
...
Why would you order test jobs without knowing if the boards it depends
on are available when it's going to be scheduled? What am I missing?
To avoid the situation where a MultiNode job is constantly waiting for all
devices to be available at exactly the same time. Instances frequently have
long queues of submitted test jobs, a mix of single node and MultiNode. The
MultiNode jobs must be able to grab whatever device is available, in order
of submit time, and then wait for the other part to be available.
Otherwise, all devices would run all single node test jobs in the entire
queue before any MultiNode test jobs could start. Many instances constantly
have a queue of single node test jobs.
...
...
If you have an imbalance between the number of machines which can be
available and then submit MultiNode jobs which all rely on the starved
resource, there is not much LAVA can do currently. We are looking at a way
to reschedule MultiNode test jobs but it is very complex and low
priority.
...
What version of lava-server and lava-dispatcher are you running?
lava-dispatcher                   2018.2.post3-1+jessie
lava-server                       2018.2-1+jessie
lava-coordinator                 0.1.7-1
(You need to upgrade to Stretch - there will be no fixes or upgrades
available for Jessie. All development work must only happen on Stretch. See
the lava-announce mailing list archive.)
...
...
What is the structure of your current lab?
MultiNode is complex - not just at the test job synchronization level but
also at the lab structure / administrative level.
I have two machines. One acting as LAVA server and one as LAVA slave.
The LAVA slave is handling all boards in our lab.
I have one laptop (an actual x86 laptop for which we know the NIC driver
works reliably at high (~1Gbps) speeds) that we use for MultiNode jobs
(actually requesting the laptop and one board at the same time only) to
test network. This laptop is seen as a board by LAVA, there is nothing
LAVA-related on this board (it should be seen as a device).
Then you need more LAVA devices to replicate the role played by the laptop.
Exactly one device for each MultiNode test job which can be submitted at
any one time. Then use device tags to allocate one of the "laptop" devices
to each of the other boards involved in the MultiNode test jobs.
Alternatively, you need to manage both the submissions and the device
availability.
Think of just the roles played by the devices.
There are N client role devices (not in Retired state) and there are X
server role devices where the server role is what the laptop is currently
doing.
You need to have N == X to solve the imbalance in the queue.
If N > 1 (and there are more than one device-type in the count 'N') then
you also need to use device tags so that each device-type has a dedicated
pool of server role devices where the number of devices in the server role
pool exactly matches the number of devices of the device-type using the
specified device tag.
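As a rough illustration of the N == X rule with device tags, the check below
counts client role devices per device-type and compares that with the number
of server role devices carrying the matching tag. It is only a sketch: the
hostnames and the "other-board" device-type are made up, and it assumes each
server role device is tagged with the name of the device-type it serves.

```python
from collections import Counter


def server_pool_shortfall(clients, servers):
    """Return {device_type: missing_server_count} for pools that are too small.

    clients: list of (hostname, device_type) for client role devices
    servers: list of (hostname, tag) for server role devices, where the tag
             names the client device-type that the server is dedicated to
             (an assumption of this sketch)
    """
    needed = Counter(device_type for _, device_type in clients)
    available = Counter(tag for _, tag in servers)
    return {
        device_type: count - available[device_type]
        for device_type, count in needed.items()
        if available[device_type] < count
    }


# Hypothetical lab: one at91rm9200ek and two boards of another type.
clients = [
    ("at91-01", "at91rm9200ek"),
    ("other-01", "other-board"),
    ("other-02", "other-board"),
]
# Only one "laptop" is tagged for the two other-board devices.
servers = [("laptop-01", "at91rm9200ek"), ("laptop-02", "other-board")]

shortfall = server_pool_shortfall(clients, servers)
print(shortfall)  # {'other-board': 1}
```

An empty result means every device-type has a server role pool of the same
size as its client pool.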
...
...
...
I had to take the at91rm9200ek out of the lab because it was misbehaving.
However, LAVA is still scheduling multi-node jobs on the laptop which
require the at91rm9200ek as the other part of the job, while its status
is clearly Maintenance.
A device in Maintenance is still available for scheduling - only Retired is
excluded - test jobs submitted to a Retired device are rejected.
Why is that? The device is explicitly in Maintenance, which IMHO tells
that the board shouldn't be used.
Not for the scheduler - the scheduler can still accept submissions and
queue them up until the device comes out of Maintenance.
This prevents test jobs being rejected during certain kinds of maintenance.
(Wider scale maintenance would involve taking down the UI on the master at
which point submissions would get a 404 but that is up to admins to
schedule and announce etc.)
This is about uptime for busy instances which frequently get batches of
submissions out of operations like cron. The available devices quickly get
swamped but the queue needs to continue accepting jobs until admins decide
that devices which need work are going to be unavailable for long enough
that the resulting queue would be impractical. i.e. when the length of time
to wait for the job exceeds the useful window of the results from the job
or when the number of test jobs in the queue exceeds the ability of the
available devices to keep on top of the queue and avoid ever increasing
queues.
...
...
Once a test job has been submitted, it will be either scheduled or
cancelled.
Yes, that's understood and that makes sense to me. However, for "normal"
jobs, if you can't find a board of device type X that is available, it
does not get scheduled, right? Why can't we do the same for MultiNode
jobs?
Because the MultiNode job will never find all devices in the right state at
the same time once there is a mixed queue of single node and MultiNode jobs.
All devices defined in the MultiNode test job must be available at exactly
the same time. Once there are single node jobs in the queue, that never
happens.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
scheduler considers the queue - MultiNode cannot start (A is busy), so move
on and start the single node job on B (because the single node test job on
B may actually complete before the job on A finishes, so it is inefficient
to keep B idle when it could be doing useful stuff for another user).
A is running
B is running
no actions
A completes and goes to Idle
B is still running
and so the pattern continues for as long as there are any single node test
jobs for either A or B in the queue.
The MultiNode test job never starts because A and B are never Idle at the
same time until the queue is completely empty (which *never* happens in
many instances).
So the scheduler must grab B while it is Idle to prevent the single node
test job starting. Then when A completes, the scheduler must also grab A
before that single node test job starts running.
A is running
B is Idle
MultiNode submitted for A & B
single node submitted for A
single node submitted for B
B is transitioned to Scheduling and is unavailable for the single node test
job.
A is running
B is scheduling
no actions
A completes and goes to Idle
B is scheduling
Scheduler transitions A into scheduling - that test job can now start.
(Now consider MultiNode test jobs covering a dozen devices in an instance
with a hundred mixed devices and permanent queues of single node test jobs.)
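The two walkthroughs above can be reproduced with a toy simulation. This is
not the LAVA scheduler code, just a sketch under simplifying assumptions:
two devices, an endless queue of single node jobs of equal length, and one
MultiNode job needing both A and B. The function name, tick model and job
length are all invented for the illustration.

```python
def multinode_start_time(greedy, horizon=50, single_job_length=3):
    """Return the tick at which the MultiNode job (needing A and B) starts,
    or None if it has not started within `horizon` ticks."""
    busy_until = {"A": 2, "B": 0}  # at tick 0: A is running, B is Idle
    reserved = set()               # devices held in Scheduling for MultiNode
    for tick in range(horizon):
        idle = [
            name for name in busy_until
            if busy_until[name] <= tick and name not in reserved
        ]
        if greedy:
            # Grab every Idle device for the MultiNode job straight away.
            reserved.update(idle)
            if reserved == set(busy_until):
                return tick
        else:
            # Only start MultiNode if every device is Idle at the same time,
            # otherwise let the Idle devices run single node jobs.
            if len(idle) == len(busy_until):
                return tick
            for name in idle:
                busy_until[name] = tick + single_job_length
    return None


print(multinode_start_time(greedy=True))   # 2
print(multinode_start_time(greedy=False))  # None
```

With the greedy policy B is held from tick 0 and the MultiNode job starts as
soon as A finishes. Without it, A and B are never Idle on the same tick for
as long as single node jobs keep arriving.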
The scheduler also needs to be very fast, so the actual decisions need to
be made on quite simple criteria - specifically, without going back to the
database to find out about what else might be in the queue or trying to
second-guess when test jobs might end.
...
...
Now, until I put the at91rm9200ek back in the lab, all my boards are
reserved and scheduling for a multi-node job and thus, my lab is
basically dead.
The correct fix here is to have enough devices of the device-type of the
starved resource such that one of each other device-type can use that
resource simultaneously and then use device-tags to match up groups of
devices so that submitting lots of jobs for one type all at the same time
does not simply consume all of the available resources.
e.g. four device-types - phone, hikey, qemu and panda. Each multinode job
wants a single QEMU with each of the others, so the QEMU type becomes
starved, depending on how jobs are submitted. If two hikey-qemu jobs are
submitted together, then 1 QEMU gets scheduled, waiting for the hikey to
become free after running the first job. If each QEMU has device-tags, then
the second hikey-qemu job will wait not only for the hikey but will also
wait for the one QEMU which has the hikey device tag. This way, only those
jobs would then wait for a QEMU device. There would be three QEMU devices,
one with a device tag like "phone", one with "hikey" and one with "panda".
If another panda device is added, another QEMU with the "panda" device tag
would be required. The number of QEMU devices required is the sum of the
number of devices of each other device-type which may be required in a
MultiNode test job.
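To make the effect of the device tags concrete, here is a simplified model of
tag matching. It is a sketch, not how LAVA implements it: the dictionary
layout, the hostnames and the pick_device() helper are assumptions made for
the example.

```python
def pick_device(devices, device_type, required_tags=()):
    """Return the hostname of the first Idle device of `device_type` whose
    tags include every tag in `required_tags`, or None if none is free."""
    for device in devices:
        if (device["type"] == device_type
                and device["state"] == "Idle"
                and set(required_tags) <= set(device["tags"])):
            return device["hostname"]
    return None


# Three QEMU devices, one per client device-type (hypothetical hostnames).
# The hikey-tagged QEMU is already busy with the first hikey-qemu job.
devices = [
    {"hostname": "qemu-phone", "type": "qemu", "tags": ["phone"], "state": "Idle"},
    {"hostname": "qemu-hikey", "type": "qemu", "tags": ["hikey"], "state": "Running"},
    {"hostname": "qemu-panda", "type": "qemu", "tags": ["panda"], "state": "Idle"},
]

untagged_choice = pick_device(devices, "qemu")
tagged_choice = pick_device(devices, "qemu", ["hikey"])
print(untagged_choice)  # qemu-phone
print(tagged_choice)    # None
```

Without a tag, the second hikey-qemu job would take a QEMU that the phone or
panda jobs need. With the tag, it waits for its own QEMU and leaves the
others free.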
This is a structural problem within your lab.
You would need one "laptop" for each other device-type which can use that
device-type in your lab. Then each "laptop" gets a unique device-tag. Each
test job for at91rm9200ek must specify that the "laptop" device must have
the matching device tag. Each test job for each other device-type uses the
matching device-tag for that device-type. We had this problem in the
Harston lab for a long time when using V1 and had to implement just such a
structure of matched devices and device tags. However, the need for this
disappeared when the Harston lab transitioned all devices and test jobs to
LAVA V2.
I strongly disagree with your statement. A software problem can often be
dealt with by adding more resources but I'm not willing to spend
thousands on something that can be fixed on the software side.
We've been through these loops within the team for many years and have
millions of test jobs which demonstrate the problems and the fix. I'm
afraid you are misunderstanding the problem if you think that there is a
software solution for a queue containing both MultiNode and single node
test jobs - other than the solution we now use in the LAVA scheduler. The
process has been tried and tested over 8 years and millions of test jobs
across dozens of mixed use case instances and has proven to be the most
efficient use of resources across all those models.
Each test job in a MultiNode test is considered separately - if one or more
devices are Idle, then those are immediately put into Scheduling. Only when
all are in Scheduling can any of those jobs start. The status of other test
jobs in the MultiNode group can only be handled at the point when at least
one test job in that MultiNode group is in Scheduling.
...
Aside from a non-negligible financial and time (to set up and maintain)
effort to buy a board with a stable and reliable NIC for each and every
board in my lab, it just isn't our use case.
If I would do such a thing, then my network would be the bottleneck to
my network tests and I'd have to spend a lot (financially and on time or
maintenance) to have a top notch network infrastructure for tests I
don't care if they run one after the other. I can't have a separate
network for each and every board as well, simply because my boards often
have a single Ethernet port, thus I can't separate the test network from
the lab network for, e.g. images downloading that are part of the
booting process, hence I can't do reliable network testing even by
multiplying "laptop" devices.
I can understand it's not your typical use case at Linaro and you've
dozens and dozens of the same board and a huge infrastructure to handle
the whole LAVA lab and maybe people working full-time on LAVA, the lab,
the boards, the infrastructure. But that's the complete opposite of our
use case.
Maybe you can understand ours where we have only one board of each
device type, being part of KernelCI to test and report kernel booting
status and having occasional custom tests (like network) on upstream or
custom branches/repositories. We sporadically work on the lab, fixing
the issues we're seeing with the boards but that's not what we do for a
living.
I do understand and I personally run a lab in much the same way. However,
the code needs to work the same way in that lab as it does in the larger
labs. It is the local configuration and resource availability which must
change to suit.
For now, the best thing is to put devices into Retired so that submissions
are rejected and then you will also have to manage your submissions and
your queue.
We're looking at what the Maintenance state means for MultiNode in
https://projects.linaro.org/browse/LAVA-1299 but it is not acceptable to
refuse submissions when devices are not Retired. Users have an expectation
that devices which are being fixed will come back online at some point - or
will go into Retired. There is also
https://projects.linaro.org/browse/LAVA-595 but that work has not yet been
scoped. It could be a long time before that work starts and will take
months of work once it does start.
The problem is a structural one in the physical resources available in your
local lab. It is a problem we have faced more than once in our own
instances and we have gone down all the various routes until we've come to
the current implementation.
...
We also work actively on the kernel and thus, we take boards (which we
own only once) out of the lab to work on it and then put it into the
lab once we've finished working. This is where we put it in Maintenance
mode as, IMHO, Retired does not cover this use case.
This "Maintenance" can take seconds, days or months.
For me, you're ignoring an issue that is almost nonexistent in your case
It is an issue which has had months of investigation, discussion and
intervention in our use cases. We have spent a very long time going through
all of the permutations.
...
because you've dealt with it by adding as much resource as you could to
make the probability of it happening close to zero. That does not mean
it does not exist. I'm not criticizing the way to deal with it, I'm just
saying this way isn't a path we can take personally.
Then you need to manage the queue on your instance in ways that allow for
your situation.
...
...
...
Let me know if I can be of any help debugging this thing or testing a
possible fix. I'd have a look at the scheduler but you, obviously
knowing the code base way better than I do, might have a quick patch on
hand.
Patches would be a bad solution for a structural problem.
As a different approach, why do you need MultiNode with a "laptop" type
device in the first place? Can the test jobs be reconfigured to use LXC
which does not use MultiNode? What is the "laptop" device-type doing that
cannot be done in an LXC? LXC is created on-the-fly, one for each device,
when the test job requests one. This solved the resource starvation problem
with the majority of MultiNode issues because the work previously done in
the generic QEMU / "laptop" role can just as easily be done in an LXC.
We're testing that Gigabit NICs can actually handle Gbps transfers. We need a
fully available Gbps NIC for each and every test we do to make the results
reliable and consistent.
Then as that resource is limited, you must create a way that only one test
job of this kind can ever actually run at a time. That can be done by
working at the stage prior to submission or it can be done by changing the
device availability such that the submission is rejected. Critically, there
must also be a way to prevent jobs entering the queue if one of the
device-types is not available. That can be easily determined using the
XML-RPC API prior to submission. Once submitted, LAVA must attempt to run
the test job as quickly as possible, under the expectation that devices
which have not been Retired will become available again within a reasonable
amount of time. If that is not the case then those devices should be
Retired. (Devices can be brought out of Retired as easily as going in, it
doesn't have to be a permanent state, nothing is actually deleted from the
device configuration.)
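A pre-submission check along those lines might look like the sketch below.
It rests on assumptions to verify against the API help page of your own
instance: that the server offers a scheduler.devices.list() XML-RPC call,
that each entry carries "type" and "health" fields, and that "Good" and
"Unknown" are the health values worth submitting against. The hostnames in
the sample data are invented.

```python
import xmlrpc.client


def fetch_devices(url):
    """Ask the instance for its device list (not called in this sketch).

    ASSUMPTION: scheduler.devices.list() exists on the server and returns
    a list of dicts including 'type' and 'health'.
    """
    server = xmlrpc.client.ServerProxy(url)
    return server.scheduler.devices.list()


def missing_device_types(devices, required_types,
                         usable_health=("Good", "Unknown")):
    """Return the required device-types with no usable device, sorted."""
    usable = {d["type"] for d in devices if d["health"] in usable_health}
    return sorted(set(required_types) - usable)


# Sample data in the assumed shape, standing in for fetch_devices(url).
devices = [
    {"hostname": "at91rm9200ek-01", "type": "at91rm9200ek",
     "health": "Maintenance"},
    {"hostname": "laptop-01", "type": "dummy-ssh", "health": "Good"},
]

missing = missing_device_types(devices, ["at91rm9200ek", "dummy-ssh"])
if missing:
    print("not submitting, unavailable device-types:", missing)
else:
    print("all device-types available, safe to submit")
```

The submitting script would skip the MultiNode job whenever the list is not
empty, so the laptop is never reserved for a board which cannot join it.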
...
...
What you are describing sounds like a misuse of MultiNode resulting in
resource starvation and the fix is to have enough of the limited resource
to prevent starvation - either by adding hardware and changing the current
test jobs to use device-tags or relocating the work done on the starved
resource into an LXC so that every device can have a dedicated "container"
to do things which cannot be easily done on the device.
Neither of those options is possible in our use case.
I understand the MultiNode scheduler is complex and low priority.
We've modestly contributed to LAVA before, we're not telling you to fix
it ASAP but rather to help or guide us to fix this issue in a way it
could be accepted in the upstream version of LAVA.
If you still stand strong against a patch or if it's a lengthy complete
rework of the scheduler, could we have at least a way to tell for how
long a test has been scheduled (or for how long a board has been
reserved for a test that is scheduled)?
That data is already available in the current UI and over the XML-RPC API
and REST API.
Check for Queued Jobs and the scheduling state, also the job_details call
in XML-RPC. There are a variety of ways of getting the information you
require using the existing APIs - which one you use will depend on your
preference and current scripting.
Rather than polling on XML-RPC, it would be better for a monitoring process
to use ZMQ and the publisher events to get push notifications of change of
state. That lowers the load on the master, depending on how busy the
instance actually becomes.
...
That way we can use an external
tool to monitor this and manually cancel them when needed. Currently, I
don't think there is a way to tell since when the job was scheduled.
Every test job has a database object of submit_time created at the point
where the job is created upon submission.
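An external monitor could use submit_time roughly as follows. This is a
sketch only: the job dictionaries, their "id" and "state" keys and the list
of not-yet-running states are assumptions about what a script would build
from the API results, and the waiting time is measured from submission, not
from the moment a device was reserved.

```python
from datetime import datetime, timedelta

# States assumed to mean "accepted but not yet running".
WAITING_STATES = ("Submitted", "Scheduling", "Scheduled")


def stale_jobs(jobs, now, max_wait):
    """Return the IDs of waiting jobs submitted more than `max_wait` ago."""
    return [
        job["id"] for job in jobs
        if job["state"] in WAITING_STATES
        and now - job["submit_time"] > max_wait
    ]


# Hypothetical snapshot of the queue.
jobs = [
    {"id": 101, "state": "Scheduling", "submit_time": datetime(2018, 4, 23, 8, 0)},
    {"id": 102, "state": "Running", "submit_time": datetime(2018, 4, 23, 7, 0)},
    {"id": 103, "state": "Submitted", "submit_time": datetime(2018, 4, 23, 10, 45)},
]

to_cancel = stale_jobs(jobs, now=datetime(2018, 4, 23, 11, 0),
                       max_wait=timedelta(hours=2))
print(to_cancel)  # [101]
```

The IDs returned are candidates for manual review or cancellation. The same
function works whether the snapshot comes from polling or from publisher
events.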
...
If I have misunderstood, misstated or said something wrong, I'm happy to
be told,
Best regards,
Quentin
-- 

Neil Williams
=============
neil.williams@linaro.org
http://www.linux.codehelp.co.uk/
