On 5 January 2018 at 10:52, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:
Hello,

We have an installation where we use LAVA 2017.12. We are regularly
seeing jobs that remain stuck for several days.

The last step of removing the V1 codebase was to rewrite the scheduler state machine to remove legacy problems - some of which are similar to what you have described. This is currently being tested for the 2018.1 release. The changes also include a different (hopefully easier to follow) way of logging the state changes of each device, including more details such as the job id.

Most likely, it comes about because of the complexity of handling the current job as part of the device object and, at the same time, the actual device as part of the testjob object. We have dropped that problem in 2018.1 (which is currently on staging.validation.linaro.org, built from the master branches of lava-server and lava-dispatcher using the staging-repo).

Your problem is likely to be the database states for the device and the test job. In 2017.12 and earlier, for a device to be reserved for a test job, the device must have no current_job entry and the test job needs to be in Submitted with no actual_device entry. The actual_device is then populated when the job moves to Reserved.

If jobs are stuck in Submitted, it could be because the device still has a current_job entry despite otherwise being in the correct state - the affected test job also needs to be checked to make sure it is not already Running.

(This complexity is the main reason why we've done the new state machine.)
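
For reference, both states can be inspected from the Django shell on the master. The following is only a rough sketch, assuming the 2017.12 field names (TestJob.status, TestJob.actual_device, Device.current_job) and a made-up device hostname - check it against the lava_scheduler_app models on your installation:

  $ sudo lava-server manage shell
  >>> from lava_scheduler_app.models import Device, TestJob
  >>> job = TestJob.objects.get(id=855671)          # the stuck job from your example
  >>> job.get_status_display(), job.actual_device   # expect Submitted and None while stuck
  >>> dev = Device.objects.get(hostname='armada-xp-gp')     # hypothetical hostname
  >>> dev.get_status_display(), dev.current_job     # a stale current_job here blocks new reservations

If the device reports a schedulable state but still points at an old current_job, that is exactly the kind of inconsistency described above.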
 

For example, I have a job right now on the Armada XP GP that is stuck
since 1 day and 11 hours. The log visible in the LAVA Web interface
looks like this:

  http://code.bulix.org/7pvru8-255308?raw

This is job #855671 in our setup.

The logs on the lava-slave look like this:

  http://code.bulix.org/c5tejy-255312?raw

So, from the lava-slave point of view, the job is finished.

However, the "Job END" message had to be resent several times to the
master. Interestingly, this sequence led to a very nice:

  ERROR [855671] lava-run crashed

On the lava-master side (which runs on another machine), the logs
look like this:

  http://code.bulix.org/b61keb-255316?raw

And this happens for lots of jobs. Pretty much every day or two, we
have ten boards stuck in this situation.

This should not be happening that often. We have seen it happen when lots of test jobs get cancelled while there is a long queue, and also when the setup itself is buggy.

Check that NONE of the workers have any V1 hangovers - that includes lava-server still being installed on the worker and a lava-master process still running there. Follow the docs on how to clean up a worker which used to support V1:

https://staging.validation.linaro.org/static/docs/v2/pipeline-server.html#disabling-v1-on-pipeline-dispatchers

Similarly, check https://staging.validation.linaro.org/static/docs/v2/pipeline-server.html#disabling-v1-support-on-the-master
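
As a quick check on each worker (assuming a Debian-based install and the standard V2 service names), something like:

  $ dpkg -l | grep lava            # lava-server should NOT appear on a worker
  $ ps aux | grep lava-master      # no lava-master process should be running here
  $ systemctl status lava-slave    # only lava-slave should be active on a V2 worker

Anything left over from V1 on a worker can interfere with the master's view of the job state.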
 

I have the lava-master logs with DEBUGs if this can be helpful.
However, most DEBUG logs don't have the job number in them, which makes
it difficult to associate the DEBUG messages with the problematic job
(since numerous other jobs are running).

Does anyone have an idea what could be causing this? Or how to debug
this further?

Best regards,

Thomas Petazzoni
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
_______________________________________________
Lava-users mailing list
Lava-users@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lava-users


