I have a multi-node test involving 13 roles that is no longer syncing properly after upgrading to 2016.11 this morning. It seems that 2 or 3 nodes end up waiting for a specific message while the others receive it and move on to the next step. Looking at the dispatcher log, I don't see any errors, but it only logs sends to some of the nodes. For example, I see a message like this for the nodes that work in a given run:
2016-11-10 13:10:37,295 Sending wait messageID 'qa-network-info' to /var/lib/lava/dispatcher/slave/tmp/7615/device.yaml in group 2651c0a0-811f-4b77-bc07-22af31744fe5: {"/var/lib/lava/dispatcher/slave/tmp/7619/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7613/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7623/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7620/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7611/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7621/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7622/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7617/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7618/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7614/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7615/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7616/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7612/device.yaml": {}}
2016-11-10 13:10:37,295 Sending wait response to /var/lib/lava/dispatcher/slave/tmp/7615/device.yaml in group 2651c0a0-811f-4b77-bc07-22af31744fe5: {"message": {"/var/lib/lava/dispatcher/slave/tmp/7619/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7613/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7623/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7620/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7611/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7621/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7622/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7617/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7618/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7614/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7615/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7616/device.yaml": {}, "/var/lib/lava/dispatcher/slave/tmp/7612/device.yaml": {}}, "response": "ack"}
For the nodes that get stuck, there is no message like above.
All of the nodes are qemu type, all on the same host. The nodes that fail are not consistent, but there always seem to be 2 or 3 that fail in every run I have tried.
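For reference, the sync point in question is the usual MultiNode send/wait pattern run inside each node's test shell; roughly the following, with the key/value payload as a placeholder since the real test definitions aren't shown here:

  # Each node publishes under the 'qa-network-info' messageID.
  # (The key=value pair is illustrative; the log above shows empty payloads.)
  lava-send qa-network-info ip=$(hostname -I | awk '{print $1}')

  # Each node then blocks until every job in the group has sent the same
  # messageID, and gets the collected payloads back from the coordinator.
  lava-wait-all qa-network-info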
Is there anything I can look at here to figure out what is happening?
-- James Oakley james.oakley@multapplied.net
On 10 November 2016 at 21:21, James Oakley james.oakley@multapplied.net wrote:
I have a multi-node test involving 13 roles that is no longer syncing properly after upgrading to 2016.11 this morning. It seems that 2 or 3 nodes end up waiting for a specific message while the others receive it and move on to the next step. Looking at the dispatcher log, I don't see any errors, but it only logs sends to some of the nodes. For example, I see a message like this for the nodes that work in a given run:
[log excerpt quoted above, snipped]
For the nodes that get stuck, there is no message like above.
All of the nodes are qemu type, all on the same host.
That many QEMU nodes on one machine could easily consume all of that machine's resources and cause latency or timeouts. You should split the workload over two or more machines by creating new workers for the instance.
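For instance, a remote worker in 2016.11 (the V2, ZMQ-based design) is just another machine running lava-slave pointed at the master; a sketch of the Debian-style config, with the hostname and port numbers as assumptions to verify against your own install:

  # /etc/lava-dispatcher/lava-slave on the new worker (example values)
  MASTER_URL="tcp://lava.multapplied.net:5556"  # master ZMQ endpoint (assumed port)
  LOGGER_URL="tcp://lava.multapplied.net:5555"  # log socket (assumed port)

Devices then need to be reassigned to the new worker in the server admin interface so that the load actually moves.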
The nodes that fail are not consistent, but there always seem to be 2 or 3 that fail in every run I have tried.
Is there anything I can look at here to figure out what is happening?
Check the coordinator logs to see which messages were actually sent and whether all nodes managed to contact the coordinator. Only one node can hold a connection to the coordinator at any one time, and there is no queuing of connections.
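As a sketch, assuming the default Debian log location for lava-coordinator (adjust the path if your install differs):

  # List everything the coordinator saw for this group, then check which
  # client_name values ever issued a lava_wait_all for the messageID.
  grep '2651c0a0-811f-4b77-bc07-22af31744fe5' /var/log/lava-coordinator.log
  grep 'lava_wait_all' /var/log/lava-coordinator.log | grep 'qa-network-info'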
On 10 November 2016 at 13:34, Neil Williams neil.williams@linaro.org wrote:
That many QEMU nodes on one machine could easily consume all of that machine's resources and cause latency or timeouts. You should split the workload over two or more machines by creating new workers for the instance.
It was timing out when I had the dispatcher setting up its virtual disks on local disk storage, but I have since moved the dispatcher cache/tmp to a 48G ramdisk. The machine has 24 cores and 96G of RAM, so contention is no longer an issue. Ultimately, I want to serve RBD COW images from my local Ceph cluster to speed up image deployment, but I haven't looked into what's required for that yet.
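(The rough idea there would be a thin qcow2 overlay per node backed by a shared base image in Ceph; a hypothetical sketch, with the pool and image names made up, assuming the dispatcher's QEMU is built with rbd support:)

  # Copy-on-write overlay for one node: writes stay in the local overlay,
  # reads fall through to the shared base image served from the Ceph cluster.
  qemu-img create -f qcow2 \
      -o backing_file=rbd:lava-images/base-image,backing_fmt=raw \
      node-overlay.qcow2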
Check the coordinator logs to see which messages were actually sent and whether all nodes managed to contact the coordinator. Only one node can hold a connection to the coordinator at any one time, and there is no queuing of connections.
I tried another run and only one node failed (job id 7624). It did appear to contact the coordinator:
2016-11-10 13:42:36,302 lavaSend handler in Coordinator received a messageID 'qa-network-info' for group '963dcade-1f0d-42bb-b537-05a4d6e55fb1' from /var/lib/lava/dispatcher/slave/tmp/7624/device.yaml
It did not seem to get the lava-wait-all for that job, however; I only see the ones from the other jobs:
2016-11-10 13:42:53,566 lava_wait_all: {u'blocksize': 4096, u'client_name': u'/var/lib/lava/dispatcher/slave/tmp/7631/device.yaml', u'hostname': u'newlava', u'request': u'lava_wait_all', u'group_name': u'963dcade-1f0d-42bb-b537-05a4d6e55fb1', u'host': u'lava.multapplied.net', u'role': u'remote-router1', u'messageID': u'qa-network-info', u'poll_delay': 3, u'port': 3079}
These specific tests have been running just fine for months under 2016.9, and the release before that. This problem began immediately after I upgraded to 2016.11 this morning.
On 10 November 2016 at 14:07, James Oakley james.oakley@multapplied.net wrote:
These specific tests have been running just fine for months under 2016.9, and the release before that. This problem began immediately after I upgraded to 2016.11 this morning.
After some digging, I found that there was an external factor causing the tests to hang.
Sorry for the noise.
-- James Oakley james.oakley@multapplied.net