Hi,
We've hit an issue when running lava-master on Debian Jessie and lava-slave on Debian Stretch: after a few minutes, the slave stops working. After some investigation, it turned out to be caused by a difference between the libzmq versions in Jessie (4.0.5+dfsg-2) and Stretch (4.2.1-4), which leads to protocol errors.
The line that detects the error in Stretch is:
https://github.com/zeromq/libzmq/blob/7005f22726d4a6ca527f27560a0a132394fdbb...
This appears to be due to how the "nonce" counter gets written into memory and into the zmq packets: the libzmq version from Jessie uses memcpy, whereas the one in Stretch calls put_uint64. As a result, the byte order has changed from little-endian to big-endian. The packets work until the nonce reaches 255, which the other side reads as 0xff << 56; after that, the leading byte wraps around to 0 and the error is triggered.
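To illustrate the byte-order mismatch, here is a minimal Python sketch. It is not libzmq code: the helper names are made up for this example, and it assumes the counter starts at 1 and that the receiving side rejects any value that is not strictly greater than the previous one. It packs the counter little-endian (what memcpy yields on a little-endian host) and reads it back big-endian (what a put_uint64-style peer would expect).

```python
import struct
from typing import Optional


def as_seen_by_peer(nonce: int) -> int:
    """Write the counter little-endian, then read the same 8 bytes big-endian."""
    wire = struct.pack("<Q", nonce)       # memcpy-style layout on a little-endian host
    return struct.unpack(">Q", wire)[0]   # big-endian interpretation by the peer


def first_rejected_nonce(limit: int = 1000) -> Optional[int]:
    """Return the first counter value whose reinterpreted form fails to increase.

    Assumption: the peer requires each received value to be strictly greater
    than the last one it accepted.
    """
    last_seen = 0
    for nonce in range(1, limit):
        seen = as_seen_by_peer(nonce)
        if seen <= last_seen:
            return nonce
        last_seen = seen
    return None


if __name__ == "__main__":
    print(hex(as_seen_by_peer(255)))   # leading byte is 0xff
    print(hex(as_seen_by_peer(256)))   # leading byte is back to 0x00
    print(first_rejected_nonce())
```

Under those assumptions, the values 1 to 255 map to 1 << 56 up to 0xff << 56 and keep increasing, while 256 maps to 1 << 48, which is smaller than the previous value.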
This is not a LAVA bug as such, rather a libzmq one, but it impacts interoperability between Jessie and Stretch for LAVA so it may need to be documented or resolved somehow. We've installed the new version of libzmq onto our Jessie servers to align them with Stretch; doing this does fix the problem.
Best wishes, Guillaume
On 16 June 2017 at 14:54, Guillaume Tucker guillaume.tucker@collabora.com wrote:
[...]
Sounds like a backport of zeromq3 (4.2.1-4) to jessie-backports would be useful. Then LAVA can depend on the newer version in jessie-backports.
The Stretch release is imminent, so this is not a good time to look for a backport. Once Stretch is released, the backport should be requested.
The package has been backported previously (to wheezy-backports), so it may be a simple task.
https://tracker.debian.org/pkg/zeromq3
I've not seen this problem in the reverse situation - I run my LAVA master on amd64 unstable (to keep up with Django changes) and I run a worker on a cubietruck (armhf) running Jessie.
https://tracker.debian.org/pkg/pyzmq