On 31 October 2017 at 09:47, Dragos Iorgulescu <Dragos.Iorgulescu@enea.com> wrote:

Hi, Neil!

Thanks for clearing things up :)

Attaching the whole plain log...

--
/Dragoş

On 2017-10-31 11:36:27+02:00 Neil Williams wrote:

On 31 October 2017 at 09:22, Dragos Iorgulescu <Dragos.Iorgulescu@enea.com> wrote:

Good morning everyone!

While running some tests using our Linux distro and a SuperMicro Intel D1521 board, we encounter a situation where a very long line (shown as a "broken line" in LAVA's logs - in the WEB UI) seems to interfere with the job run, in a sense that the job eventually ends in a timeout.

<< lava: broken line >> is just an indication that the line could not be parsed as ASCII or any obvious encoding - it probably contains binary sequences.

In the output.yaml on the master, you will have the original content, e.g. from a beaglebone-black job:

- {"dt": "2017-10-27T15:27:51.618988", "lvl": "target", "msg": !!binary "QvW1SrmBmqGVsbGBQg==\n"}

- {"dt": "2017-10-27T15:27:51.622342", "lvl": "target", "msg": !!binary "zaGlalJU5YuVyYE6QpWxwZ0CMr3JgQ==\n"}

This is python's best guess at what was sent.

This does not happen when that particular set of tests is not run and I am wondering if this might be a LAVA issue.

Please attach the complete output.yaml (available via the Actions, plain log link)

It's not related to long lines. It's related to your device not handling the serial connection reliably:

LAVA sent setenv:

- {"dt": "2017-10-27T16:09:07.663040", "lvl": "debug", "msg": {"sending": "setenv loadinitrd 'tftpboot 0x60000000 358/tftp-deploy-VLCRzw/ramdisk/ramdisk.ext4.gz.uboot'\n"}}

Your device dropped the first character but then failed to handle that as an error.

- {"dt": "2017-10-27T16:09:07.856605", "lvl": "target", "msg": "etenv loadinitrd 'tftpboot 0x60000000 358/tftp-deploy-VLCRzw/ramdisk/ramdisk.ext4.gz.uboot'"}

This looks like it was related to the delays caused by running usb start.

If your bootloader supports a "sleep" type command or "true" or similar, it may be worth adding that to your device configuration *as a completely separate line* after usb start so that your device has time to catch up after doing the usb start work and produce a prompt without needing to do much that would actually affect the rest of the job.

Equally, your bootloader seems to just timeout instead of handling errors in commands - typing etenv instead of setenv should produce an error message and a new prompt, not just sit there without any output. Without a fix for that, this device is not going to be reliable in automation.

Avoid using semi-colons in the bootloader configuration for this board - you appear to have problems with some commands not being interpreted correctly, so you will need to know which commands have worked and which have not. Combining multiple commands on one line is not going to help anyone. Not because of a long line but because you won't be able to tell which part failed.

--------------------
A bit of context, first.

SuperMicro boards "make sure" your board connection options are limited to using SoL (serial-over-lan), IPMI and a serial port (but in a very limited manner).
One is not able to do a "typical" image load, since you may not specify hardware addresses or properly interact with the bootloader.

This led to an approach which is best illustrated in the attached diagram.

For short: LAVA instantiates an LXC container, from where we run a series of scripts to prepare the board (copy some images & files on an already available Linux instance, altering the Grub menu and then rebooting the board, making it start the image we wish to test) and then run remote commands via SSH.
So, basically, the LXC container runs each and every command on the board, acting as the middleman.

This is common to a range of boards. It adds an extra level of complexity compared to devices which do not need LXC but it not a problem in itself.

-------------------

Now, we have the following:

https://paste.debian.net/993516/ -> plain log excerpt that contains the very long line I was talking about (please see line no. 90)
https://paste.debian.net/993517/ -> job definition
https://paste.debian.net/993518/ -> test definition

Is there a way to further investigate if this type of output is "giving LAVA a hard time"? Is there a setting that limits interpreted output, somehow?

May it be that this is a LAVA issue?

Kind regards!

--
/Dragoş

_______________________________________________
Lava-users mailing list
Lava-users@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lava-users

--

Neil Williams
=============
neil.williams@linaro.org
http://www.linux.codehelp.co.uk/

Neil Williams
=============
neil.williams@linaro.org
http://www.linux.codehelp.co.uk/