Hello,
After upgrading to 2018.4 (also tried .5) many of our device-types using base-uboot.jinja2 are broken. While I really like the major improvement to run commands individually, there seems to be some problems and the LAVA output logs are very confusing, showing concatenated strings, etc.
Here is an example for an upstream device-type (meson-gxbb-p200), and here is where it starts interacting with u-boot: http://khilman.ddns.net/scheduler/job/15#L336
The "Parsed boot commands" look perfect, and all the commands in black all look good, but notice the commands at the u-boot prompt, they appear to be concatenated, starting right away at the "setenv initrd_high ..."
However, observing the commands on the actual serial port (I use conmux, so can observe the serial console interactions directly), I'm not seeing concatenated strings, but the "setenv serverip ..." never shows up, so the TFTP downloads fail, and the job fails.
Here's what I see directly on the serial console:
Hit Enter or space or Ctrl+C key to stop autoboot -- : 0 gxb_p200_v1# gxb_p200_v1#setenv autoload no gxb_p200_v1#setenv initrd_high 0xffffffff gxb_p200_v1#setenv fdt_high 0xffffffff gxb_p200_v1#dhcp dwmac.c9410000 Waiting for PHY auto negotiation to complete.. done Speed: 100, full duplex BOOTP broadcast 1 BOOTP broadcast 2 DHCP client bound to address 192.168.0.216 (267 ms) gxb_p200_v1#tftp 0x1080000 14/tftp-deploy-5v1wo7fv/kernel/uImage Speed: 100, full duplex Using dwmac.c9410000 device TFTP from server 192.168.0.1; our IP address is 192.168.0.216 Filename '14/tftp-deploy-5v1wo7fv/kernel/uImage'. Load address: 0x1080000 Loading: * TFTP error: 'File not found' (1)
Even more interesting is that on the same setup, a beaglebone-black device, using the same base-uboot.jinja2 is working just fine: http://khilman.ddns.net/scheduler/job/1
Any help would be appreciated, I'm thoroughly confused by what's going on here.
Thanks,
Kevin
On 30 May 2018 at 19:28, Kevin Hilman khilman@baylibre.com wrote:
Hello,
After upgrading to 2018.4 (also tried .5) many of our device-types using base-uboot.jinja2 are broken. While I really like the major improvement to run commands individually, there seems to be some problems and the LAVA output logs are very confusing, showing concatenated strings, etc.
Here is an example for an upstream device-type (meson-gxbb-p200), and here is where it starts interacting with u-boot: http://khilman.ddns.net/scheduler/job/15#L336
This is a classic presentation of the receiving end not coping with the speed of the transfer of the characters - the device-type template can define:
{% set boot_character_delay = 10 %}
This adds the specified number of milliseconds between each character sent along the serial, for all strings sent by the boot action. It shouldn't make any noticeable difference to the speed of the test job but it should dramatically improve the ability of the receiving buffer to cope with automated text entry instead of keyboard entry.
I suspect the older methods were just below some threshold in the DUT.
(There is also support for test_character_delay for similar reasons, should that be necessary.)
The "Parsed boot commands" look perfect, and all the commands in black all look good, but notice the commands at the u-boot prompt, they appear to be concatenated, starting right away at the "setenv initrd_high ..."
However, observing the commands on the actual serial port (I use conmux, so can observe the serial console interactions directly), I'm not seeing concatenated strings, but the "setenv serverip ..." never shows up, so the TFTP downloads fail, and the job fails.
Here's what I see directly on the serial console:
Hit Enter or space or Ctrl+C key to stop autoboot -- : 0 gxb_p200_v1# gxb_p200_v1#setenv autoload no gxb_p200_v1#setenv initrd_high 0xffffffff gxb_p200_v1#setenv fdt_high 0xffffffff gxb_p200_v1#dhcp dwmac.c9410000 Waiting for PHY auto negotiation to complete.. done Speed: 100, full duplex BOOTP broadcast 1 BOOTP broadcast 2 DHCP client bound to address 192.168.0.216 (267 ms) gxb_p200_v1#tftp 0x1080000 14/tftp-deploy-5v1wo7fv/kernel/uImage Speed: 100, full duplex Using dwmac.c9410000 device TFTP from server 192.168.0.1; our IP address is 192.168.0.216 Filename '14/tftp-deploy-5v1wo7fv/kernel/uImage'. Load address: 0x1080000 Loading: * TFTP error: 'File not found' (1)
Even more interesting is that on the same setup, a beaglebone-black device, using the same base-uboot.jinja2 is working just fine: http://khilman.ddns.net/scheduler/job/1
That is also typical of character delay issues - very device-type specific.
Any help would be appreciated, I'm thoroughly confused by what's going on here.
It took quite some time to identify this originally too.
There is a mention in the docs but it's not indexed well and that will be improved as an action from your question. There are a number of examples of it in the other device type templates.
https://lava.codehelp.co.uk/static/docs/v2/device-integration.html#extend-th... https://lava.codehelp.co.uk/static/docs/v2/ipmi-pxe-deploy.html#serial-over-... (which is where this was first found)
Hi Neil,
On Thu, May 31, 2018 at 12:30 AM, Neil Williams neil.williams@linaro.org wrote:
On 30 May 2018 at 19:28, Kevin Hilman khilman@baylibre.com wrote:
Hello,
After upgrading to 2018.4 (also tried .5) many of our device-types using base-uboot.jinja2 are broken. While I really like the major improvement to run commands individually, there seems to be some problems and the LAVA output logs are very confusing, showing concatenated strings, etc.
Here is an example for an upstream device-type (meson-gxbb-p200), and here is where it starts interacting with u-boot: http://khilman.ddns.net/scheduler/job/15#L336
This is a classic presentation of the receiving end not coping with the speed of the transfer of the characters - the device-type template can define:
{% set boot_character_delay = 10 %}
Yes, I'm aware of this feature and actually tried, but it doesn't help. I use that feature on several platforms that have buggy UART drivers in uboot, or that don't have proper FIFO control so commands need to be slowed down, so I'm aware of the classic symptoms and that solution. Sorry, I should've mentioned that I tried that in my original post.
However, for me, there are several reason why this is not a classic symptom
- it worked without any delays before the 2018.4 upgrade - it works fine in 2018.4 if we replace base-uboot.jinja2 with one from 2017.11 - the same board works with other pexpect tooling (pyboot) driving u-boot without any delays - it fails in the same way on several boards: it's always the command right after dhcp that's missing
Also, the classic symptoms that require that fix (at least in my experience) are randomly lost characters, not entire commands that just go missing.
To me what this actually looks like is LAVA is not fully waiting for a prompt before sending the next command. IOW, it sends a command before the prompt shows up, and so that command is totally lost. I haven't dug enough in the code to know how/why it's happening, but that's what it "feels" like so far.
Kevin
On 31 May 2018 at 16:01, Kevin Hilman khilman@baylibre.com wrote:
Hi Neil,
On Thu, May 31, 2018 at 12:30 AM, Neil Williams neil.williams@linaro.org wrote:
On 30 May 2018 at 19:28, Kevin Hilman khilman@baylibre.com wrote:
Hello,
After upgrading to 2018.4 (also tried .5) many of our device-types using base-uboot.jinja2 are broken. While I really like the major improvement to run commands individually, there seems to be some problems and the LAVA output logs are very confusing, showing concatenated strings, etc.
Here is an example for an upstream device-type (meson-gxbb-p200), and here is where it starts interacting with u-boot: http://khilman.ddns.net/scheduler/job/15#L336
This is a classic presentation of the receiving end not coping with the speed of the transfer of the characters - the device-type template can define:
{% set boot_character_delay = 10 %}
Yes, I'm aware of this feature and actually tried, but it doesn't help. I use that feature on several platforms that have buggy UART drivers in uboot, or that don't have proper FIFO control so commands need to be slowed down, so I'm aware of the classic symptoms and that solution. Sorry, I should've mentioned that I tried that in my original post.
That board is doing bad stuff on the U-Boot command line - commands are being repeated in the output despite only being sent once. That then causes the prompt to appear and therefore match.
gxb_p200_v1#setenv autoload no
appears twice in that log but was only sent once.
setenv initrd_high 0xffffffff
appears twice, although the second one is without the prompt (which is what should be happening because BBB does the same too).
The serverip *is* sent but the prompt is already out of sync, so the dhcp output happens afterwards.
At least in job 20, the TFTP does start but the board mangled the command to load the ramdisk.
The purpose of this change is to send the U-Boot commands and do the operations of those commands at the same time.
https://review.linaro.org/#/c/23195/
Change the flow of bootloader commands so they are executed individually
For uboot, instead of building and sending a singular large boot command, execute each step individually to allow better monitoring of errors. Also, make bootloader error messages per-bootloader.
That double prompt output is the problem. The old jinja2 template just happened to work with this because all the work was bundled up in one lump at the end. This was actually much more prone to errors on a wide range of boards.
However, for me, there are several reason why this is not a classic symptom
- it worked without any delays before the 2018.4 upgrade
- it works fine in 2018.4 if we replace base-uboot.jinja2 with one from
2017.11
That's more of a change than it seems - it radically changes how the interaction occurs with U-Boot - pushing back to the bundle of commands which caused errors elsewhere.
- the same board works with other pexpect tooling (pyboot) driving
u-boot without any delays
- it fails in the same way on several boards: it's always the command
right after dhcp that's missing
We're not seeing failures on boards available on staging.validation.linaro.org - which boards are failing?
Also, the classic symptoms that require that fix (at least in my experience) are randomly lost characters, not entire commands that just go missing.
We've seen more than just missing characters. Besides, if the missing character was a newline, then concatenation of commands might be seen but the execution of the command would be missing.
To me what this actually looks like is LAVA is not fully waiting for a prompt before sending the next command.
LAVA saw the prompt - the board issued it twice. The same prompt is being expected for the next line, so LAVA issued that line. There is no way to know that the board has not processed the previous line, we have to rely on the appearance of the prompt.
IOW, it sends a command before the prompt shows up, and so that command is totally lost. I haven't dug enough in the code to know how/why it's happening, but that's what it "feels" like so far.
The device is duplicating the output (at a very early stage) and that output includes a prompt, so the issuing of commands gets ahead of the processing.
I think we need a staging instance which has these boards available and which runs functional tests (where the test job never changes) on a regular basis, reporting back by email on incomplete test jobs. We do exactly this with staging.validation.linaro.org but that only covers those devices. The instance should use the nightly builds from the staging-repo (although it doesn't have to upgrade or run functional tests every night) so that we get a report before we make the release.
We're developing the docker support so that this can be done without needing a complete set of new devices, as this can be problematic. The idea will be that default runs use the production docker and functional tests use a docker containing the code from the nightly build.
Whilst this can work for the dispatcher now, it does not account for changes in the Jinja2 templates and we're working on lava-server docker support. It isn't ready yet.
This is one of those situations where there isn't a good alternative for simply having the relevant devices available for functional testing based on the staging-repo builds with reports coming in ahead of a release. That will provide some assurance ahead of upgrading the production instance.
It's just possible that the interrupt_char could be causing this problem, you've got a newline set - that could end up being carried over into the shell and U-Boot has a nasty habit of repeating the last command if a newline is entered with no other content.
IOW this change may have revealed an inherent problem in the device config which was hidden in the previous "bundled" set of commands.
Kevin
On Thu, May 31, 2018 at 8:38 AM, Neil Williams neil.williams@linaro.org wrote:
[...]
It's just possible that the interrupt_char could be causing this problem, you've got a newline set - that could end up being carried over into the shell and U-Boot has a nasty habit of repeating the last command if a newline is entered with no other content.
It's not only just possible, that was it! :) Simple fix[1] makes it happy again, and no boot_character_delay needed. I'll submit a patch shortly.
Breaking into uboot with '\n' was creating an extra prompt, which was surely causing problems with the pexpect matching in LAVA, making it think that it received a prompt already, when it was actually a previous prompt.
Optional suggestion: LAVA could be a bit more robust and handle this case by itself by "flushing" pexpect after breaking into uboot, and before sending any commands. Simply waiting for a prompt with a short timeout should catch this case, and catch the "stray" prompt, or just timeout and continue if it's not present.
Anyways, thanks for making the suggestion,
Kevin
[1] diff --git a/lava_scheduler_app/tests/device-types/meson-gx-common.jinja2 b/lava_scheduler_app/tests/device-types/meson-gx-common.jinja2 index c666bea2fdc2..6b3857c614fe 100644 --- a/lava_scheduler_app/tests/device-types/meson-gx-common.jinja2 +++ b/lava_scheduler_app/tests/device-types/meson-gx-common.jinja2 @@ -5,7 +5,7 @@ {% set console_device = console_device|default('ttyAML0') %} {% set baud_rate = baud_rate|default(115200) %} {% set interrupt_prompt = interrupt_prompt|default('stop autoboot') %} -{% set interrupt_char = interrupt_char|default('\n') %} +{% set interrupt_char = interrupt_char|default('') %}
{% set bootm_kernel_addr = '0x1080000' %} {% set bootm_dtb_addr = '0x1000000' %}
On Thu, May 31, 2018 at 9:53 AM, Kevin Hilman khilman@baylibre.com wrote:
On Thu, May 31, 2018 at 8:38 AM, Neil Williams neil.williams@linaro.org wrote:
[...]
It's just possible that the interrupt_char could be causing this problem, you've got a newline set - that could end up being carried over into the shell and U-Boot has a nasty habit of repeating the last command if a newline is entered with no other content.
It's not only just possible, that was it! :) Simple fix[1] makes it happy again, and no boot_character_delay needed. I'll submit a patch shortly.
I've got one more hiccup. I have a particular vendor uboot which requires either <enter> or Ctrl-C. No other chars work. :( The <enter> causes the "extra prompt problem", and I'm not seeing how to send ctrl chars with interrupt_char. I tried \x03 and LAVA didn't like that:
Infrastructure error: Cannot parse job definition: unacceptable character #x0003: special characters are not allowed in "<unicode string>", position 2829
I see there is interrupt_ctrl_list, but that doesn't seem to be used by base.jinja2 for uboot. Is there any reason not to do this[1] so that interrupt_ctrl_list defaults to empty, but could be set from the device-type, or device dict? I tested that patch and it allowed me to add this to the device dict, and it worked and avoids the "extra prompt" problem.
{% set interrupt_ctrl_list = ['c'] %}
Kevin
[1] diff --git a/lava_scheduler_app/tests/device-types/base.jinja2 b/lava_scheduler_app/tests/device-types/base.jinja2 index 13b053080f45..798545591dd2 100644 --- a/lava_scheduler_app/tests/device-types/base.jinja2 +++ b/lava_scheduler_app/tests/device-types/base.jinja2 @@ -66,6 +66,7 @@ constants: u-boot: interrupt-prompt: '{{ uboot_autoboot_prompt | default("Hit any key to stop autoboot") }}' interrupt-character: '{{ uboot_interrupt_character | default(" ") }}' + interrupt_ctrl_list: {{ interrupt_ctrl_list | default([]) }} final-message: '{{ uboot_final_message | default("Starting kernel") }}' error-messages: - 'Resetting CPU'
On 31 May 2018 at 21:19, Kevin Hilman khilman@baylibre.com wrote:
On Thu, May 31, 2018 at 9:53 AM, Kevin Hilman khilman@baylibre.com wrote:
On Thu, May 31, 2018 at 8:38 AM, Neil Williams neil.williams@linaro.org
wrote:
[...]
It's just possible that the interrupt_char could be causing this
problem,
you've got a newline set - that could end up being carried over into the shell and U-Boot has a nasty habit of repeating the last command if a newline is entered with no other content.
It's not only just possible, that was it! :) Simple fix[1] makes it happy again, and no boot_character_delay needed. I'll submit a patch shortly.
I've got one more hiccup. I have a particular vendor uboot which requires either <enter> or Ctrl-C. No other chars work. :( The <enter> causes the "extra prompt problem", and I'm not seeing how to send ctrl chars with interrupt_char. I tried \x03 and LAVA didn't like that:
This has already been seen before, at some stage, because there's a relevant comment in boot/__init__.py:
# vendor u-boot builds may require one or more control characters
https://git.linaro.org/lava/lava.git/tree/lava_dispatcher/actions/boot/__ini...
It's just that the support has been implemented more frequently for iPXE.
Infrastructure error: Cannot parse job definition: unacceptable character #x0003: special characters are not allowed in "<unicode string>", position 2829
I see there is interrupt_ctrl_list, but that doesn't seem to be used by base.jinja2 for uboot. Is there any reason not to do this[1] so that interrupt_ctrl_list defaults to empty, but could be set from the device-type, or device dict? I tested that patch and it allowed me to add this to the device dict, and it worked and avoids the "extra prompt" problem.
{% set interrupt_ctrl_list = ['c'] %}
Kevin
[1] diff --git a/lava_scheduler_app/tests/device-types/base.jinja2 b/lava_scheduler_app/tests/device-types/base.jinja2 index 13b053080f45..798545591dd2 100644 --- a/lava_scheduler_app/tests/device-types/base.jinja2 +++ b/lava_scheduler_app/tests/device-types/base.jinja2 @@ -66,6 +66,7 @@ constants: u-boot: interrupt-prompt: '{{ uboot_autoboot_prompt | default("Hit any key to stop autoboot") }}' interrupt-character: '{{ uboot_interrupt_character | default(" ") }}'
- interrupt_ctrl_list: {{ interrupt_ctrl_list | default([]) }} final-message: '{{ uboot_final_message | default("Starting
kernel") }}' error-messages: - 'Resetting CPU'
Email line-mangling not withstanding, if you'd like to submit a review for that change, I don't see a problem. Empty is the correct default for it to let the interrupt character be used for other devices.
On 1 June 2018 at 08:17, Neil Williams neil.williams@linaro.org wrote:
On 31 May 2018 at 21:19, Kevin Hilman khilman@baylibre.com wrote:
On Thu, May 31, 2018 at 9:53 AM, Kevin Hilman khilman@baylibre.com wrote:
On Thu, May 31, 2018 at 8:38 AM, Neil Williams <
neil.williams@linaro.org> wrote:
[...]
It's just possible that the interrupt_char could be causing this
problem,
you've got a newline set - that could end up being carried over into
the
shell and U-Boot has a nasty habit of repeating the last command if a newline is entered with no other content.
It's not only just possible, that was it! :) Simple fix[1] makes it happy again, and no boot_character_delay needed. I'll submit a patch shortly.
I've got one more hiccup. I have a particular vendor uboot which requires either <enter> or Ctrl-C. No other chars work. :( The <enter> causes the "extra prompt problem", and I'm not seeing how to send ctrl chars with interrupt_char. I tried \x03 and LAVA didn't like that:
This has already been seen before, at some stage, because there's a relevant comment in boot/__init__.py:
# vendor u-boot builds may require one or more control characters
https://git.linaro.org/lava/lava.git/tree/lava_dispatcher/ actions/boot/__init__.py#n629
It's just that the support has been implemented more frequently for iPXE.
Infrastructure error: Cannot parse job definition: unacceptable character #x0003: special characters are not allowed in "<unicode string>", position 2829
I see there is interrupt_ctrl_list, but that doesn't seem to be used by base.jinja2 for uboot. Is there any reason not to do this[1] so that interrupt_ctrl_list defaults to empty, but could be set from the device-type, or device dict? I tested that patch and it allowed me to add this to the device dict, and it worked and avoids the "extra prompt" problem.
{% set interrupt_ctrl_list = ['c'] %}
Kevin
[1] diff --git a/lava_scheduler_app/tests/device-types/base.jinja2 b/lava_scheduler_app/tests/device-types/base.jinja2 index 13b053080f45..798545591dd2 100644 --- a/lava_scheduler_app/tests/device-types/base.jinja2 +++ b/lava_scheduler_app/tests/device-types/base.jinja2 @@ -66,6 +66,7 @@ constants: u-boot: interrupt-prompt: '{{ uboot_autoboot_prompt | default("Hit any key to stop autoboot") }}' interrupt-character: '{{ uboot_interrupt_character | default(" ") }}'
- interrupt_ctrl_list: {{ interrupt_ctrl_list | default([]) }} final-message: '{{ uboot_final_message | default("Starting
kernel") }}' error-messages: - 'Resetting CPU'
Email line-mangling not withstanding, if you'd like to submit a review for that change, I don't see a problem. Empty is the correct default for it to let the interrupt character be used for other devices.
Ideally with a change to https://git.linaro.org/lava/lava.git/tree/lava_scheduler_app/tests/test_uboo... and adding a suitable device dictionary Jinja2 file in https://git.linaro.org/lava/lava.git/tree/lava_scheduler_app/tests/devices to ensure that this support is retained. We're gradually simplifying the syntax, https://git.linaro.org/lava/lava.git/tree/lava_scheduler_app/tests/test_grub... is the simplest method.