Hi all,
I found an interesting health failure today on origen07
http://validation.linaro.org/lava-server/scheduler/job/35016/log_file
When you look at the log, you see that the board starts off at the u-boot prompt. It then tries to do a "reboot", which (obviously) fails. So naturally, it then does a hard reset, and this is where it does something very odd: It interrupts the boot and tries to boot the previously installed test image. I haven't yet looked at the dispatcher code to figure out why (that's my next job).
What then started alarm bells ringing was that I saw this:
1261680 bytes read
reading uInitrd
1532597 bytes read
reading board.dtb
** Unable to read "board.dtb" from mmc 0:5 **
So whatever the test image was, it was expecting a device tree blob. I would have assumed that the dtb gets installed during deploy_linaro_image(), since if there is one it should just be part of the test boot deployment.
So I looked at the log from the previous job:
http://validation.linaro.org/lava-server/scheduler/job/34938/log_file#entry2...
and sure enough, you'll see at that mark the same issue.
So there are two things:
1) There's some twisted logic in the dispatcher that's making it do odd things if it starts off in u-boot
2) Do we have an issue with dtbs not being handled properly by lava, or is it just that the hwpack was incomplete?
Thanks
Dave
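For anyone wanting to spot this failure in other job logs, here is a minimal sketch that scans log text for the U-Boot "Unable to read" line quoted above. The function name and the sample log are made up for illustration; only the format of the failure line is taken from the log in this thread, and the regular expression is an assumption based on that one line.

```python
import re

# Matches the failure line seen in the log above, e.g.:
#   ** Unable to read "board.dtb" from mmc 0:5 **
# The pattern is inferred from that single example line (assumption).
UNABLE_TO_READ = re.compile(r'\*\* Unable to read "([^"]+)" from (\S+) (\S+) \*\*')


def find_unreadable_files(log_text):
    """Return a (filename, device, partition) tuple for each read failure."""
    return [match.groups() for match in UNABLE_TO_READ.finditer(log_text)]


# Illustrative sample, not a full job log.
sample_log = '''reading board.dtb
** Unable to read "board.dtb" from mmc 0:5 **
'''

print(find_unreadable_files(sample_log))
```

A check like this only reports that the file was missing at boot time; it says nothing about whether the hwpack or the deployment step was at fault.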
Hi Dave,
On 10/10/2012 11:35 AM, Dave Pigott wrote:
So there are two things:
- There's some twisted logic in the dispatcher that's making it do odd things if it starts off in u-boot
- Do we have an issue with dtbs not being handled properly by lava, or is it just that the hwpack was incomplete?
Regarding the second point:
No dtb in the hwpack. I've created a very similar bug for another ci project: https://bugs.launchpad.net/linaro-ci/+bug/1064686
Probably "ubuntu packed kernels" is the only project (view) in jenkins handling the dtb in the right way.
Please note that the dtb must be properly described in the hwpack so that it can be picked up by LAVA / l-m-c. That is out of scope for bug 1064686.
Thanks, Andrey
On 10/10/2012 08:56 AM, Andrey Konovalov wrote:
Hi Dave,
On 10/10/2012 11:35 AM, Dave Pigott wrote:
Hi all,
I found an interesting health failure today on origen07
http://validation.linaro.org/lava-server/scheduler/job/35016/log_file
When you look at the log, you see that the board starts off at the u-boot prompt. It then tries to do a "reboot", which (obviously) fails. So naturally, it then does a hard reset, and this is where it does something very odd: It interrupts the boot and tries to boot the previously installed test image. I haven't yet looked at the dispatcher code to figure out why (that's my next job).
I'm not sure we can trust anything that occurred in this job file after the "deploy_linaro_image is finished with error". I think at this point the dispatcher is in an unknown state and doesn't know what it should be sending to the serial console.
In this case, it still tried to do the boot_linaro_image action. However, we didn't successfully deploy an image, so anything going wrong there probably can't be trusted. I would have guessed it would have found the DTB file, but I'm not sure that's worth digging too far into.
I think the real problem we see here is what you and I discussed on IRC earlier. There are certain actions in our job file that, if they fail, should be considered non-recoverable, i.e.:
* if deploy_linaro_image fails, then boot_linaro_image can't run
* if boot_linaro_image fails, lava_test_install can't run
* if lava_test_install fails - well, that's tricky, since it may have installed some of the tests we need but not all
I'm wondering if we need to spend some time trying to improve how actions relate to one another in code?
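To make the idea of actions depending on one another concrete, here is a rough sketch of a "generic" approach, assuming a simple prerequisite table. This is not dispatcher code: REQUIRES, run_job and run_action are hypothetical names, and only the three action names come from the list above. The lava_test_install partial-failure case is deliberately not handled.

```python
# Hypothetical prerequisite table: each action lists the actions that
# must have passed before it is allowed to run.
REQUIRES = {
    "deploy_linaro_image": [],
    "boot_linaro_image": ["deploy_linaro_image"],
    "lava_test_install": ["boot_linaro_image"],
}


def run_job(actions, run_action):
    """Run actions in order, skipping any whose prerequisites did not pass.

    run_action(name) is a caller-supplied callable returning True on success.
    Returns a dict mapping each action name to "pass", "fail" or "skip".
    """
    results = {}
    for name in actions:
        prerequisites = REQUIRES.get(name, [])
        # A prerequisite that failed, was skipped, or never ran blocks this action.
        if any(results.get(dep) != "pass" for dep in prerequisites):
            results[name] = "skip"
            continue
        try:
            ok = run_action(name)
        except Exception:
            ok = False
        results[name] = "pass" if ok else "fail"
    return results


# Simulate a job in which deployment fails and everything else would succeed.
outcome = run_job(
    ["deploy_linaro_image", "boot_linaro_image", "lava_test_install"],
    lambda name: name != "deploy_linaro_image",
)
print(outcome)
```

With a table like this, a failed deploy means boot_linaro_image is never sent to the serial console, which is the situation the job above ran into.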
Andy Doan andy.doan@linaro.org writes:
I'm wondering if we need to spend some time trying to improve how actions relate to one another in code?
Yes please. I don't know if we want to do something generic, or just ensure deployment failures raise CriticalError -- which IIUC means no further actions will be attempted.
Cheers, mwh
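A minimal sketch of the simpler option, assuming a runner that stops at the first critical failure. The exception classes below are local stand-ins defined for this example, not imports from lava-dispatcher, and run_actions is a hypothetical helper; the real dispatcher's handling may differ.

```python
class CriticalError(Exception):
    """Stand-in: the job cannot usefully continue after this."""


class OperationFailed(Exception):
    """Stand-in: this action failed, but later actions may still run."""


def deploy_linaro_image():
    # Simulated deployment failure; a real action would talk to the board.
    raise CriticalError("deploy_linaro_image failed")


def boot_linaro_image():
    return "booted"


def run_actions(actions):
    """Run actions in order, stopping at the first CriticalError."""
    log = []
    for action in actions:
        try:
            action()
            log.append((action.__name__, "pass"))
        except OperationFailed as exc:
            # Recoverable: record the failure and carry on.
            log.append((action.__name__, "fail: %s" % exc))
        except CriticalError as exc:
            # Non-recoverable: record it and attempt nothing further.
            log.append((action.__name__, "critical: %s" % exc))
            break
    return log


job_log = run_actions([deploy_linaro_image, boot_linaro_image])
print(job_log)
```

In this sketch boot_linaro_image is never attempted once deployment raises CriticalError, so the board is not driven from an unknown state.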
On 10/10/2012 04:17 PM, Michael Hudson-Doyle wrote:
Yes please. I don't know if we want to do something generic, or just ensure deployment failures raise CriticalError -- which IIUC means no further actions will be attempted.
CriticalError should at least fix the immediate problem.
Dave - you wanna take a stab at that for now, and we can do something more elaborate in the future?
On 10 Oct 2012, at 22:44, Andy Doan andy.doan@linaro.org wrote:
CriticalError should at least fix the immediate problem.
Dave - you wanna take a stab at that for now, and we can do something more elaborate in the future?
Yep. Will look at it today.
Thanks
Dave