Andy Doan andy.doan@linaro.org writes:
On 10/10/2012 08:56 AM, Andrey Konovalov wrote:
Hi Dave,
On 10/10/2012 11:35 AM, Dave Pigott wrote:
Hi all,
I found an interesting health failure today on origen07
http://validation.linaro.org/lava-server/scheduler/job/35016/log_file
When you look at the log, you see that the board starts off at the u-boot prompt. It then tries to do a "reboot", which (obviously) fails. So naturally, it then does a hard reset, and this is where it does something very odd: It interrupts the boot and tries to boot the previously installed test image. I haven't yet looked at the dispatcher code to figure out why (that's my next job).
I'm not sure we can trust anything that occurred in this job file after the "deploy_linaro_image is finished with error". I think at this point the dispatcher is in an unknown state and doesn't know what it should be sending to the serial console.
In this case, it still tried to do the boot_linaro_image action. However, we didn't successfully deploy an image, so anything going wrong there probably can't be trusted. I would have guessed it would have found the DTB file, but I'm not sure that's worth digging too far into.
I think the real problem we see here is what you and I discussed on IRC earlier. There are certain actions in our job file whose failure should be considered non-recoverable, i.e.:
- if deploy_linaro_image fails, then boot_linaro_image can't run.
- if boot_linaro_image fails, lava_test_install can't run.
- if lava_test_install fails - well, that's tricky since it may have
installed some of the tests we need but not all.
I'm wondering if we need to spend some time trying to improve how actions relate to one another in code?
Yes please. I don't know if we want to do something generic, or just ensure deployment failures raise CriticalError -- which IIUC means no further actions will be attempted.
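A minimal sketch of the second option, to make the intended behaviour concrete. Everything here is an assumption for illustration: the `CriticalError` class is a local stand-in rather than the dispatcher's own, and `run_job` plus the three stub actions are hypothetical names, not the real dispatcher code. The idea is only that a critical failure ends the run and marks the remaining actions as skipped, while an ordinary failure is recorded and the job carries on.

```python
class CriticalError(Exception):
    """Stand-in (assumed): a failure after which later actions cannot run."""


def run_job(actions):
    """Run (name, callable) pairs in order and return (name, status) pairs."""
    results = []
    for index, (name, action) in enumerate(actions):
        try:
            action()
        except CriticalError:
            results.append((name, "fail"))
            # Nothing after a critical failure can be trusted: skip the rest.
            results.extend((later, "skipped") for later, _ in actions[index + 1:])
            break
        except Exception:
            # Ordinary failure: record it and continue with the next action.
            results.append((name, "fail"))
        else:
            results.append((name, "pass"))
    return results


# Hypothetical stubs standing in for the real job actions.
def deploy_linaro_image():
    raise CriticalError("deployment failed")


def boot_linaro_image():
    pass


def lava_test_install():
    pass


job = [
    ("deploy_linaro_image", deploy_linaro_image),
    ("boot_linaro_image", boot_linaro_image),
    ("lava_test_install", lava_test_install),
]
```

With this shape, a failed deploy would never reach boot_linaro_image, so the board would not be driven from an unknown state. A generic dependency mechanism could replace the hard stop later if something finer-grained turns out to be needed.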
Cheers, mwh