Hey Guys,
I've noticed my own Panda/Android jobs were failing. ex: http://validation.linaro.org/lava-server/scheduler/job/17089
I've also noticed the official builds aren't working either: http://validation.linaro.org/lava-server/scheduler/job/17092
I've personally booted my image, so I think we have a problem. Is anyone aware of this or looking into it already?
-andy
Hi, Andy,
Can you check the system.tar.bz2? From the log, it seems the deployment finished but then reported that /system/bin/sh was not found.
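One way to make that check, as a minimal sketch: the snippet below lists the members of a bzip2-compressed tarball and reports whether a shell binary is among them. The function name `has_shell` and the member path `system/bin/sh` are assumptions for illustration; the actual layout inside system.tar.bz2 may differ.

```python
import tarfile


def _normalise(path):
    # Treat "./system/bin/sh" and "system/bin/sh" as the same member.
    return path[2:] if path.startswith("./") else path


def has_shell(tarball_path, member="system/bin/sh"):
    """Return True if the bzip2 tarball contains the given member path."""
    wanted = _normalise(member)
    with tarfile.open(tarball_path, "r:bz2") as archive:
        return any(_normalise(name) == wanted for name in archive.getnames())
```

If this returns False for the tarball a job deployed, the image itself is suspect; if it returns True, the problem is more likely in how the partition gets mounted at boot.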
On 31 March 2012 02:21, Andy Doan andy.doan@linaro.org wrote:
Hey Guys,
I've noticed my own Panda/Android jobs were failing. ex: http://validation.linaro.org/lava-server/scheduler/job/17089
I've also noticed the official builds aren't working either: http://validation.linaro.org/lava-server/scheduler/job/17092
I've personally booted my image, so I think we have a problem. Is anyone aware of this or looking into it already?
-andy
linaro-validation mailing list linaro-validation@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-validation
I am trying to reproduce the problem on my local LAVA setup. I will let you know about the progress soon.
/Chi Thu
On 31 March 2012 06:59, Spring Zhang spring.zhang@linaro.org wrote:
Hi, Andy,
Can you check the system.tar.bz2? From the log, it seems the deployment finished but then reported that /system/bin/sh was not found.
-- Best wishes, Spring Zhang
I found the problem. The init.rc in the ramdisk image (uInitrd) has changed. The lava-dispatcher patches the partition tables in the init.rc file. The partition tables have now moved to the init.partitions.rc file, which init.rc imports.
A fix for lava-dispatcher is in progress.
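To illustrate the kind of patching involved, here is a rough sketch, not the actual lava-dispatcher code. It assumes the partition entries are device paths of the form `mmcblk0pN`, and the function names and the partition mapping are hypothetical. It patches init.partitions.rc when the ramdisk has one and falls back to init.rc otherwise.

```python
import re
from pathlib import Path

# Hypothetical mapping from the partition numbers an image ships with to the
# numbers used on a test board; the real values are not given in this thread.
PARTITION_MAP = {2: 5, 3: 6}

_DEVICE = re.compile(r"(mmcblk0p)(\d+)\b")


def remap_partitions(rc_text, mapping=PARTITION_MAP):
    """Rewrite mmcblk0pN device names in one pass, so no number is mapped twice."""
    def swap(match):
        number = int(match.group(2))
        return match.group(1) + str(mapping.get(number, number))
    return _DEVICE.sub(swap, rc_text)


def pick_rc_file(ramdisk_root):
    """Prefer init.partitions.rc when present, otherwise use init.rc."""
    root = Path(ramdisk_root)
    candidate = root / "init.partitions.rc"
    return candidate if candidate.exists() else root / "init.rc"


def patch_ramdisk(ramdisk_root, mapping=PARTITION_MAP):
    """Patch the chosen rc file in place and return its path."""
    target = pick_rc_file(ramdisk_root)
    target.write_text(remap_partitions(target.read_text(), mapping))
    return target
```

Choosing the file at patch time means older images that still keep their mounts in init.rc would continue to work with the same code path.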
/Chi Thu
On 31 March 2012 09:41, Le.chi Thu le.chi.thu@linaro.org wrote:
I am trying to reproduce the problem on my local LAVA setup. I will let you know about the progress soon.
/Chi Thu
Here is the fix. I need help deploying it to production once it is approved.
https://code.launchpad.net/~le-chi-thu/lava-dispatcher/fix-android-boot-prob...
/Chi Thu
On 31 March 2012 10:08, Le.chi Thu le.chi.thu@linaro.org wrote:
I found the problem. The init.rc in the ramdisk image (uInitrd) has changed. The lava-dispatcher patches the partition tables in the init.rc file. The partition tables have now moved to the init.partitions.rc file, which init.rc imports.
A fix for lava-dispatcher is in progress.
/Chi Thu
On Sat, 31 Mar 2012 11:50:22 +0200, "Le.chi Thu" le.chi.thu@linaro.org wrote:
Here is the fix. I need help deploying it to production once it is approved.
https://code.launchpad.net/~le-chi-thu/lava-dispatcher/fix-android-boot-prob...
I've deployed this now (as well as my scheduler "kill long running jobs" fix).
Cheers, mwh
On Sat, Mar 31, 2012 at 10:08:27AM +0200, Le.chi Thu wrote:
I found the problem. The init.rc in the ramdisk image (uInitrd) has changed. The lava-dispatcher patches the partition tables in the init.rc file. The partition tables have now moved to the init.partitions.rc file, which init.rc imports.
We need to coordinate things better, IMO. AFAIK Zach and his team knew that he was changing this file, so if he had known that this change requires coordination with the LAVA team, he probably would have coordinated.
May I propose that we keep a list of the files we patch/touch and share it with Zach, so he can remember to coordinate such a change with the LAVA team?
This way we can avoid the LAVA team finding out about this through job bustage :).
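Such a list could also be checked mechanically. The sketch below is only an illustration: the constant and function names are hypothetical, and the list holds just the two files named in this thread. Given the paths touched by a change, it returns the ones that LAVA is coupled to.

```python
from fnmatch import fnmatch

# Files this thread identifies as patched by lava-dispatcher. Any further
# entries would have to come from the LAVA team.
LAVA_COUPLED_FILES = ["init.rc", "init.partitions.rc"]


def coupled_changes(changed_paths, patterns=LAVA_COUPLED_FILES):
    """Return the changed paths whose file name matches a coupled pattern."""
    hits = []
    for path in changed_paths:
        basename = path.rsplit("/", 1)[-1]
        if any(fnmatch(basename, pattern) for pattern in patterns):
            hits.append(path)
    return hits
```

A non-empty result would be the cue to give the LAVA team a heads up before the change lands.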
Notification would have been nice, but it would be even better if we didn't have to modify this file at all. Long term, that will be possible through the use of this dual SD device, but I think there are good possibilities for a short-term solution. We don't have to deal with this sort of thing on Ubuntu because Ubuntu allows us to use named partitions. I was really hoping that Android wouldn't still be using hardcoded partitions after all this time. Is there really nothing that can be done there? I suspect it's not just LAVA that this causes trouble with.
The other possibility for a near-term solution, which I think we should pursue regardless of whether the hardcoded partitions are fixed, is the idea I discussed recently of having a very slim initramfs-based master image that can boot completely off of the shared boot/testboot partition. This would mean that we no longer have to make any changes to the mounts in Android, and it would simultaneously solve Zygmunt's issue with corruption on hard reboots.
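As a small illustration of why named partitions avoid this class of problem, here is a sketch that resolves a filesystem label to a device node. It relies on the `/dev/disk/by-label` symlinks that udev maintains on typical Linux systems; the function name is hypothetical, and whether the Android images could use labels this way is not established in this thread.

```python
import os


def device_for_label(label, by_label_dir="/dev/disk/by-label"):
    """Resolve a filesystem label to its device node, or None if absent."""
    link = os.path.join(by_label_dir, label)
    if not os.path.lexists(link):
        return None
    # The by-label entries are symlinks to the real device nodes.
    return os.path.realpath(link)
```

With a lookup like this, the partition number can change between images without anything that mounts by label needing a patch.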
Thanks, Paul Larson
On Mar 31, 2012 6:43 AM, "Alexander Sack" asac@linaro.org wrote:
We need to coordinate things better, IMO. AFAIK Zach and his team knew that he was changing this file, so if he had known that this change requires coordination with the LAVA team, he probably would have coordinated.
May I propose that we keep a list of the files we patch/touch and share it with Zach, so he can remember to coordinate such a change with the LAVA team?
This way we can avoid the LAVA team finding out about this through job bustage :).
-- Alexander Sack asac@linaro.org Technical Director, Linaro Platform Teams http://www.linaro.org | Open source software for ARM SoCs http://twitter.com/#%21/linaroorg - http://www.linaro.org/linaro-blog
On 3/31/12 6:43 AM, Alexander Sack wrote:
May I propose that we keep a list of the files we patch/touch and share it with Zach, so he can remember to coordinate such a change with the LAVA team?
This way we can avoid the LAVA team finding out about this through job bustage :).
I'm a +1 for this and saw Paul's response from the technical standpoint. However, I think we also need to take a step back and ask why it took _me_ to identify this was happening.
I only uncovered this because I couldn't get my benchmark jobs to execute. The panda tracking build has been broken since build 229 (8 days ago) and all of the 12.03 builds failed to actually run in LAVA.
In other words, are we really submitting LAVA jobs and not caring about the results?
-andy
On Sat, Mar 31, 2012 at 8:51 PM, Andy Doan andy.doan@linaro.org wrote:
I'm a +1 for this and saw Paul's response from the technical standpoint. However, I think we also need to take a step back and ask why it took _me_ to identify this was happening.
I only uncovered this because I couldn't get my benchmark jobs to execute. The panda tracking build has been broken since build 229 (8 days ago) and all of the 12.03 builds failed to actually run in LAVA.
In other words, are we really submitting LAVA jobs and not caring about the results?
That's how it would appear. My assumption has been that those jobs coming from the android ci system are looked at when the android team makes releases. One of the things I'm looking to do on the QA team is to start looking at manual and automated tests together, while starting to transition some of the manual tests to run in lava instead.
LAVA is quite capable of catching things like this, but someone has to care for the results and notice the breakage to make it useful.
Thanks, Paul Larson
On 31 March 2012 20:51, Andy Doan andy.doan@linaro.org wrote:
I'm a +1 for this and saw Paul's response from the technical standpoint. However, I think we also need to take a step back and ask why it took _me_ to identify this was happening.
I only uncovered this because I couldn't get my benchmark jobs to execute. The panda tracking build has been broken since build 229 (8 days ago) and all of the 12.03 builds failed to actually run in LAVA.
In other words, are we really submitting LAVA jobs and not caring about the results?
Since LAVA:
1. Can't reliably boot all the builds in all configurations
2. Doesn't use linaro-android-media-create (which we tell users to use)
3. Doesn't use the right bootloaders
We've always hand tested our builds to ensure they work. Until LAVA:
1. Can program a build in the same manner we tell users to
2. Doesn't assume anything about the target, like it even booting
We have to keep hand testing.
On 04/01/2012 08:26 PM, Zach Pfeffer wrote:
Since LAVA:
- Can't reliably boot all the builds in all configurations
- Doesn't use linaro-android-media-create (which we tell users to use)
- Doesn't use the right bootloaders
We've always hand tested our builds to ensure they work. Until LAVA:
- Can program a build in the same manner we tell users to
- Doesn't assume anything about the target, like it even booting
We have to keep hand testing.
I think even if LAVA were perfect, hand testing is still required. And I won't (in this thread) debate the limitations you're bringing up.
In my case, LAVA has been working pretty reliably for Panda for about 4 months now (at least for my benchmark jobs). When I saw it broken, I pushed the issue and the team found a fix pretty quickly. So shouldn't we have someone paying attention to at least the Panda builds and raising an issue when they trend from mostly working to completely broken?
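The "mostly working to completely broken" trend could be spotted automatically. This is a minimal sketch that assumes job outcomes are available as an ordered list of booleans; the function names, window size, and threshold are arbitrary illustration values, not anything LAVA provides.

```python
def pass_rate(results):
    """Fraction of passing entries in a list of job outcomes (True = passed)."""
    return sum(results) / len(results) if results else 0.0


def looks_broken(history, recent=5, healthy=0.7):
    """Flag a board whose recent jobs all failed after a mostly passing past.

    history is ordered oldest to newest.
    """
    if len(history) <= recent:
        return False  # not enough data to compare two periods
    earlier, latest = history[:-recent], history[-recent:]
    return pass_rate(earlier) >= healthy and not any(latest)
```

Requiring a healthy earlier period keeps a board that has never worked reliably from raising the same alarm as a fresh breakage.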
On 1 April 2012 22:07, Andy Doan andy.doan@linaro.org wrote:
I think even if LAVA were perfect, hand testing is still required. And I won't (in this thread) debate the limitations you're bringing up.
In my case, LAVA has been working pretty reliably for Panda for about 4 months now (at least for my benchmark jobs). When I saw it broken, I pushed the issue and the team found a fix pretty quickly. So shouldn't we have someone paying attention to at least the Panda builds and raising an issue when they trend from mostly working to completely broken?
Yeah, Panda's been pretty good. I think monitoring the builds fits pretty squarely in the new QA group's area. Paul, perhaps you can add Android LAVA health to your daily checklist.
On Sun, Apr 1, 2012 at 10:24 PM, Zach Pfeffer zach.pfeffer@linaro.org wrote:
Yeah, Panda's been pretty good. I think monitoring the builds fits pretty squarely in the new QA group's area. Paul, perhaps you can add Android LAVA health to your daily checklist.
I think that's similar to what I suggested earlier in this thread when I said:
That's how it would appear. My assumption has been that those jobs coming from the android ci system are looked at when the android team makes releases. One of the things I'm looking to do on the QA team is to start looking at manual and automated tests together, while starting to transition some of the manual tests to run in lava instead.
However, I'm a bit surprised to learn that this isn't already part of the process. I was always under the impression that all this work we did for getting results from lava into your build pages in a way that could be displayed the way you wanted them was so that they *would* be looked at as part of the testing and release process, and that the goal of the work yongqin has been doing to push more and more automated tests into lava-android-test was to grow the automation and reduce the manual effort for testing builds. This really seems like the only sane option, considering there are 12 builds on android alone to test!
What you are suggesting now sounds as if none of it was even worth our time until we get the hardware dongle. I don't think it has to be an all-or-nothing approach, though. I think LAVA can provide quite a bit of usefulness in its current form, and even testing with the hardware dongle is likely to break from time to time.
As a place to get started, how about if we add a lava-0xbench, lava-busybox, lava-cts,... test tag in your spreadsheet where the results of each of those would get logged? At least that way we know if someone looked at it or not.
Thanks, Paul Larson
I do not think LAVA can or will replace all manual test efforts. One of the benefits of LAVA is that it provides regression tests together with continuous integration. But if no one monitors and reacts to the output of the regression tests, we lose that benefit completely.
One thing we must make sure of is that the regression tests always run and pass on good builds, which is not the case today. Then, when the tests fail, it is much easier for LAVA users to focus on investigating why the tests or LAVA failed.
BR
/Chi Thu
On 2 April 2012 05:07, Andy Doan andy.doan@linaro.org wrote:
I think even if LAVA were perfect, hand testing is still required. And I won't (in this thread) debate the limitations you're bringing up.
In my case, LAVA has been working pretty reliably for Panda for about 4 months now (at least for my benchmark jobs). When I saw it broken, I pushed the issue and the team found a fix pretty quickly. So shouldn't we have someone paying attention to at least the Panda builds and raising an issue when they trend from mostly working to completely broken?
On Sun, Apr 1, 2012 at 3:51 AM, Andy Doan andy.doan@linaro.org wrote:
I'm a +1 for this and saw Paul's response from the technical standpoint. However, I think we also need to take a step back and ask why it took _me_ to identify this was happening.
I only uncovered this because I couldn't get my benchmark jobs to execute. The panda tracking build has been broken since build 229 (8 days ago) and all of the 12.03 builds failed to actually run in LAVA.
In other words, are we really submitting LAVA jobs and not caring about the results?
That's the case, yes. The point is that the general flakiness makes it hard to get folks to be strict about not getting green on LAVA...
so chicken, egg.
On 31 March 2012 06:43, Alexander Sack asac@linaro.org wrote:
We need to coordinate things better, IMO. AFAIK Zach and his team knew that he was changing this file, so if he had known that this change requires coordination with the LAVA team, he probably would have coordinated.
I would have given people a heads up, but I never would have expected LAVA to have a dependency on an init script.
May I propose that we keep a list of the files we patch/touch and share it with Zach, so he can remember to coordinate such a change with the LAVA team?
This way we can avoid the LAVA team finding out about this through job bustage :).
We can coordinate, but the fundamental issue is that LAVA shouldn't have these dependencies in the first place; it should treat the unit as a black box and be able to boot anything, in any configuration.
On Mon, Apr 2, 2012 at 3:21 AM, Zach Pfeffer zach.pfeffer@linaro.org wrote:
I would have given people a heads up, but I never would have expected LAVA to have a dependency on an init script.
Right. That's what I basically said, yes. If you were super brilliant you would have remembered that we patch the partitions etc. ... but as I said above, we need to establish a better way to track which files are currently tightly coupled to the LAVA setup so we have a chance to establish a process around that ...
On top of that, every build after a change lands should succeed in LAVA, especially if the previous one did ... we need to embed monitoring the results you see, and following up on them, more deeply in your team's process/mindset.
AFAIK you can resubmit builds manually, so establishing a policy that stops everything if booting a tip build fails, and investigates that first, could probably be done even in the current situation.
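A policy like that needs to know where the breakage started. As a rough sketch, assuming boot results are available as (build number, booted) pairs ordered oldest first, the hypothetical helper below returns the build at which the current run of failures began.

```python
def first_regression(builds):
    """Return the build number where the current failure streak began.

    builds is a list of (build_number, booted) pairs, oldest first.
    Returns None when there are no builds or the tip build boots.
    """
    if not builds or builds[-1][1]:
        return None
    start = builds[-1][0]
    # Walk back from the tip until the most recent build that booted.
    for number, booted in reversed(builds):
        if booted:
            break
        start = number
    return start
```

A non-None result for the tip build would be the signal to stop and investigate, and the returned number points at the change window to look in first.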
On 2 April 2012 06:26, Alexander Sack asac@linaro.org wrote:
Right. That's what I basically said, yes. If you were super brilliant you would have remembered that we patch the partitions etc. ... but as I said above, we need to establish a better way to track which files are currently tightly coupled to the LAVA setup so we have a chance to establish a process around that ...
On top of that, every build after a change lands should succeed in LAVA, especially if the previous one did ... we need to embed monitoring the results you see, and following up on them, more deeply in your team's process/mindset.
AFAIK you can resubmit builds manually, so establishing a policy that stops everything if booting a tip build fails, and investigates that first, could probably be done even in the current situation.
Well, we do this today - stop everything and fix things. We just do it based on our manual test results, since LAVA results have not been reliable. LAVA is, after all, only a tool, and if it doesn't work we can't let it get in the way of good process, which is to stop everything and fix stuff when it's broken. LAVA has come a long way, but we still have a ways to go before we can trust the results.
All that said, we're all in the same platform boat so we can fix LAVA to the point we can trust it and then we can rock even harder - and then we can re-enable our premerge loop and not break the build. Woot.