Hello,
On Fri, 5 Apr 2013 17:45:07 +0300 Paul Sokolovsky Paul.Sokolovsky@linaro.org wrote:
[]
I know that Matt tried to do a bunch of builds and they were failing for one reason or another. The best thing I can do besides re-enabling daily builds is to launch a build of a previous known-good revision on a couple of boards in parallel (to insure against possible hw issues on a particular board). Hopefully that will re-establish the baseline, so let's check the builds on Monday.
Ok, so here are these 2 builds:
gcc-4.8~svn196132
panda-es02 https://validation.linaro.org/lava-server/scheduler/job/50993
panda-es05 https://validation.linaro.org/lava-server/scheduler/job/50994
#50993 went well.
#50994 started getting invalid data midway through compilation (grep for "is not valid in preprocessor expressions"), then hit a kernel fault, then got caught in a reboot loop, apparently due to:
[ 6.631256] thermal_init_thermal_state: Getting initial temp for cpu domain
[ 6.638702] thermal_request_temp
[ 6.642150] omap_fatal_zone:FATAL ZONE (hot spot temp: 128490)
- I have seen all these behaviors before (though previously I didn't see such explicit messages from the kernel that these are thermal faults).
So, CBuild/LAVA can do (successful) builds, but some builds fail due to thermal issues. Actually, let me load up all the boards with the same build now, to assess the thermal failure rate more scientifically.
I'm not sure what's wrong with Matt's Friday builds, none of which succeeded; I'll look into that on Monday.
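As a side note, the two failure signatures described above (the corrupted-data preprocessor error and the OMAP thermal shutdown) can be checked in one pass over a saved job log. This is only an illustrative sketch: the `triage_log` helper is not part of CBuild/LAVA, and the log path is whatever the job capture produced.

```shell
# Illustrative helper, not part of CBuild/LAVA: scan a saved job log for
# the two failure signatures discussed above (the cpp corruption message
# and the OMAP thermal shutdown).
triage_log() {
    # $1: path to a captured job log
    grep -n \
        -e "is not valid in preprocessor expressions" \
        -e "omap_fatal_zone" \
        "$1"
}
```

Running it against a job's log would print the matching lines with line numbers, or nothing if neither signature is present.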
Hello,
On Sat, 6 Apr 2013 15:40:15 +0300 Paul Sokolovsky Paul.Sokolovsky@linaro.org wrote:
[]
So, CBuild/LAVA can do (successful) builds, but some builds fail due to thermal issues. Actually, let me load up all the boards with the same build now, to assess the thermal failure rate more scientifically.
Well, let's count:
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... OK
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... Failed, thermal (see above)
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... "../../../gcc-4.8~svn196132/gcc/gengtype.c:4106:39: error: cannot convert 'flisT*' to 'flist*' in assignment" - a flipped bit in a file (saw that before, including on a local Panda), then other random failures.
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... "rsync: getaddrinfo: toolchain64 2000: Temporary failure in name resolution" - network flip
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... OK
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... Lot of "malloc: ../bash/jobs.c:743: assertion botched"
https://validation.linaro.org/lava-server/dashboard/streams/anonymous/cbuild... Reboot during configure, then reboot cycle.
So, out of 7 builds, only 2 were successful; the rest failed due to thermal issues (apparently all except the one with network issues). We can try to consider why LAVA-based builds have such a low yield rate compared to native CBuild builds (one explanation is that LAVA does pretty heavy lifting to install the OS, etc., so by the time the build starts the CPU is already pretty hot), but it's clear that using a kernel with voltage/frequency scaling disabled simply doesn't work that well for *builds* (vs benchmarks).
So, we can consider preparing an OS image which uses the same basic OS but a normal kernel, and using that for builds. Would TCWG be able to prepare such an image? If not, we can add it to the task list for the (now combined) LAVA/Infra team, but with all the other tasks we have, it may take some time to get to.
Paul,
You suggest using different configurations for doing the "build the kernel" validation and the benchmarking? Why wouldn't we run into the same heat issues while benchmarking?
Can we somehow put the SoC into a very low power state for 2 minutes to cool down, and simply wait before starting the build/test?
Anyway, I assume that the strict environment requirement, where the toolchain WG needs to sign off on the setup, would mostly apply to benchmarking only, and that we could probably choose any rock-solid stable image for doing the build validation of the toolchain. Matt?
If so, it sounds sensible to just pick a recent release with thermal enabled for the build job and use the special configuration for the benchmarking parts - maybe with a cooling step as above.
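The cooling step suggested above could in principle be scripted through the standard Linux cpufreq sysfs interface. This is only a sketch: the helper name is made up, the `powersave` governor is assumed to be available on the image, and `CPUFREQ_ROOT` is parameterized so the logic can be dry-run against a scratch directory instead of real hardware; the 2-minute figure is the one from the question above.

```shell
# Sketch of a pre-build cooldown, assuming the standard cpufreq sysfs
# layout; CPUFREQ_ROOT is parameterized so the logic can be dry-run
# against a scratch directory instead of real hardware.
CPUFREQ_ROOT=${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpu0/cpufreq}
COOLDOWN_SECS=${COOLDOWN_SECS:-120}

cooldown() {
    # Remember the current governor, idle at the lowest power state for
    # the cooldown period, then restore the previous governor.
    prev=$(cat "$CPUFREQ_ROOT/scaling_governor")
    echo powersave > "$CPUFREQ_ROOT/scaling_governor"
    sleep "$COOLDOWN_SECS"
    echo "$prev" > "$CPUFREQ_ROOT/scaling_governor"
}
```

Whether two idle minutes actually shed enough heat before a 10-hour build is, as discussed below, an open question.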
On Mon, Apr 8, 2013 at 7:56 PM, Paul Sokolovsky paul.sokolovsky@linaro.org wrote:
[]
-- Best Regards, Paul
Linaro.org | Open source software for ARM SoCs Follow Linaro: http://www.facebook.com/pages/Linaro http://twitter.com/#%21/linaroorg - http://www.linaro.org/linaro-blog
On Tue, 9 Apr 2013 01:20:02 +0200 Alexander Sack asac@linaro.org wrote:
Paul,
You suggest using different configurations for doing the "build the kernel"
I'm sure it's just a typo, but "build gcc/binutils" to be exact.
validation and the benchmarking? Why wouldn't we run into the same heat issues while benchmarking?
Well, I'm just trying to brainstorm how we can make CBuild/LAVA actually useful, not something which takes 10 hrs just to finish with a failure in >50% of cases. So, just common sense: benchmarking will still be affected, but at least builds won't be. Also, a gcc build takes ~10 hrs, so there's much more chance of it running into thermal issues than a benchmark (I don't remember exactly, but a benchmark run should be just 1-2 hrs).
Can we somehow put the SoC into a very low power state for 2 minutes to cool down, and simply wait before starting the build/test?
That's something I have no idea about (well, I'm sure we can; I'm just not sure the next 2 minutes won't heat it right back up).
Anyway, I assume that the strict environment requirement, where the toolchain WG needs to sign off on the setup, would mostly apply to benchmarking only, and that we could probably choose any rock-solid stable image for doing the build validation of the toolchain. Matt?
It's trickier than that. There has always been a known stable image available as an alternative during testing, but a gcc built on it cannot later be run on the official TCWG image due to missing lib dependencies. I.e., it should be not just any image, but one pretty close to TCWG's. Also, the current TCWG image was received in ready form from Michael Hope, and I'm personally not sure what's inside, so some effort would be needed to figure that out.
If so, it sounds sensible to just pick a recent release with thermal enabled for the build job and use the special configuration for the benchmarking parts - maybe with a cooling step as above.
On Mon, Apr 8, 2013 at 7:56 PM, Paul Sokolovsky paul.sokolovsky@linaro.org wrote:
[]
On 9 April 2013 13:18, Paul Sokolovsky paul.sokolovsky@linaro.org wrote:
Well, I'm just trying to brainstorm how we can make CBuild/LAVA actually useful, not something which takes 10 hrs just to finish with a failure in >50% of cases. So, just common sense: benchmarking will still be affected, but at least builds won't be. Also, a gcc build takes ~10 hrs, so there's much more chance of it running into thermal issues than a benchmark (I don't remember exactly, but a benchmark run should be just 1-2 hrs).
It sounds like something that should be solved in system software - do you happen to know which cpufreq governor is being used? I know the Panda ES seemed to be having problems with the ondemand governor. We could try switching to the userspace cpufreq governor and underclocking a bit (not very scientific) to see if that helps:
http://processors.wiki.ti.com/index.php/OMAP-L1_Linux_Drivers_Usage#Power_Ma...
I guess we wouldn't want to do any frequency scaling for benchmarks but that doesn't seem to be an issue.
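Switching to the userspace governor and underclocking, as suggested above, would go through the standard cpufreq sysfs files. The sketch below is illustrative only: the helper name is made up, `CPUFREQ_ROOT` is parameterized so the write logic can be exercised against a scratch directory, and on a real board the target value would need to be one of the board's advertised frequencies (800000 kHz is just an example) with root privileges.

```shell
# Sketch of pinning a fixed, underclocked frequency via the userspace
# cpufreq governor; CPUFREQ_ROOT is parameterized so the write logic
# can be exercised against a scratch directory instead of real sysfs.
CPUFREQ_ROOT=${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpu0/cpufreq}

pin_frequency() {
    # $1: target frequency in kHz; on a real board it should be one of
    # the values listed in scaling_available_frequencies.
    echo userspace > "$CPUFREQ_ROOT/scaling_governor" &&
    echo "$1" > "$CPUFREQ_ROOT/scaling_setspeed"
}
```

E.g. `pin_frequency 800000` would hold the CPU at 800 MHz for the duration of a build.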
On 09/04/13 00:20, Alexander Sack wrote:
Paul,
You suggest using different configurations for doing the "build the kernel" validation and the benchmarking? Why wouldn't we run into the same heat issues while benchmarking?
Can we somehow put the SoC into a very low power state for 2 minutes to cool down, and simply wait before starting the build/test?
Or we could put a heatsink/active cooling on the PandaES boards?
As Paul says elsewhere, we hammer the boards continuously for a number of hours; being cool before we start isn't going to help much.
Anyway, I assume that the strict environment requirement, where the toolchain WG needs to sign off on the setup, would mostly apply to benchmarking only, and that we could probably choose any rock-solid stable image for doing the build validation of the toolchain. Matt?
Yes.
My requirements for the kernel/filesystem image used for building are: a) it works; b) it doesn't change regularly :-).
For benchmarking we need to use a known good (and special) kernel and filesystem which changes irregularly.
Thanks,
Matt
On Tue, Apr 9, 2013 at 2:47 PM, Matthew Gretton-Dann <matthew.gretton-dann@linaro.org> wrote:
[]
My requirements for the kernel/filesystem image used for building are: a) it works; b) it doesn't change regularly :-).
I believe point b) is about the requirement of minimal noise caused by false negatives. With that, I believe it is as I said: we can use whatever we feel is stable, as long as we protect you from noise when we decide to change stuff...
For benchmarking we need to use a known good (and special) kernel and filesystem which changes irregularly.
-- Matthew Gretton-Dann Toolchain Working Group, Linaro
I'm running these builds on my Pandas in my local lab. I can confirm it is pushing the OMAP4 to its thermal limit on both my boards. With DVFS enabled (i.e. not the performance governor) it should scale back the voltage and frequency to deal with the heat. However, since there is a stability issue with DVFS, I am going to manually lower the clock speed a step, from 1000 MHz to 800 MHz, with the performance governor and see if that deals with some of the heat.
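A sketch of what that manual cap could look like through the cpufreq sysfs interface, assuming the kernel exposes `scaling_max_freq`; the helper name is made up, and `CPUFREQ_ROOT` is parameterized so the logic can be dry-run against a scratch directory rather than real hardware.

```shell
# Sketch: keep the performance governor (no DVFS transitions) but lower
# the frequency ceiling it runs at. CPUFREQ_ROOT is parameterized so
# the logic can be dry-run against a scratch directory.
CPUFREQ_ROOT=${CPUFREQ_ROOT:-/sys/devices/system/cpu/cpu0/cpufreq}

cap_max_freq() {
    # $1: new ceiling in kHz, e.g. 800000 for 800 MHz
    echo performance > "$CPUFREQ_ROOT/scaling_governor" &&
    echo "$1" > "$CPUFREQ_ROOT/scaling_max_freq"
}
```

This keeps the deterministic behavior of the performance governor while trading peak speed for headroom under the thermal limit.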
On 9 April 2013 06:52, Alexander Sack asac@linaro.org wrote:
[]
-- Alexander Sack Director, Linaro Platform Engineering http://www.linaro.org | Open source software for ARM SoCs http://twitter.com/#%21/linaroorg - http://www.linaro.org/linaro-blog
linaro-validation mailing list
linaro-validation@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-validation