Overheating Pandas


Renato Golin

3 Jul 2013

1:13 p.m.

Hi Folks,

I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.

The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.

I personally think this is a hardware problem, since everything is on the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel started overheating (around 486DX66) and the die was huge (more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.

It's true that gates are far smaller today, but it's not true that a dual core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.

Manufacturers only got away with it, so far, because people rarely use 100% of the CPU power for extended periods of time, because ARM devices end up as set-top boxes, mobile phones and tablets. However, even those devices will heat up when playing 2 h films or games, and they do have some form of heat sink.

We, at the toolchain group, make things worse by using 100% CPU, 24 / 7, something that Panda boards or Arndales were not designed to do. However, with ARM moving into the server space, their designs will have to be re-thought, and what better place than Linaro for making sure we get it right?

For the time being, I believe we *must* have air conditioning in the Lab all the time, and we *must* have heat-sinks on every board, and we *must* monitor the CPU temperature of the boards, at least until we're comfortable that they're not failing all the time.

Can we make a temperature monitor (like the one attached) a default feature on Linaro Ubuntu distributions? We could dump that info to the syslog/dmesg whenever it crosses the (say) 75% threshold, and report more often when it crosses 95%, possibly dumping the process(es) that are consuming the most CPU at the time, to enable post-mortem debugging.
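A minimal sketch of the threshold logic proposed here, in Python. It is not the attached monitor (whose contents are not shown in this thread), and it deliberately leaves out reading the sensor, since the sysfs path is board- and kernel-specific. The 75% and 95% thresholds come from the proposal; the polling intervals and the logger name `tempmon` are assumptions made up for illustration.

```python
import logging

# Thresholds from the proposal: start logging at 75%, log more often at 95%.
WARN_FRACTION = 0.75
CRITICAL_FRACTION = 0.95

# Hypothetical polling intervals in seconds; the thread does not specify any.
NORMAL_INTERVAL = 60
CRITICAL_INTERVAL = 5

log = logging.getLogger("tempmon")  # hypothetical logger name


def classify(temp_c, max_c):
    """Return 'ok', 'warn' or 'critical' for a reading against the allowed maximum."""
    fraction = temp_c / max_c
    if fraction >= CRITICAL_FRACTION:
        return "critical"
    if fraction >= WARN_FRACTION:
        return "warn"
    return "ok"


def report(temp_c, max_c):
    """Log the reading when it is above a threshold.

    Returns the number of seconds to wait before the next poll, so that
    readings above the critical threshold are reported more often.
    """
    level = classify(temp_c, max_c)
    percent = 100.0 * temp_c / max_c
    if level == "critical":
        log.error("CPU at %.1fC (%.0f%% of max %.1fC)", temp_c, percent, max_c)
        return CRITICAL_INTERVAL
    if level == "warn":
        log.warning("CPU at %.1fC (%.0f%% of max %.1fC)", temp_c, percent, max_c)
    return NORMAL_INTERVAL
```

To send these messages to syslog, one option would be to attach a `logging.handlers.SysLogHandler` to the logger; dumping the busiest processes at the critical level is left out of this sketch.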

cheers, --renato

As a side note, the quad-A9 ODroid does ship with a massive heat-sink, which also serves as a fancy case. Quite clever, really.

Attachments:

attachment.html (text/html — 2.7 KB)
monitor.bin (application/octet-stream — 325 bytes)


James Tunnicliffe

3 Jul

2:42 p.m.

I believe that in the LAVA lab there are a few pandas with USB keys that are used for builds to try and overcome some reliability problems. Don't know if it was a temperature problem or something else. With any luck someone who knows more about that issue can speak up and share what they found. You could also try running "stress --cpu 4 --vm 2" and see if any errors show. I find that on my desktop running 2x the number of CPU stress threads as I have CPUs is about right to eat all available resources. That will just stress RAM and CPU, not disk I/O, which should pinpoint the problem. Plenty of other options (http://www.hecticgeek.com/2012/11/stress-test-your-ubuntu-computer-with-stre...)...

Is running at 100% of the thermal limit really an issue? Isn't the point that it is the limit, which itself should have some safety built in? I don't know off hand if the OMAP 4 SoCs incorporate hardware frequency limiting or if it is entirely software, in which case the kernel frequency governor should (at a guess) be throttling back.

I did have a panda give up on me about a year ago. It wasn't being worked hard, but did refuse to get through a boot most of the time (it did power on and get part way through booting). Those boards aren't designed for high reliability and it may be that you just need to get a couple of replacements.

James


linaro-validation mailing list linaro-validation@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-validation

-- James Tunnicliffe

Renato Golin

3:32 p.m.

On 3 July 2013 15:42, James Tunnicliffe james.tunnicliffe@linaro.org wrote:

...

I believe that in the LAVA lab there are a few pandas with USB keys that are used for builds to try and overcome some reliability problems.

I'm using USB drives for that reason.

Is running at 100% of the thermal limit really an issue? Isn't the point that it is the limit, which itself should have some safety built in? I don't know off hand if the OMAP 4 SoCs incorporate hardware frequency limiting or if it is entirely software, in which case the kernel frequency governor should (at a guess) be throttling back.

That's what I thought, but apparently, both Panda and Panda ES on current Linaro Ubuntu 13.03 fail randomly with USB drives (SSD or HDD) after a few hours under constant load. That means it's impossible for me to use them for toolchain testing at all. Arndales have also given up after a few hours, though after the errata kernel patches it was a bit better.

The only board that hasn't failed yet is the Chromebook, which has clocked a solid 5-month period under intense load. Guess what? The Chromebook's A15, which is identical to the Arndale's, has a massive heat-sink almost the size of the laptop itself.

I did have a panda give up on me about a year ago. It wasn't being worked hard, but did refuse to get through a boot most of the time (it did power on and get part way through booting). Those boards aren't designed for high reliability and it may be that you just need to get a couple of replacements.

I have tried 5 different Pandas and all of them fail the same way. I don't think it's a matter of replacing the defective ones, but of trying a new board altogether...

cheers, --renato

Mans Rullgard

2:59 p.m.

On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:

...

Hi Folks,

I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.

The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.

I personally think this is a hardware problem, since everything is on the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel started overheating (around 486DX66) and the die was huge (more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.

It's true that gates are far smaller today, but it's not true that a dual core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.

Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates clocked at 6x the frequency and still using only marginally more power.

BTW, the RAM is a separate chip mounted on top of the SoC.

...

Manufacturers only got away with it, so far, because people rarely use 100% of the CPU power for extended periods of time, because ARM devices end up as set-top boxes, mobile phones and tablets. However, even those devices will heat up when playing 2 h films or games, and they do have some form of heat sink.

An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature. The higher frequencies are only meant to be used in conjunction with (software) thermal management to throttle back if temperature rises.

If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.

-- Mans Rullgard / mru

Renato Golin

3:48 p.m.

On 3 July 2013 15:59, Mans Rullgard mans.rullgard@linaro.org wrote:

...

Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates clocked at 6x the frequency and still using only marginally more power.

I don't remember the numbers exactly, but the DX Intel machines weren't that power-hungry. Here[1] I read they used 600mA on a 5V input, which gives you 3W consumption, and it already had a heat-sink. ;)
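For reference, the arithmetic behind that figure, using only the two numbers quoted from the datasheet (5V input, 600mA):

```latex
P = V \cdot I = 5\,\mathrm{V} \times 0.6\,\mathrm{A} = 3\,\mathrm{W}
```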

An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature.

Probably in Sweden, "room temperature" is -10... ;)

But running at 1.2GHz doesn't mean it will be using the whole system, RAM and GPU included, which, being on the same SoC, contribute to the overall temperature. I've seen some GPU errors in the syslog; I'm not sure whether they're related to the failures or caused by them.

...

If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.

I'd expect that Linaro's kernel on Ubuntu 13.03 already had a decent thermal control of the Panda. I can get the temperatures without special code, so I assume the kernel knows precisely what to do, and I also hope that the kernel can do scheduling, otherwise, what's the point of measuring temperatures...

But more to the point, I don't want to be scaled down when hot, I want it never to get hot in the first place, so I can run at full 1.2GHz, 24 / 7. If the scheduler reduces the frequency to decrease the temperature, I'll be testing more commits per run AND my benchmarks will be skewed, depending on room temperature, which is the same as to say they're not benchmarks at all.

cheers, --renato

[1] http://www.datasheetarchive.com/486+DX-datasheet.html

Mans Rullgard

4:22 p.m.

On 3 July 2013 16:48, Renato Golin renato.golin@linaro.org wrote:

...

On 3 July 2013 15:59, Mans Rullgard mans.rullgard@linaro.org wrote:

...
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature.

If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.

I'd expect that Linaro's kernel on Ubuntu 13.03 already had a decent thermal control of the Panda. I can get the temperatures without special code, so I assume the kernel knows precisely what to do, and I also hope that the kernel can do scheduling, otherwise, what's the point of measuring temperatures...

But more to the point, I don't want to be scaled down when hot, I want it never to get hot in the first place, so I can run at full 1.2GHz, 24 / 7. If the scheduler reduces the frequency to decrease the temperature, I'll be testing more commits per run AND my benchmarks will be skewed, depending on room temperature, which is the same as to say they're not benchmarks at all.

I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management. 1.4GHz and higher _does_ require active thermal management, and I would not assume that a random kernel has this feature enabled merely because it can report the temperature.

If you want to run benchmarks on this chip, you must do so at no higher than 1.2GHz. The chip is designed for phones/tablets where high CPU load typically only occurs in short bursts.

-- Mans Rullgard / mru

Renato Golin

4:41 p.m.

On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:

...

I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.

My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.

linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000

Now what?

cheers, --renato

James Tunnicliffe

4:50 p.m.

On 3 July 2013 17:41, Renato Golin renato.golin@linaro.org wrote:

...

On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:

...
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.

My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.

linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000

Now what?

Are you using the same set up as the LAVA lab in terms of OS, kernel, software versions? If the Cbuild/LAVA boards run reliably (and I don't think they have any direct cooling or a heatsink on them), then that is a useful place to start.

-- James Tunnicliffe

Mans Rullgard

5:08 p.m.

On 3 July 2013 17:41, Renato Golin renato.golin@linaro.org wrote:

...

On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:

...
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.

My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.

linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000

Now what?

4430 max frequency is 1.0GHz unless I'm mistaken. Either way, try reducing your clock to 1.0GHz and see what happens.

-- Mans Rullgard / mru

Renato Golin

7:37 p.m.

On 3 July 2013 18:08, Mans Rullgard mans.rullgard@linaro.org wrote:

...

4430 max frequency is 1.0GHz unless I'm mistaken. Either way, try reducing your clock to 1.0GHz and see what happens.

Yes, I meant 4430 and 4460 at their natural high frequencies.

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
350000 700000 920000 1200000

I'll set the max to 920MHz on the scaling and let's see how it goes...
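Stepping the cap down one notch at a time could be scripted roughly as follows. This is a sketch, not what was actually run on these boards: the helper names are made up, it assumes the standard cpufreq sysfs files (`scaling_available_frequencies` and `scaling_max_freq`, values in kHz), and writing the cap needs root.

```python
# Default cpufreq directory for the first core; assumed to match the
# path shown in the commands in this thread.
CPUFREQ_DIR = "/sys/devices/system/cpu/cpu0/cpufreq"


def parse_frequencies(text):
    """Parse the space-separated kHz values of scaling_available_frequencies."""
    return sorted(int(token) for token in text.split())


def next_lower(frequencies, current):
    """Return the highest available frequency below `current`, or None at the floor."""
    lower = [f for f in frequencies if f < current]
    return max(lower) if lower else None


def set_max_freq(khz, cpufreq_dir=CPUFREQ_DIR):
    """Write a new cap to scaling_max_freq (requires root; not called below)."""
    with open(f"{cpufreq_dir}/scaling_max_freq", "w") as handle:
        handle.write(str(khz))


if __name__ == "__main__":
    # Values as reported by the board in this thread.
    available = parse_frequencies("350000 700000 920000 1200000")
    print(next_lower(available, 1200000))
```

With the frequencies reported above, the step below 1200000 is 920000, which is the cap chosen here. On multi-core boards the other `cpuN/cpufreq` directories may need the same treatment, depending on how the kernel groups the cores.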

cheers, --renato

Richard Earnshaw

5:33 p.m.

On 03/07/13 17:41, Renato Golin wrote:

...

On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:

I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.

My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.

linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000

Now what?

Keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.

Remember that manufacturers match the form of packaging to the expected TDP of the intended usage environment (to keep product costs down). In a mobile part that probably means a relatively cheap plastic package, because a hot chip would burn a hole in your pocket -- literally. The package almost certainly doesn't have a high thermal conductivity from the chip to the external surface, so while a heat sink might help, it won't be as effective as with other packaging options.

Chips expected to dissipate large amounts of power normally have a metal pad on the package so that a heat sink with thermal grease will make a good thermal contact.

Renato Golin

7:45 p.m.

On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:

...

Chips expected to dissipate large amounts of power normally have a metal pad on the package so that a heat sink with thermal grease will make a good thermal contact.

This is a really good point. The heat-sink does get really hot, but what matters is not the final temperature but the speed at which the heat gets through the plastic bit to the heat-sink during peak usage, and plastic sucks at thermal conductivity.

Let's see how it behaves at 920MHz...

I wonder if the ODroid heat-sink, which is bigger than the board itself, is really that effective, or just more of a vanity item. The Arndale has a metallic case, and I could fit a north-bridge heat-sink on it, which is bigger than the RAM heat-sink I put on the Pandas, and after the errata fix, they did behave properly at full speed.

cheers, --renato

Mans Rullgard

9:20 p.m.

On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:

...

On 03/07/13 17:41, Renato Golin wrote:

...
On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:

I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.

My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.

linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000

Now what?

keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.

Remember that manufacturers match the form of packaging to the expected TDP of the intended usage environment (to keep product costs down). In a mobile part that probably means relatively cheap plastic package because a hot chip would burn a hole in your pocket -- literally. The package almost certainly doesn't have a high thermal conductivity from the chip to the external surface so while a heat sink might help, it won't be as effective as with other packaging options.

Chips expected to dissipate large amounts of power normally have a metal pad on the package so that a heat sink with thermal grease will make a good thermal contact.

The PoP RAM also complicates cooling.

-- Mans Rullgard / mru

Renato Golin

4 Jul

11:27 a.m.

On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:

...

keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.

It might be a bit too soon, but I just got a few 7h builds out of the boards at 920MHz without a single glitch, whereas before, they wouldn't run for more than 4h in a row. Both boards have been running non-stop since 8pm yesterday.

I'll keep them running during Connect in the exact same place as they are now (and were before), just to be sure, but I'm still betting that they cannot run at 1.2GHz on full steam for long periods without some serious cooling.

cheers, --renato

James Tunnicliffe

11:58 a.m.

On 4 July 2013 12:27, Renato Golin renato.golin@linaro.org wrote:

...

On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:

...
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.

It might be a bit too soon, but I just got a few 7h builds out of the boards at 920MHz without a single glitch, whereas before, they wouldn't run for more than 4hs in a row. Both boards are running non-stop since 8pm yesterday.

I'll keep them running during Connect on the exact same place as they are now (and were before), just to be sure, but I'm still betting that they cannot run at 1.2GHz on full steam for long periods without some serious cooling.

Faster clocks also drink more power - now that you are using slower clocks, those boards will be stressing the PSU less. There are plenty of other components on the board, any of which could be causing the problem, including the peripherals that you plugged in.

If you don't have another 5V PSU to try but do have a spare ATX PSU then it isn't difficult to hook up the 0 and 5V rails from a molex connector. May be worth a go. You could easily run all the boards from 1 ATX PSU. (short pin 16 (green) to a black pin to turn the PSU on http://en.wikipedia.org/wiki/ATX#Power_supply).
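The point about faster clocks drawing more power follows from the usual first-order model for CMOS dynamic power. This is a rough textbook approximation, not a measurement from these boards: the activity factor α, the switched capacitance C and the supply voltage V are symbols introduced here, and leakage is ignored.

```latex
P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f
\qquad\Rightarrow\qquad
\frac{P_{\mathrm{dyn}}(920\,\mathrm{MHz})}{P_{\mathrm{dyn}}(1.2\,\mathrm{GHz})}
\approx \frac{920}{1200} \approx 0.77
\quad \text{(at equal } V\text{)}
```

If the lower operating point also runs at a lower core voltage, as is common with frequency scaling, the saving would be larger than this ratio suggests because of the V² term.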

James

Renato Golin

12:12 p.m.

On 4 July 2013 12:58, James Tunnicliffe james.tunnicliffe@linaro.org wrote:

...

Faster clocks also drink more power - now you are using slower clocks those boards will be stressing the PSU less.

That is true. Though, if memory serves me well, I think I was using one decent power supply and one cheap one in the lab, and both Pandas were failing randomly.

Matt, if you could have a look at my rack shelf (it's the top one with the chromebook in it), there should be a power supply with velcro on it. I only used the cheap one on my second Panda, because that was the one I was using to set it up on my desk.

Furthermore, mine were not the only Pandas failing when running toolchain testing and benchmarking...

There are plenty of other components on the board, any of which could be causing the problem, including the peripherals that you plugged in.

Just network and a USB thumb drive. Shouldn't be problematic.

I don't have a PSU at home nor a decent power supply here, so I can't perform this test, but I'm not really sure it will make a difference based on what happened in the lab rack.

--renato

Renato Golin

5 Jul

7:56 a.m.

On 4 July 2013 12:27, Renato Golin renato.golin@linaro.org wrote:

...

On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:

...
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.

It might be a bit too soon, but I just got a few 7h builds out of the boards at 920MHz without a single glitch, whereas before, they wouldn't run for more than 4hs in a row. Both boards are running non-stop since 8pm yesterday.

Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).

I do not believe it is possible to run the 4460 at 1.2GHz on full load without decent thermal management. I can see the frequency changing due to load on my log, so the kernel is doing "something", but I don't think it's actively slowing things down due to temperature concerns.

The heat sink improved the load periods (based on lab data), but as Richard said, thermal resistance has to be low along the whole path out, and the plastic casing does not help.

I've run the cpuburn at 920MHz and it runs indefinitely at around 70% max temperature (51C). When I set the maximum to 1.2GHz, it dies in 5 seconds.

Does anyone know how to turn on thermal management on the Linux kernel for the OMAP chips?

cheers, --renato

Renato Golin

6 Jul

8:39 a.m.

On 5 July 2013 08:56, Renato Golin renato.golin@linaro.org wrote:

...

Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).

Status update:

One of the boards failed at 920MHz @ 60% temperature levels. I blamed the power supply and swapped it over to see whether the *other* board would fail as well at 920MHz, which it did. So I retired that power supply and am using a third one, all cheap (all I have here).

I'm not sure how it managed to run for two full days before, but it could be that the room temperature has increased these last two days and the power supply itself is overheating.

cheers, --renato

Christian Robottom Reis

7 Jul

10 p.m.

On Sat, Jul 06, 2013 at 09:39:13AM +0100, Renato Golin wrote:

...

On 5 July 2013 08:56, Renato Golin renato.golin@linaro.org wrote:

...
Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).

Status update:

One of the boards failed at 920MHz @ 60% temperature levels. I blamed the power supply and switched to make sure the *other* board would fail as well at 920MHz, which it did. So I retired that power supply and am using a third one, all cheap (all I have here).

I know others have said this, but since you seem to not yet be convinced: the frequency is a red herring. If your PSU can't really keep up with the Panda, all bets are off. Early on in the LAVA lab we figured out they only ran reliably on those massive (IIRC, 4A) bricks that Digikey sells.

-- Christian Robottom Reis | [+1] 612 216 4935 | http://launchpad.net/~kiko Canonical VP Hyperscale | [+55 16] 9112 6430 | http://async.com.br/~kiko

Will Newton

3 Jul

10:01 p.m.

On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:

...

Hi Folks,

I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.

It may also be worth examining your power supplies and seeing if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.

-- Will Newton Toolchain Working Group, Linaro

Renato Golin

4 Jul

8:44 a.m.

On 3 July 2013 23:01, Will Newton will.newton@linaro.org wrote:

...

It may also be worth examining your power supplies and see if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.

They're cheap... *very* cheap... They're not the ones Linaro uses in the lab most of the time, but are the ones Linaro has loads of in the "power supply" drawer, and the ones that websites show you as "PandaBoard power supply".

Not this one:

http://www.digikey.com/product-detail/en/PSAC30U-050/993-1019-ND/2384432?cur...

This one:

http://www.amazon.co.uk/Pandaboard-Board-replacement-supply-adaptor/dp/B0087...

The difference in price tells you a lot... ;)

This was my conclusion when my Panda at home, on idle, was locking up. It wasn't turning off every time: sometimes it'd just lock and have one LED constantly on and the other constantly off, sometimes it'd shut down completely, and sometimes the screen would freeze, but it'd still be "running". With many other appliances connected to the same socket (TV, Sky, PS3, printer, etc), the spikes could be causing trouble.

The boards now have run overnight at 920MHz without a glitch, though they are understandably 50% slower. I'll see how they behave during today, and if they don't fail, I'll conclude that it was, indeed, the frequency, not the power supply.

cheers, --renato

Milosz Wasilewski

9 a.m.

On 4 July 2013 09:44, Renato Golin renato.golin@linaro.org wrote:

...

On 3 July 2013 23:01, Will Newton will.newton@linaro.org wrote:

...
It may also be worth examining your power supplies and see if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.

They're cheap... *very* cheap... They're not the ones Linaro uses in the lab most of the time, but are the ones Linaro has loads of in the "power supply" drawer, and the ones that websites show you as "PandaBoard power supply".

Not this one:

http://www.digikey.com/product-detail/en/PSAC30U-050/993-1019-ND/2384432?cur...

This one:

http://www.amazon.co.uk/Pandaboard-Board-replacement-supply-adaptor/dp/B0087...

What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.

milosz


Matt Hart

9:01 a.m.

I've just had a quick glance at the rack in LAVA Lab with the pandas in it, and it seems like we are using the 5V 4A brick style power supplies, not the cheap ones.

Matt

On 4 July 2013 10:00, Milosz Wasilewski milosz.wasilewski@linaro.orgwrote:

...

On 4 July 2013 09:44, Renato Golin renato.golin@linaro.org wrote:

...
On 3 July 2013 23:01, Will Newton will.newton@linaro.org wrote:

...
It may also be worth examining your power supplies and see if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.

They're cheap... *very* cheap... They're not the ones Linaro uses in the

lab

...
most of the time, but are the ones Linaro has loads of in the "power

supply"

...
drawer, and the ones that websites show you as "PandaBoard power supply".

Not this one:

http://www.digikey.com/product-detail/en/PSAC30U-050/993-1019-ND/2384432?cur...

...
This one:

http://www.amazon.co.uk/Pandaboard-Board-replacement-supply-adaptor/dp/B0087...

What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.

milosz

...
The difference in price tells you a lot... ;)

This was my conclusion when my Panda at home, on idle, was locking up. It wasn't turning off every time: sometimes it'd just lock up with one LED constantly on and the other constantly off, sometimes it'd shut down completely, and sometimes the screen would freeze but it'd still be "running". With many other appliances connected to the same socket (TV, Sky, PS3, printer, etc.), the spikes could be causing trouble.

The boards have now run overnight at 920MHz without a glitch, though they are understandably 50% slower. I'll see how they behave today, and if they don't fail, I'll conclude that it was, indeed, the frequency, not the power supply.

cheers, --renato

Renato Golin

9:10 a.m.

On 4 July 2013 10:01, Matt Hart matthew.hart@linaro.org wrote:

...

I've just had a quick glance at the rack in LAVA Lab with the pandas in it, and it seems like we are using the 5V 4A brick style power supplies, not the cheap ones.

I know, but somewhere in the lab there's a box full of the cheap ones. Those are the ones I used for my buildbots, and the ones I bought for myself at home.

--renato

Renato Golin

9:08 a.m.

On 4 July 2013 10:00, Milosz Wasilewski milosz.wasilewski@linaro.org wrote:

...

What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.

These are rated at 5V, and on my multimeter I got almost 6V, but it's not just about the voltage; it's about the constant supply of current.

Since this supply is very cheap, it doesn't have a way of ensuring constant current when the current is being temporarily diverted to another socket because of peak usage from another device. I only have a small Atom server and my laptop, so there isn't much that could cause any substantial lack of current.

cheers, --renato

Mans Rullgard

10:29 a.m.

On 4 July 2013 10:08, Renato Golin renato.golin@linaro.org wrote:

...

On 4 July 2013 10:00, Milosz Wasilewski milosz.wasilewski@linaro.org wrote:

...
What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.

These are 5V and on my multimeter I got almost 6V, but it's not just the voltage, but the constant supply of current.

Since this supply is very cheap, it doesn't have a way of ensuring constant current when the current is being temporarily diverted to another socket because of peak usage from another device.

That is not how electricity works.

-- Mans Rullgard / mru

Renato Golin

10:52 a.m.

On 4 July 2013 11:29, Mans Rullgard mans.rullgard@linaro.org wrote:

...

That is not how electricity works.

I may not have made myself clear, I suppose... We can digress at Connect about electricity.

cheers, --renato

Renato Golin

17 Jul 2013

3:59 p.m.

Folks,

I've finished my final round of tests, and I can say that there is no final conclusion on why the boards fail, but they did fail under every scenario I could try.

I've tested 3 identical boards (Panda-ES RevB2) with 5 different power supplies. Even at 920MHz, with decent power supplies (the high-quality 5V/4A ones used in the lab), they fail at 70% of their target temperatures, at least as of the last measurement (<1min before failing). So, unless they overheat in less than a minute, for no apparent reason, and get hot enough for the plastic case to hinder heat transfer, they're not really failing because of heat. The power supplies also stay very cool, so I doubt they're at fault.
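For anyone who wants to narrow that "<1min before failing" window, here is a minimal sketch of a temperature logger, not the monitor actually used on these bots. It assumes the kernel exposes the CPU temperature and a maximum (trip point) through the generic thermal sysfs files named below; on a given Panda kernel the sensor may live elsewhere (for example under hwmon), so treat both paths, and the choice of trip point 0 as the maximum, as placeholders. Each reading is flushed and synced to disk so that the last line has a better chance of surviving a hard lock-up.

```python
import os
import time
from pathlib import Path

# Assumed sysfs locations; the real paths depend on the kernel and board.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
MAX_PATH = Path("/sys/class/thermal/thermal_zone0/trip_point_0_temp")


def read_millidegrees(path):
    """Read one integer value (millidegrees Celsius) from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def temp_fraction(current, maximum):
    """Return the current temperature as a fraction of the allowed maximum."""
    if maximum <= 0:
        raise ValueError("maximum temperature must be positive")
    return current / maximum


def log_forever(log_file, interval_s=5.0, temp_path=TEMP_PATH, max_path=MAX_PATH):
    """Append one timestamped reading per interval, syncing after each line."""
    with open(log_file, "a") as log:
        while True:
            current = read_millidegrees(temp_path)
            maximum = read_millidegrees(max_path)
            frac = temp_fraction(current, maximum)
            log.write(f"{time.time():.0f} {current} {maximum} {frac:.2%}\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next sleep
            time.sleep(interval_s)
```

Running `log_forever("/var/log/panda-temp.log")` (a hypothetical log path) with a short interval would show whether the temperature really jumps in the last seconds before a lock-up or stays near 70% of the maximum.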

There is absolutely nothing in the logs: no kernel panic, no error message, nothing. Since there is no indication that lowering the frequency to 700MHz will make any difference (the heat issue was indeed very likely a red herring), I'm basically giving up on the Pandas. They were either not meant to run for long periods at full capacity, or our kernel (Linaro 3.5.0-213-omap4) is not up to the task (which is worrying). But since I'm not a kernel engineer, there isn't much I can do from where I stand.

If anyone wants to continue the investigation at the kernel level, I can help set up the boards, but for now I need to refocus on more pressing issues.

cheers, --renato

linaro-validation@lists.linaro.org

27 comments

participants (8)

Christian Robottom Reis
James Tunnicliffe
Mans Rullgard
Matt Hart
Milosz Wasilewski
Renato Golin
Richard Earnshaw
Will Newton