I sat down and measured the power consumption of the NEON unit on an OMAP3. Method and results are here: https://wiki.linaro.org/MichaelHope/Sandbox/NEONPower
The board draws 2.37 W and the NEON unit adds an extra 120 mW. Assuming the core takes 1 W, the code needs to run 12% faster with NEON on to be a net power win.
Note that the measurements are rough, but sound enough to draw conclusions from.
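For concreteness, the break-even arithmetic behind that 12% figure can be sketched as follows (the 1 W core power is the assumption stated above, not a measurement):

```python
# Break-even speedup for enabling NEON, from the numbers above.
# P_CORE is the assumed core power; P_NEON is the measured extra draw.
P_CORE = 1.0      # W (assumption, not measured)
P_NEON = 0.120    # W (measured extra draw with NEON active)

def breakeven_speedup(p_core, p_neon):
    """Speedup s at which energy with NEON equals energy without.

    Without NEON: E = p_core * t
    With NEON:    E = (p_core + p_neon) * (t / s)
    Equal when    s = 1 + p_neon / p_core
    """
    return 1.0 + p_neon / p_core

s = breakeven_speedup(P_CORE, P_NEON)
print(f"NEON code must run {100 * (s - 1):.0f}% faster to break even")
```

If the core actually draws more than 1 W the break-even point drops; at 2 W it would be about 6%.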
-- Michael
On Mon, Nov 29, 2010 at 3:58 AM, Michael Hope michael.hope@linaro.org wrote:
[...]
Thanks for doing this Michael. Very interesting numbers. Some questions/comments:
1. What is the load on the system when running your loop?
2. 2.37 W for the board certainly means no PM. So the actual cost of NEON will be higher when you run a kernel with working PM.
3. What percentage of the packages in the main repo of Ubuntu generate NEON?
Regards, Amit
On Mon, Nov 29, 2010 at 6:36 PM, Amit Kucheria amit.kucheria@linaro.org wrote:
Thanks for doing this Michael. Very interesting numbers. Some questions/comments:
- What is the load on the system when running your loop?
It's the only process running, so the load average should be 1.
- 2.37 W for the board certainly means no PM. So the actual cost of NEON will be higher when you run a kernel with working PM.
Yip, but the absolute cost should be the same. The interesting question is: given a workload that takes, say, X CPU seconds at full power, how much faster does the NEON version have to be to use less total energy?
- What percentage of the packages in the main repo of Ubuntu generate NEON?
Very few at the moment - only the ones that have NEON-specific backends, but that includes X via pixman and any video or audio decoders.
-- Michael
On Sun, Nov 28, 2010 at 10:28 PM, Michael Hope michael.hope@linaro.org wrote:
[...]
Just to play devil's advocate... the results will differ, perhaps significantly, between SoCs of course.
In terms of the amount of energy required to perform a particular operation (i.e., at the microbenchmark level) I agree with your conclusion. However, in practice I suspect this isn't enough. I'm not familiar with exactly when NEON is likely to get turned on and off, but you need to factor in the behaviour of the OS--- if you accelerate a DSP operation which is used a few dozen times per timeslice, NEON will be in use for only a tiny proportion of the time it is powered on, because once NEON is on, it probably stays on at least until the next interrupt, and probably until the next task switch. With the kernel configured for dynamic timer tick, this can get even more exaggerated, since the rescheduling frequency may drop.
The real benefits, in performance and power, therefore come in operations which dominate the run-time of a particular process, such as intensive image handling or codec operations. NEON in widely-dispersed but sporadically used features (such as general-purpose library code) could be expected to come at a net power cost. If you use NEON for memcpy for example, you will basically never be able to turn the NEON unit off. That's unlikely to be a win overall, since even if you now optimise all the code in the system for NEON, you're unlikely to see a significant performance boost-- NEON simply isn't designed for accelerating general-purpose code.
The correct decision for how to optimise a given piece of code seems to depend on the SoC and the runtime load profile. And while you can usefully predict that at build-time for a media player or dedicated media stack components, it's pretty much impossible to do so with general-purpose libraries... unless there's a cunning strategy I haven't thought of.
Ideally, processes whose load varies significantly over time and between different use cases (such as Xorg) would be able to select between NEON-ised and non-NEON-ised implementations dynamically, based on the current load. But I guess we're some distance away from being able to achieve that... ?
Cheers ---Dave
On Tue, Nov 30, 2010 at 12:37 AM, Dave Martin dave.martin@linaro.org wrote:
[...]
Ideally, processes whose load varies significantly over time and between different use cases (such as Xorg) would be able to select between NEON-ised and non-NEON-ised implementations dynamically, based on the current load. But I guess we're some distance away from being able to achieve that... ?
I agree. I've been wondering if this is more of a power management topic, as what you've described is basically the same problem the CPU frequency governor solves when deciding the best way to achieve a workload. Perhaps this could also turn into hints to executing code about which instruction set to use.
There might be an argument for explicit control as well. Say you're decoding an AAC stream and using 20% CPU - it might be more efficient to acquire and release the NEON unit from within the decoder, so that it starts up faster and is released as soon as the job is done.
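As a rough illustration of the potential saving (the 120 mW figure is from the measurement above; the 20% duty cycle and the one-minute window are invented for the example):

```python
# Rough energy model for a decoder that only needs NEON 20% of the time.
# P_NEON is the measured extra draw from this thread; the duty cycle is
# the hypothetical 20% CPU load of the AAC decoder mentioned above.
P_NEON = 0.120   # W, extra power while the NEON unit is powered
DUTY = 0.20      # fraction of wall-clock time the decoder actually runs
T = 60.0         # s, one minute of playback

energy_always_on = P_NEON * T          # NEON never powers down
energy_acquired = P_NEON * DUTY * T    # NEON on only while decoding

print(f"always on : {energy_always_on:.2f} J")
print(f"on demand : {energy_acquired:.2f} J")
```

With these numbers, powering NEON only while decoding saves about 5.8 J per minute of playback; the real saving depends on how cheaply the SoC can gate the unit on and off.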
Could a kernel developer describe how the NEON unit is controlled? My understanding is:
- NEON is generally off
- Executing a NEON instruction causes an instruction trap, which kicks the kernel, which starts the unit up
- The kernel only saves the NEON registers if the code uses them
I'm not sure about:
- Does NEON remain on as long as that process is executing? Does it get turned off on task switch, or perhaps after a timeout?
- VFP uses the same register set. Does a floating point instruction also turn the NEON coprocessor on?
-- Michael
Dave,
On Tue, Nov 30, 2010 at 2:15 AM, Michael Hope michael.hope@linaro.org wrote:
[...]
I'm not sure about:
- Does NEON remain on as long as that process is executing? Does it get turned off on task switch, or perhaps after a timeout?
On OMAP3, NEON is a separate power domain and can transition to a low-power state on its own based on its activity (managed by the PRCM hardware). However, the NEON power domain has a wake dependency on the MPU, which means NEON is woken up whenever the MPU comes out of standby.
- VFP uses the same register set. Does a floating point instruction also turn the NEON coprocessor on?
Yes, I suppose so, since the VFP engine is part of the NEON unit.
Vishwa
Hi,
On Mon, Nov 29, 2010 at 8:45 PM, Michael Hope michael.hope@linaro.org wrote:
[...]
Could a kernel developer describe how the NEON unit is controlled? My understanding is:
- NEON is generally off
- Executing a NEON instruction causes an instruction trap, which kicks the kernel, which starts the unit up
- The kernel only saves the NEON registers if the code uses them
I'll give the architectural view--- someone else will have to comment on the hardware.
Currently, at every context switch, the kernel disables VFP and NEON by clearing the EN bit in the FPEXC control register. The first attempt to use VFP or NEON by the process then causes a trap into the kernel, which does any necessary context switching of the VFP/NEON registers, enables them by setting FPEXC.EN, and returns to userspace. VFP and NEON remain enabled until the next context switch.
This policy has nothing to do with power--- it's purely done so that the VFP and NEON context can be switched lazily. If the kernel switches to a process that doesn't use VFP or NEON, the old register contents will remain, so you may also save an additional register bank context switch if the next context switch takes you back to the process which actually owns the register contents.
Particular SoCs may implement their own additional strategy for power management. A particular SoC may respond to the toggling of FPEXC.EN by clock-gating the whole NEON functional unit, for example. Or there may be some entirely separate logic. However, in the current implementation I believe the NEON unit can't normally be destructively powered down, since the kernel assumes that the last register contents switched into the VFP/NEON register bank are preserved.
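The lazy-switch policy described above can be modelled in a few lines (a toy sketch only, with invented names; the real logic lives in the kernel's ARM VFP support code):

```python
# Toy model of lazy VFP/NEON context switching on a single core.
# Task, CPU etc. are invented names for illustration only.

class Task:
    def __init__(self, name):
        self.name = name
        self.saved_regs = [0] * 32   # per-task copy of the register bank

class CPU:
    def __init__(self):
        self.fpexc_en = False   # FPEXC.EN: unit enabled for userspace?
        self.regs = [0] * 32    # the live register bank
        self.owner = None       # task whose state is in the live bank
        self.current = None     # task currently scheduled
        self.traps = 0

    def context_switch(self, task):
        # On every switch the kernel just clears FPEXC.EN; no register
        # state is saved or restored yet - that is deferred to the trap.
        self.fpexc_en = False
        self.current = task

    def neon_instruction(self):
        # Userspace executes a VFP/NEON instruction.
        if not self.fpexc_en:
            self._trap()
        return self.regs

    def _trap(self):
        # Kernel trap handler: lazily swap register state, then enable.
        self.traps += 1
        if self.owner is not self.current:
            if self.owner is not None:
                self.owner.saved_regs = self.regs[:]   # save old owner
            self.regs = self.current.saved_regs[:]     # restore new owner
            self.owner = self.current
        self.fpexc_en = True

cpu = CPU()
a, b = Task("a"), Task("b")

cpu.context_switch(a)
cpu.neon_instruction()          # trap: state switched to a, EN set
cpu.regs[0] = 42                # a computes something in the bank

cpu.context_switch(b)           # EN cleared, but a's state stays live
cpu.context_switch(a)           # back to a: b never touched NEON...
cpu.neon_instruction()          # ...so this trap sees owner == a: no copy
print(cpu.regs[0], cpu.traps)   # a's value survived; only 2 traps taken
```

This also shows the saving mentioned above: because b never used NEON, switching b in and back out costs no register-bank copy at all.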
I'm not sure about:
- Does NEON remain on as long as that process is executing? Does it get turned off on task switch, or perhaps after a timeout?
Basically, NEON is turned on when a process tries to execute a NEON/VFP instruction, and turned off on each task switch.
In principle, the kernel could be cleverer than this--- for example, doing the NEON/VFP register state switch non-lazily and leaving the unit on when switching to a process which is likely to use VFP/NEON; or possibly applying a timeout as you suggest.
Obviously, there's a risk of pathological behaviour if NEON/VFP is disabled too aggressively, since you could churn, constantly turning it off and then back on again.
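The churn risk is easy to see in a toy model (all numbers invented: a workload that leaves NEON idle for 9 ms out of each timeslice, and a powergate timeout in milliseconds):

```python
# Toy model of powergate churn: a task uses NEON briefly once per
# timeslice, leaving the unit idle for `gap` ms in between.  With an
# aggressive policy (timeout 0, i.e. gate on every idle gap) the unit
# cycles on and off every slice; a longer timeout rides out the gap.
# All numbers are illustrative.

def power_transitions(n_slices, gap, timeout):
    """Count off->on power-ups when NEON idles `gap` ms per slice and
    is gated after `timeout` ms of idleness."""
    transitions = 1                      # first power-up
    for _ in range(n_slices - 1):
        if gap > timeout:                # unit was gated during the gap,
            transitions += 1             # so the next slice powers it up
    return transitions

print(power_transitions(100, gap=9, timeout=0))    # aggressive: 100
print(power_transitions(100, gap=9, timeout=20))   # timeout: 1
```

Each transition costs energy and latency, so the aggressive policy can easily cost more than it saves - which is the pathological case described above.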
- VFP uses the same register set. Does a floating point instruction also turn the NEON coprocessor on?
Yes-- these are one and the same thing from the kernel's point of view. FPEXC.EN=0 basically causes all instructions accessing that register bank to trap.
Cheers ---Dave
From: linaro-dev-bounces@lists.linaro.org [mailto:linaro-dev-bounces@lists.linaro.org] On Behalf Of Dave Martin
Sent: Tuesday, November 30, 2010 3:41 AM
- VFP uses the same register set. Does a floating point instruction also turn the NEON coprocessor on?
Yes-- these are one and the same thing from the kernel's point of view. FPEXC.EN=0 basically causes all instructions accessing that register bank to trap.
NEON and VFP share the same registers, and from the OMAP PRCM (global hardware power FSM) perspective the two hardware engines are in the same power domain. Inside the ARM blocks there are often further power domains, which the SoC must supply power to in some way.
Of general interest: the standard A8 VFP block is much less powerful than the ARM11 and A9 versions. It is an iterative engine instead of a pipelined one. For the A8 the projected faster clock speeds were supposed to make it sufficient, and it allowed a smaller footprint. In the A9 other tradeoffs were made, which resulted in the more capable engine being put back in on the VFP side. The A8 does have a RunFast mode which allows certain VFP operations to execute in the NEON pipeline for speed, for those who need it. A few customers did this.
NEON is much faster for single precision, if you don't mind less-than-full IEEE compliance. It might also end up being more power efficient, depending on how the SoC powers the block. For double precision you have to use VFP, as NEON can't do it.
Your description of operations seems to follow what I've seen before. Finding a clever way to exploit it would be the trick.
On 11/30/2010 03:41 AM, Dave Martin wrote:
[...]
Particular SoCs may implement their own additional strategy for power management. A particular SoC may respond to the toggling of FPEXC.EN by clock-gating the whole NEON functional unit, for example. Or there may be some entirely separate logic. However, in the current implementation I believe the NEON unit can't normally be destructively powered down, since the kernel assumes that the last register contents switched into the VFP/NEON register bank are preserved.
On SMP, the registers are saved on context switch because the process can be moved to another core. On UP, they are saved lazily when the next process accesses NEON. So powergating in the UP case would have to be handled differently.
I'm not sure about:
- Does NEON remain on as long as that process is executing? Does it get turned off on task switch, or perhaps after a timeout?
Basically, NEON is turned on when a process tries to execute a NEON/VFP instruction, and turned off on each task switch.
In principle, the kernel could be cleverer than this--- for example, doing the NEON/VFP register state switch non-lazily and leaving the unit on when switching to a process which is likely to use VFP/NEON; or possibly applying a timeout as you suggest.
Obviously, there's a risk of pathological behaviour if NEON/VFP is disabled too aggressively, since you could churn, constantly turning it off and then back on again.
Adding powergating when enabling/disabling NEON will increase overhead and make the problem worse. Probably, some sort of timeout to powergate NEON would be better.
Another possibility would be controlling the cpu affinity for processes using NEON. This would help keep NEON powered off on most cores.
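On Linux, the pinning part of that idea can be sketched with the standard affinity calls (core 0 is an arbitrary choice for illustration; deciding which core should host the NEON users is the actual policy problem):

```python
# Sketch of the affinity idea: confine a NEON-using process to one core
# so the NEON units on the remaining cores can stay powered down.
import os

def confine_to_core(pid=0, core=0):
    """Pin `pid` (0 = the calling process) to a single CPU and return
    the resulting affinity mask."""
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)

# A NEON-ised decoder would call this before starting work:
print(confine_to_core(core=0))
```

Something - the scheduler, a daemon, or the process itself - would still need to know which processes are NEON users, which brings back the detection problem discussed earlier in the thread.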
Rob