Hi all,
I'd be interested in people's views on the following idea-- feel free to ignore if it doesn't interest you.
For power-management purposes, it's useful to be able to turn off functional blocks on the SoC.
For on-SoC peripherals, this can be managed through the driver framework in the kernel, but for functional blocks of the CPU itself which are used by instruction set extensions, such as NEON or other media accelerators, it would be interesting if processes could adapt to these units appearing and disappearing at runtime. This would mean that user processes would need to select dynamically between different implementations of accelerated functionality.
This allows for more active power management of such functional blocks: if the CPU is not fully loaded, you can turn them off -- the kernel can spot when there is significant idle time and do this. If the CPU becomes fully loaded, applications which have soft-realtime constraints can notice this and switch to their accelerated code (which will cause the kernel to switch the functional unit(s) on). Or, the kernel can react to increasing CPU load by speculatively turning them on instead. This is analogous to the behaviour of other power governors in the system. Non-aware applications will still work seamlessly -- these may simply run accelerated code if the hardware supports it, causing the kernel to turn the affected functional block(s) on.
In order for this to work, some dynamic status information would need to be visible to each user process, and polled each time a function with a dynamically switchable choice of implementations gets called. You probably don't need to worry about race conditions either-- if the process accidentally tries to use a turned-off feature, you will take a fault which gives the kernel the chance to turn the feature back on. Generally, this should be a rare occurrence.
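To illustrate what I have in mind, the calling side might look something like this -- the status word, the flag name and the helper functions are all invented for illustration, not an existing interface:

#include <stddef.h>

/* Hypothetical kernel-maintained status word; one possible way of
 * mapping it into the process is sketched further down. */
extern volatile unsigned long *dyn_hwcaps;
#define DYN_HWCAP_NEON (1UL << 0)    /* invented flag */

extern void copy_pixmap_neon(void *dst, const void *src, size_t n);
extern void copy_pixmap_generic(void *dst, const void *src, size_t n);

void copy_pixmap(void *dst, const void *src, size_t n)
{
    if (*dyn_hwcaps & DYN_HWCAP_NEON)
        copy_pixmap_neon(dst, src, n);    /* unit likely powered */
    else
        copy_pixmap_generic(dst, src, n);
}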
The dynamic feature status information should ideally be per-CPU global, though we could have a separate copy per thread, at the cost of more memory. It can't be system-global, since different CPUs may have a different set of functional blocks active at any one time -- for this reason, the information can't be stored in an existing mapping such as the vectors page. Conversely, existing mechanisms such as sysfs probably involve too much overhead to be polled every time you call copy_pixmap() or whatever.
Alternatively, each thread could register a userspace buffer (a single word is probably adequate) into which the kernel pokes the hardware status flags each time it returns to userspace, if the hardware status has changed or if the thread has been migrated.
Either of the above approaches could be prototyped as an mmap'able driver, though this may not be the best approach in the long run.
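For such a prototype, the userspace side might be as simple as this (the device node name and the single-word page layout are, again, pure invention):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

volatile unsigned long *dyn_hwcaps;

int dyn_hwcaps_init(void)
{
    int fd = open("/dev/dyn_hwcaps", O_RDONLY);    /* invented node */

    if (fd < 0)
        return -1;
    dyn_hwcaps = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);    /* the mapping survives the close */
    return dyn_hwcaps == MAP_FAILED ? -1 : 0;
}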
Does anyone have a view on whether this is a worthwhile idea, or what the best approach would be?
Cheers ---Dave
Dave,
For the case of NEON and its use in graphics libraries, we are certainly pushing explicitly for runtime detection. However, this tends to be done by detecting the presence of NEON at initialization time, rather than at each path invocation (to avoid rescanning /proc/self/auxv). Are you saying that the init code could still detect NEON this way, but there would need to be additional checks when taking individual paths?
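For reference, the init-time check amounts to something like the following sketch (error handling trimmed; HWCAP_NEON comes from <asm/hwcap.h> on ARM):

#include <elf.h>
#include <stdio.h>
#include <asm/hwcap.h>

static unsigned long hwcaps_from_auxv(void)
{
    Elf32_auxv_t entry;
    unsigned long caps = 0;
    FILE *f = fopen("/proc/self/auxv", "rb");

    if (!f)
        return 0;
    while (fread(&entry, sizeof entry, 1, f) == 1) {
        if (entry.a_type == AT_HWCAP) {
            caps = entry.a_un.a_val;
            break;
        }
    }
    fclose(f);
    return caps;
}

/* At library init: have_neon = !!(hwcaps_from_auxv() & HWCAP_NEON); */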
cheers, Jesse
On Fri, Dec 03, 2010 at 04:28:27PM +0000, Dave Martin wrote:
For on-SoC peripherals, this can be managed through the driver framework in the kernel, but for functional blocks of the CPU itself which are used by instruction set extensions, such as NEON or other media accelerators, it would be interesting if processes could adapt to these units appearing and disappearing at runtime. This would mean that user processes would need to select dynamically between different implementations of accelerated functionality.
The ELF hwcaps are used by the linker to determine what facilities are available, and therefore which dynamic libraries to link in.
For instance, if you have a selection of C libraries on your platform built for different features - eg, let's say you have a VFP based library and a soft-VFP based library.
If the linker sees - at application startup - that HWCAP_VFP is set, it will select the VFP based library. If HWCAP_VFP is not set, it will select the soft-VFP based library instead.
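For illustration, with the hwcap search directories the layout might be something like:

/lib/libc.so.6        <- soft-VFP build, always usable
/lib/vfp/libc.so.6    <- VFP build, preferred when HWCAP_VFP is set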
A VFP-based library is likely to contain VFP instructions, sometimes in the most unlikely of places - eg, printf/scanf is likely to invoke VFP instructions even when they aren't dealing with floating point in their format string.
The problem comes if you take away HWCAP_VFP after an application has been bound to the hard-VFP library: there is no way, short of killing and re-exec'ing the program, to change the libraries that it is bound to.
In order for this to work, some dynamic status information would need to be visible to each user process, and polled each time a function with a dynamically switchable choice of implementations gets called. You probably don't need to worry about race conditions either-- if the process accidentally tries to use a turned-off feature, you will take a fault which gives the kernel the chance to turn the feature back on.
Yes, you can use a fault to re-enable some features such as VFP.
The dynamic feature status information should ideally be per-CPU global, though we could have a separate copy per thread, at the cost of more memory.
Threads are migrated across CPUs so you can't rely on saying CPU0 has VFP powered up and CPU1 has VFP powered down, and then expect that threads using VFP will remain on CPU0. The system will spontaneously move that thread to CPU1 if CPU1 is less loaded than CPU0.
I think what may be possible is to hook VFP power state into the code which enables/disables access to VFP.
However, I'm not aware of any platforms or CPUs where (eg) VFP is powered or clocked independently of the main CPU.
Hi,
On Fri, Dec 3, 2010 at 4:51 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The ELF hwcaps are used by the linker to determine what facilities are available, and therefore which dynamic libraries to link in.
For instance, if you have a selection of C libraries on your platform built for different features - eg, let's say you have a VFP based library and a soft-VFP based library.
If the linker sees - at application startup - that HWCAP_VFP is set, it will select the VFP based library. If HWCAP_VFP is not set, it will select the soft-VFP based library instead.
A VFP-based library is likely to contain VFP instructions, sometimes in the most unlikely of places - eg, printf/scanf is likely to invoke VFP instructions even when they aren't dealing with floating point in their format string.
True... this is most likely to be useful for specialised functional units which are used in specific places (such as NEON), and which aren't distributed throughout the code. As you say, in general-purpose code built with -mfpu=vfp*, VFP is distributed all over the place, so you'd probably see a net cost as you thrash turning VFP on and off. The point may be moot-- I'm not aware of a SoC which can power-manage VFP; but NEON might be different.
What you describe is one of two mechanisms currently in use--- the other is for a single library to contain two implementations of certain functions and to choose between them based on the hwcaps. Typically, one set of functions is chosen at library initialisation time. Some libraries, such as libpixman, are implemented this way; and it's often preferable, since the proportion of functions in a library which get significant benefit from special instruction set extensions is often pretty small. So you avoid having duplicate copies of libraries in the filesystem. (Of course, if the distro's package manager were intelligent enough, it could avoid installing the duplicate, but that's a separate issue.)
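Schematically, the init-time selection in such a library amounts to something like this (function names invented):

#include <stddef.h>
#include <asm/hwcap.h>    /* HWCAP_NEON */

extern void copy_pixmap_neon(void *dst, const void *src, size_t n);
extern void copy_pixmap_generic(void *dst, const void *src, size_t n);

/* Function pointer set up once, at library initialisation: */
static void (*copy_pixmap)(void *dst, const void *src, size_t n);

void pixlib_init(unsigned long hwcaps)
{
    copy_pixmap = (hwcaps & HWCAP_NEON) ? copy_pixmap_neon
                                        : copy_pixmap_generic;
}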
Unfortunately, glibc does a good job of hiding not only the hwcaps passed on the initial stack but also the derived information which drives shared library selection (or at least frustrates reliable access to this information); so generally code which wants to check the hwcaps must read /proc/self/auxv (or parse /proc/cpuinfo ... but that's more laborious). However, the cost isn't too problematic if this only happens once, when a library is initialised.
In the near future, STT_GNU_IFUNC support in the tools and ld.so may add to the mix, by allowing the dynamic linker to select different implementations of code at the function level, not just the whole-library level. If so, this will provide a better way to address the optimised function selection outlined above.
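If that arrives, the source-level version might look something like this -- GCC's ifunc attribute, with the resolver run once when the dynamic linker resolves the symbol (implementations assumed to live elsewhere):

#include <stddef.h>
#include <asm/hwcap.h>

extern void copy_pixmap_neon(void *dst, const void *src, size_t n);
extern void copy_pixmap_generic(void *dst, const void *src, size_t n);
extern unsigned long hwcaps_from_auxv(void);    /* as sketched earlier */

/* The dynamic linker calls this resolver once, at relocation time,
 * and binds copy_pixmap to whichever implementation it returns. */
static void (*resolve_copy_pixmap(void))(void *, const void *, size_t)
{
    return (hwcaps_from_auxv() & HWCAP_NEON) ? copy_pixmap_neon
                                             : copy_pixmap_generic;
}

void copy_pixmap(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_copy_pixmap")));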
The problem comes if you take away HWCAP_VFP after an application has been bound to the hard-VFP library: there is no way, short of killing and re-exec'ing the program, to change the libraries that it is bound to.
Agreed--- the application has to be aware in order for this to become really useful.
However, to be clear, I'm not suggesting that the kernel should _ever_ break the contract embodied in /proc/cpuinfo, or the hwcaps passed at process startup. If the hwcaps say NEON is supported then it must be supported (though this is allowed to involve a fault and a possible SoC-specific delay while the functional unit is brought back online).
Rather, the dynamic status would indicate whether or not the functional unit is in a "ready" state or not.
Threads are migrated across CPUs so you can't rely on saying CPU0 has VFP powered up and CPU1 has VFP powered down, and then expect that threads using VFP will remain on CPU0. The system will spontaneously move that thread to CPU1 if CPU1 is less loaded than CPU0.
My theory was that this wouldn't matter -- the dynamic status contains hints that this or that functional unit is likely to be in a "ready" state. It's statistically unlikely that the thread will be suspended or migrated during a single execution of a particular function in most cases, though of course it may happen sometimes.
If a thread tries to execute an instruction and finds the functional unit turned off, the kernel then makes a decision about whether to put the thread to sleep for a bit, turn the feature on locally, or migrate the thread.
I think what may be possible is to hook VFP power state into the code which enables/disables access to VFP.
Indeed; I believe that in some implementations the SoC is clever enough to save some power automatically when these features are disabled (provided that the saving is non-destructive).
However, I'm not aware of any platforms or CPUs where (eg) VFP is powered or clocked independently of the main CPU.
As I said above, the main use case I'm aware of would be NEON; it's possible that other vendors' extensions such as iwmmxt can also be managed in a similar way, but this is outside my field of knowledge.
Cheers ---Dave
On 12/3/2010 11:35 AM, Dave Martin wrote:
What you describe is one of two mechanisms currently in use--- the other is for a single library to contain two implementations of certain functions and to choose between them based on the hwcaps. Typically, one set of functions is chosen at library initialisation time. Some libraries, such as libpixman, are implemented this way; and it's often preferable, since the proportion of functions in a library which get significant benefit from special instruction set extensions is often pretty small.
I've believed for a long time that we should try to encourage this approach. The current approach (different libraries for each hardware configuration) is prevalent, both in the toolchain ("multilibs") and in other libraries -- but it seems to me premised on the idea that one is building everything from source for one's particular hardware. In the earlier days of FOSS, the typical installation model was to download a source tarball, build it, and install it on your local machine. In that context, tuning the library "just so" for your machine made sense. But, to enable binary distribution, having to have N copies of a library (let alone an application) for N different ARM core variants just doesn't make sense to me.
So, I certainly think that things like STT_GNU_IFUNC (which enable determination of which routine to use at application start-up) make a lot of sense.
I think your idea of exposing whether a unit is "ready", to allow even more fine-grained choices as an application runs, is clever. I don't really know enough to say whether most applications could take advantage of that. One of the problems I see is that you need global information, not local information. In particular, if I'm using NEON to implement the inner loop of some performance-critical application, then when the unit is not ready, I want the kernel to wake it up already! But, if I'm just using NEON to do some random computation off the critical path, I'm probably happy to do it slowly if that's more efficient than waking up the NEON unit. But, which of these cases I'm in isn't always locally known at the point I'm doing the computation; the computation may be buried in a small library routine.
Do we have good examples of applications that could profit from this capability?
On Sun, Dec 5, 2010 at 3:14 PM, Mark Mitchell mark@codesourcery.com wrote:
I've believed for a long time that we should try to encourage this approach. The current approach (different libraries for each hardware configuration) is prevalent, both in the toolchain ("multilibs") and in other libraries -- but it seems to me premised on the idea that one is building everything from source for one's particular hardware. In the earlier days of FOSS, the typical installation model was to download a source tarball, build it, and install it on your local machine. In that context, tuning the library "just so" for your machine made sense. But, to enable binary distribution, having to have N copies of a library (let alone an application) for N different ARM core variants just doesn't make sense to me.
Just so, and as discussed before, improvements to package managers could help here to avoid installing duplicate libraries. (I believe that rpm may have some capability here (?), but deb does not at present.)
I think your idea of exposing whether a unit is "ready", to allow even more fine-grained choices as an application runs, is clever. I don't really know enough to say whether most applications could take advantage of that. One of the problems I see is that you need global information, not local information. In particular, if I'm using NEON to implement the inner loop of some performance-critical application, then when the unit is not ready, I want the kernel to wake it up already! But, if I'm just using NEON to do some random computation off the critical path, I'm probably happy to do it slowly if that's more efficient than waking up the NEON unit. But, which of these cases I'm in isn't always locally known at the point I'm doing the computation; the computation may be buried in a small library routine.
That's a fair concern -- I haven't explored the policy aspect much. One possibility is that if the kernel sees system load nearing 100%, it turns NEON on regardless. But that's a pretty crude lever, and might not bring a benefit if the software isn't able to use NEON. Subtler approaches might involve the kernel collecting statistics on applications' use of functional units, or some participation from applications with realtime requirements. Obviously, this is a bit fuzzy for now...
Do we have good examples of applications that could profit from this capability?
Currently, I don't have many examples-- the main one is related to the discussions around using NEON for memcpy(). This can be a performance win on some platforms, but except when the system is heavily loaded, or when NEON happens to be turned on anyway, it may not be advantageous for the user or overall system performance.
Cheers ---Dave
On 12/6/2010 5:07 AM, Dave Martin wrote:
But, to enable binary distribution, having to have N copies of a library (let alone an application) for N different ARM core variants just doesn't make sense to me.
Just so, and as discussed before, improvements to package managers could help here to avoid installing duplicate libraries. (I believe that rpm may have some capability here (?), but deb does not at present.)
Yes, a smarter package manager could help a device builder automatically get the right version of a library. But, something more fundamental has to happen to avoid the library developer having to *produce* N versions of a library. (Yes, in theory, you just type "make" with different CFLAGS options, but in practice of course it's often more complex than that, especially if you need to validate the library.)
Currently, I don't have many examples-- the main one is related to the discussions around using NEON for memcpy(). This can be a performance win on some platforms, but except when the system is heavily loaded, or when NEON happens to be turned on anyway, it may not be advantageous for the user or overall system performance.
How good of a proxy would the length of the copy be, do you think? If you want to copy 1G of data, and NEON makes you 2x-4x faster, then it seems to me that you probably want to use NEON, almost independent of overall system load. But, if you're only going to copy 16 bytes, even if NEON is faster, it's probably OK not to use it -- the function-call overhead to get into memcpy at all is probably significant relative to the time you'd save by using NEON. In between, it's harder, of course -- but perhaps if memcpy is the key example, we could get 80% of the benefit of your idea simply by a test inside memcpy as to the length of the data to be copied?
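That is, schematically -- with the crossover value invented and needing real measurement:

#include <stddef.h>

#define NEON_COPY_MIN 512    /* invented crossover, needs tuning */

extern void *memcpy_neon(void *dst, const void *src, size_t n);
extern void *memcpy_arm(void *dst, const void *src, size_t n);

void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    /* Only bother with NEON when the copy is long enough to
     * amortise the unit's startup cost. */
    if (n >= NEON_COPY_MIN)
        return memcpy_neon(dst, src, n);
    return memcpy_arm(dst, src, n);
}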
Hi,
On Tue, Dec 7, 2010 at 1:02 AM, Mark Mitchell mark@codesourcery.com wrote:
Yes, a smarter package manager could help a device builder automatically get the right version of a library. But, something more fundamental has to happen to avoid the library developer having to *produce* N versions of a library. (Yes, in theory, you just type "make" with different CFLAGS options, but in practice of course it's often more complex than that, especially if you need to validate the library.)
Yes-- though I didn't elaborate on it. You need a packager that can understand, say, that a binary built for ARMv5 EABI can interoperate with ARMv7 binaries etc. Again, I've heard it suggested that RPM can handle this, but I haven't looked at it in detail myself.
How good of a proxy would the length of the copy be, do you think? If you want to copy 1G of data, and NEON makes you 2x-4x faster, then it seems to me that you probably want to use NEON, almost independent of overall system load. But, if you're only going to copy 16 bytes, even if NEON is faster, it's probably OK not to use it -- the function-call overhead to get into memcpy at all is probably significant relative to the time you'd save by using NEON. In between, it's harder, of course -- but perhaps if memcpy is the key example, we could get 80% of the benefit of your idea simply by a test inside memcpy as to the length of the data to be copied?
For the memcpy() case, the answer is probably yes, though how often memcpy is called by a given thread is also of significance.
However, there's still a problem: NEON is not designed for implementing memcpy(), so there's no guarantee that it will always be faster ... it is on some SoCs in some situations, but much less beneficial on others -- the "sweet spots" both for performance and power may differ widely from core to core and from SoC to SoC. So running benchmarks on one or two boards and then hard-coding some thresholds into glibc may not be the right approach. Also, gcc implements memcpy directly in some cases (but only for small copies?)
The dynamic hwcaps approach doesn't really solve that problem: for adapting to different SoCs, you really want a way to run a benchmark on the target to make your decision (xine-lib chooses an internal memcpy implementation this way for example), or a way to pass some platform metrics to glibc / other affected libraries. Identifying the precise SoC from /proc/cpuinfo isn't always straightforward, but I've seen some code making use of it in similar ways.
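A benchmark-at-init selection along those lines might look like this sketch (iteration count and buffer size plucked from the air):

#include <stddef.h>
#include <time.h>

extern void *memcpy_neon(void *dst, const void *src, size_t n);
extern void *memcpy_arm(void *dst, const void *src, size_t n);

static void *(*memcpy_impl)(void *, const void *, size_t);

static unsigned long ns_taken(void *(*cp)(void *, const void *, size_t))
{
    static char src[65536], dst[65536];
    struct timespec t0, t1;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 64; i++)
        cp(dst, src, sizeof src);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000UL
           + (t1.tv_nsec - t0.tv_nsec);
}

void pick_memcpy(void)
{
    /* Time each candidate on this hardware and keep the winner. */
    memcpy_impl = ns_taken(memcpy_neon) < ns_taken(memcpy_arm)
                  ? memcpy_neon : memcpy_arm;
}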
Cheers ---Dave
On Tue, Dec 07, 2010 at 10:45:42AM +0000, Dave Martin wrote:
Yes-- though I didn't elaborate on it. You need a packager that can understand, say, that a binary built for ARMv5 EABI can interoperate with ARMv7 binaries etc. Again, I've heard it suggested that RPM can handle this, but I haven't looked at it in detail myself.
That is indeed the case - as on x86, it used to be common to build the majority of the distribution for i386, and glibc and a few other bits for a range of ix86 CPUs.
rpm and yum know that i386 is compatible with i486, which is compatible with i586 etc, so it will install an i386 package on i686 if no i486, i586 or i686 package is available.
It does the same for ARM with ARMv3, ARMv4 etc.
The dynamic hwcaps approach doesn't really solve that problem:
Has anyone investigated whether it is possible to power down things like Neon etc while leaving the rest of the CPU running? I've not seen anything in the ARM documentation to suggest that's the case.
Even in MPCore based systems, the interface between the SCU and individual processors by default doesn't have the necessary clamps built in to allow individual CPUs to be powered off, and I'm not aware of any designs which decided to enable this feature (as there's a performance penalty). So I'd be really surprised if there was any support for powering down Neon separately from the host CPU.
If that's the case, it's entirely pointless discussing what userspace can or can't do - if you have Neon available and can't power it down, and it's faster for doing something, you might as well use it so you can put the main CPU into WFI mode or get on with some other useful work.
Hi,
On Tue, Dec 7, 2010 at 11:04 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
That is indeed the case - as on x86, it used to be common to build the majority of the distribution for i386, and glibc and a few other bits for a range of ix86 CPUs.
rpm and yum know that i386 is compatible with i486, which is compatible with i586 etc, so it will install an i386 package on i686 if no i486, i586 or i686 package is available.
It does the same for ARM with ARMv3, ARMv4 etc.
That sounds plausible. If you really want to go to town on this it gets more complicated, but there's still a lot of value in modelling the architectural development as a linear progression in this way.
Has anyone investigated whether it is possible to power down things like Neon etc while leaving the rest of the CPU running? I've not seen anything in the ARM documentation to suggest that's the case.
Even in MPCore based systems, the interface between the SCU and individual processors by default doesn't have the necessary clamps built in to allow individual CPUs to be powered off, and I'm not aware of any designs which decided to enable this feature (as there's a performance penalty). So I'd be really surprised if there was any support for powering down Neon separately from the host CPU.
It's not part of the architecture per se, but some SoCs do put NEON in a separate power domain and can power-manage it somewhat independently.
However, I guess we need to clarify exactly how this works for SoCs in practice. If NEON and VFP can only be power-managed together (possible, though perhaps not so likely?), this is not so useful to us ... since it's becoming common to build everything with -mfpu=vfp*.
Because the kernel only uses FPEXC.EN to enable/disable these extensions, NEON and VFP are somewhat tied together ... though we might be able to gain more flexibility by toggling the CPACR.ASEDIS control bit instead. I don't believe the kernel currently touches this (?)
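i.e., something like this on the kernel side -- an untested sketch; ASEDIS is bit 31 of CPACR on cores that implement it:

/* Untested sketch: trap Advanced SIMD while leaving VFP usable. */
static inline void cpacr_set_asedis(void)
{
    unsigned long cpacr;

    asm volatile("mrc p15, 0, %0, c1, c0, 2" : "=r" (cpacr));
    cpacr |= 1UL << 31;    /* ASEDIS */
    asm volatile("mcr p15, 0, %0, c1, c0, 2" : : "r" (cpacr));
    asm volatile("isb");   /* make the change synchronous */
}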
If that's the case, it's entirely pointless discussing what userspace can or can't do - if you have Neon available and can't power it down, and it's faster for doing something, you might as well use it so you can put the main CPU into WFI mode or get on with some other useful work.
Indeed ... my layman's understanding is that it is worth it on some platforms, but I guess I need to clarify this with someone who understands the hardware.
Cheers ---Dave
On Tue, Dec 07, 2010 at 03:06:51PM +0000, Dave Martin wrote:
That sounds plausible.
That sounds like doubt.
I've used rpm extensively over the last 10 years or so, both on x86 and ARM. I've built many versions of Red Hat and Fedora for ARM. My ARM machines here (including the one which is going to send this email) run the result of that, and are currently running a mixture of ARMv3 and ARMv4 Fedora packages.
On Tue, Dec 7, 2010 at 3:21 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
That sounds plausible.
That sounds like doubt.
I've used rpm extensively over the last 10 years or so, both on x86 and ARM. I've built many versions of Red Hat and Fedora for ARM. My ARM machines here (including the one which is going to send this email) run the result of that, and are currently running a mixture of ARMv3 and ARMv4 Fedora packages.
Only doubt in the sense that I don't have experience with it myself, but I'm happy to take your word on it since you're more familiar with rpm.
Cheers ---Dave
Hi,
On Fri, 3 Dec 2010 16:28:27 +0000 Dave Martin dave.martin@linaro.org wrote:
This allows for more active power management of such functional blocks: if the CPU is not fully loaded, you can turn them off -- the kernel can spot when there is significant idle time and do this. If the CPU becomes fully loaded, applications which have soft-realtime constraints can notice this and switch to their accelerated code (which will cause the kernel to switch the functional unit(s) on). Or, the kernel can react to increasing CPU load by speculatively turning them on instead. This is analogous to the behaviour of other power governors in the system. Non-aware applications will still work seamlessly -- these may simply run accelerated code if the hardware supports it, causing the kernel to turn the affected functional block(s) on.
From a power management perspective, is it really useful to load the CPU instead of using specialized units which usually provide more computing power per watt consumed?
When the CPU is idle, it can enter sleep states to save power and let a more specialized unit do the optimized work. For example, when doing video decoding, specialized DSPs probably do a much better job from a power management perspective than the CPU would, so it's better to keep the CPU idle and let the DSP do its video decoding job. No?
Thomas
On Sun, Dec 5, 2010 at 2:12 PM, Thomas Petazzoni thomas.petazzoni@free-electrons.com wrote:
From a power management perspective, is it really useful to load the CPU instead of using specialized units which usually provide more computing power per watt consumed?
No--- but you can't in general just exchange cycles on one functional unit for cycles on another.
Suppose 90% of your code (by execution time) can take advantage of a specialised functional unit. Should you turn that unit on?
Now, suppose only 5% of the code can take advantage, but the platform is not completely busy. Turning on a special functional unit consumes extra power and will provide no speedup to the user -- is it still worth turning it on? What if the CPU is fully loaded doing other work and your program is close to missing its realtime deadlines -- should you turn on the separate unit now?
It's not an easy thing to judge -- really, I'm just wondering whether dynamic adaptation is feasible at all and whether it's worth experimenting with...
When the CPU is idle, it can enter sleep states to save power and let a more specialized unit do the optimized work. For example, when doing video decoding, specialized DSPs probably do a much better job from a power management perspective than the CPU would, so it's better to keep the CPU idle and let the DSP do its video decoding job. No?
Often, definitely yes; however, it depends on various factors -- not least, the software must have been ported to make use of the DSP in order for this to be possible at all.
But the performance and power aspects are not trivial: separate DSP units tend to have high setup and teardown costs, so as above, if the total load on the DSP will be low, it may not be worth using it at all from a power perspective; and using a DSP in the wrong way can also lead to slower execution than doing everything on the CPU.
Cheers ---Dave
Dave Martin wrote:
From a power management perspective, is it really useful to load the CPU instead of using specialized units which usually provide more computing power per watt consumed?
No--- but you can't in general just exchange cycles on one functional unit for cycles on another.
Suppose 90% of your code (by execution time) can take advantage of a specialised functional unit. Should you turn that unit on?
Now, suppose only 5% of the code can take advantage, but the platform is not completely busy. Turning on a special functional unit consumes extra power and will provide no speedup to the user -- is it still worth turning it on? What if the CPU is fully loaded doing other work and your program is close to missing its realtime deadlines -- should you turn on the separate unit now?
I think Thomas's point is that doing the 5% on the CPU may consume more power than turning on the special functional unit - even when the system is not busy and the user doesn't see a time difference.
I don't know if that's true for available hardware, but it seems like it's worth investigating before taking the idea further.
-- Jamie
On Wed, Dec 8, 2010 at 11:01 AM, Jamie Lokier jamie@shareable.org wrote:
I think Thomas's point is that doing the 5% on the CPU may consume more power than turning on the special functional unit - even when the system is not busy and the user doesn't see a time difference.
I don't know if that's true for available hardware, but it seems like it's worth investigating before taking the idea further.
Agreed -- either could be the case. It's something you can never be certain about without doing some measurements...
Cheers ---Dave
On 03/12/10 16:28, Dave Martin wrote:
In order for this to work, some dynamic status information would need to be visible to each user process, and polled each time a function with a dynamically switchable choice of implementations gets called. You probably don't need to worry about race conditions either-- if the process accidentally tries to use a turned-off feature, you will take a fault which gives the kernel the chance to turn the feature back on.
Could you do what the original FP did, and start with units off, letting the first use of $unit in the process turn it on? Do things like NEON support this?
Hi,
On Tue, Dec 7, 2010 at 9:15 PM, Ben Dooks ben-linux@fluff.org wrote:
[...]
Could you do what the original FP did, and start with units off and use the first use of $unit in the process to turn it on? Do things like NEON support this?
Actually, this is still done -- it's the same code, since NEON and VFP use a common register file. The issue under discussion is that userspace can't detect whether units like these are active or not, and so can't make dynamic runtime decisions about whether to run accelerated code or not. (And also, whether this would actually be useful.)
Cheers ---Dave