One of the Valgrind subtools is Cachegrind; this is a cache profiler. (It simulates the I1, D1 and L2 caches so it can pinpoint the sources of cache misses in application code.)
On x86 Cachegrind automatically queries the host CPU to find out what sort/size of cache it has installed, and by default will simulate that sort of cache. (You can also use command line options to specify a different cache layout to model.)
On ARM, the ARMv7 VMSA coprocessor registers which describe the cache geometry are privileged-mode access only. This means cachegrind can't do the same "default cache model is the same as your real CPU" behaviour that it does on x86.
Can the kernel folks on this list suggest whether it would be a reasonable idea for the kernel to provide some sort of userspace API so tools like cachegrind can find out the cache geometry?
Thanks in advance -- PMM
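For reference, the geometry registers mentioned above pack their fields in a fixed layout. A minimal sketch, assuming some (currently nonexistent) kernel interface handed a raw ARMv7 CCSIDR value to userspace: the decoding below follows the ARMv7-A field layout, while the names `CacheGeometry` and `decode_ccsidr` are made up for illustration.

```python
from typing import NamedTuple


class CacheGeometry(NamedTuple):
    """Geometry of one cache, as a simulator would need it (hypothetical type)."""
    line_bytes: int
    ways: int
    sets: int

    @property
    def size_bytes(self) -> int:
        return self.line_bytes * self.ways * self.sets


def decode_ccsidr(value: int) -> CacheGeometry:
    """Decode a raw ARMv7 CCSIDR value.

    Assumption: the caller obtained `value` from privileged code somehow;
    userspace cannot read the register directly.
    """
    line_size_field = value & 0x7        # bits [2:0]:  log2(words per line) - 2
    assoc_field = (value >> 3) & 0x3FF   # bits [12:3]: associativity - 1
    sets_field = (value >> 13) & 0x7FFF  # bits [27:13]: number of sets - 1
    return CacheGeometry(
        line_bytes=4 << (line_size_field + 2),  # words are 4 bytes each
        ways=assoc_field + 1,
        sets=sets_field + 1,
    )
```

A value decoded this way only describes what the hardware implements, not necessarily what the kernel has enabled.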
On Fri, Oct 15, 2010 at 2:49 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
One of the Valgrind subtools is Cachegrind; this is a cache profiler. (It simulates the I1, D1 and L2 caches so it can pinpoint the sources of cache misses in application code.)
On x86 Cachegrind automatically queries the host CPU to find out what sort/size of cache it has installed, and by default will simulate that sort of cache. (You can also use command line options to specify a different cache layout to model.)
On ARM, the ARMv7 VMSA coprocessor registers which describe the cache geometry are privileged-mode access only. This means cachegrind can't do the same "default cache model is the same as your real CPU" behaviour that it does on x86.
Can the kernel folks on this list suggest whether it would be a reasonable idea for the kernel to provide some sort of userspace API so tools like cachegrind can find out the cache geometry?
There are similar issues with the CPU ID and feature registers.
If we're going to suggest a change for this, it would be good to clear up the whole CPU identification / CPU feature detection area at the same time.
Note, a key stumbling block has been that the configuration in effect may not be the same as that supported by the hardware (because some kernel feature is compiled out, or errata workarounds are in force, for example). This means that simply mirroring the CPU registers up to userspace (or simplistically emulating the MRCs) may give rise to problems.
This is potentially something where we can make some worthwhile progress in Linaro, but there's a risk of creating Yet Another Interface which few people migrate to, exacerbating the fragmentation further. It would be interesting if anyone has thoughts on how to manage that.
Cheers ---Dave
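To illustrate the difference between what the hardware supports and what the kernel reports: on ARM Linux, /proc/cpuinfo carries a "Features" line listing the capabilities the kernel advertises. A rough sketch of reading it follows; the helper name is hypothetical, and whether this line is a sufficient answer to the problem above is not something the thread settles.

```python
def kernel_reported_features(cpuinfo_text: str) -> set:
    """Return the feature names from the first "Features" line of /proc/cpuinfo.

    This is the kernel's view, which may be narrower than what the CPU ID
    registers claim (for example when support is compiled out).
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(sorted(kernel_reported_features(f.read())))
```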
From: linaro-dev-bounces@lists.linaro.org [mailto:linaro-dev-bounces@lists.linaro.org] On Behalf Of Peter Maydell
One of the Valgrind subtools is Cachegrind; this is a cache profiler. (It simulates the I1, D1 and L2 caches so it can pinpoint the sources of cache misses in application code.)
Part of this info is exported to user space through /proc/cpuinfo.
Catalin did post a patch long back to fix up decode for v7. I recall RMK not liking some aspect. The reasons are buried in mail archives. IIRC it had to do with expectations around that interface and the constant churn around the formatting that happened.
There is enough info in cpuinfo that you can guess, but you will be wrong due to errata modifications. Also, there are a lot of cache options which are changeable. Things like whether you're using write-alloc or not will change results a lot.
Is it possible for the tool to just take its input from some config file? If you know your CPU, it could fall back and use that information. Getting the cache information from the tool would be cool. If the kernel has some issues getting the info, letting the user pick and having a default config would be the next best thing.
... if you look at proc and see v7 and smp you can infer a lot of the configuration and pick a close default.
Regards, Richard W.
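The fallback order suggested above (user config first, then a guess from /proc/cpuinfo, then a generic default) could look roughly like the sketch below. Everything here is an assumption for illustration: the JSON config format, the function names, and especially the geometry numbers, which are placeholders rather than measured values for any real core. The `--I1`, `--D1` and `--LL` options are Cachegrind's own (older releases spelled the last one `--L2`), each taking size, associativity and line size.

```python
import json
import os

# Placeholder geometries as (size_bytes, associativity, line_bytes).
# These numbers are illustrative guesses, not data for a specific CPU.
GUESSES_BY_ARCH = {
    "7": {"I1": (32768, 4, 64), "D1": (32768, 4, 64), "LL": (262144, 8, 64)},
}
GENERIC_DEFAULT = {"I1": (16384, 4, 32), "D1": (16384, 4, 32), "LL": (131072, 8, 32)}


def parse_cpuinfo(text):
    """Map each /proc/cpuinfo key to its first value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def choose_geometry(config_path, cpuinfo_text):
    """Prefer an explicit user config; otherwise guess; otherwise use a default."""
    if config_path and os.path.exists(config_path):
        with open(config_path) as f:
            raw = json.load(f)  # hypothetical format: {"I1": [size, assoc, line], ...}
        return {name: tuple(raw[name]) for name in ("I1", "D1", "LL")}
    arch = parse_cpuinfo(cpuinfo_text).get("CPU architecture", "")
    return GUESSES_BY_ARCH.get(arch, GENERIC_DEFAULT)


def cachegrind_args(geometry):
    """Format the geometry as Cachegrind command-line options."""
    args = []
    for name in ("I1", "D1", "LL"):
        size, assoc, line = geometry[name]
        args.append(f"--{name}={size},{assoc},{line}")
    return args
```

As noted above, a guess made this way can still be wrong when errata workarounds or changeable cache options are in play, which is why the user-supplied file takes priority.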
On Sat, 2010-10-16 at 02:05 +0100, Woodruff, Richard wrote:
From: linaro-dev-bounces@lists.linaro.org [mailto:linaro-dev-bounces@lists.linaro.org] On Behalf Of Peter Maydell
One of the Valgrind subtools is Cachegrind; this is a cache profiler. (It simulates the I1, D1 and L2 caches so it can pinpoint the sources of cache misses in application code.)
Part of this info is exported to user space through /proc/cpuinfo.
Catalin did post a patch long back to fix up decode for v7. I recall RMK not liking some aspect. The reasons are buried in mail archives. IIRC it had to do with expectations around that interface and the constant churn around the formatting that happened.
I recall the patch was originally implemented by Tony Thompson @ ARM but it wasn't accepted by RMK. But I think the patch wasn't giving enough information to be useful to cachegrind anyway.
The cache configuration can be a lot more complex on ARMv7 onwards as you can have several levels of cache with different cache line sizes. We've had discussions in ARM in the past but I don't think we got to any clear conclusion. Maybe Linux could export a /sys filesystem with all the CPUID registers but care needs to be taken as simply checking for Neon features doesn't mean that the kernel supports them (or that the CPU doesn't have any associated errata).
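As a point of comparison for the /sys idea, Linux already describes caches per level on some architectures under /sys/devices/system/cpu/cpuN/cache/indexM/. A minimal sketch of a tool reading that layout, assuming (and this thread does not confirm it) that an ARM kernel would export the same files; the function name is hypothetical, and missing attributes are reported as None rather than guessed.

```python
import os

ATTRS = ("level", "type", "size", "coherency_line_size", "ways_of_associativity")


def read_sysfs_caches(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return one dict per cache index directory found under cpu_dir/cache.

    Returns an empty list when the kernel exports nothing, so a caller
    can fall back to a user-supplied or default geometry.
    """
    caches = []
    cache_dir = os.path.join(cpu_dir, "cache")
    if not os.path.isdir(cache_dir):
        return caches
    for entry in sorted(os.listdir(cache_dir)):
        if not entry.startswith("index"):
            continue
        info = {}
        for attr in ATTRS:
            try:
                with open(os.path.join(cache_dir, entry, attr)) as f:
                    info[attr] = f.read().strip()
            except OSError:
                info[attr] = None  # attribute not exported for this cache
        caches.append(info)
    return caches


if __name__ == "__main__":
    for cache in read_sysfs_caches():
        print(cache)
```

Because each level is listed separately with its own line size, this shape copes with the multi-level case described above; it does not by itself address errata or kernel-support mismatches.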