Greetings,
I'm experiencing what appears to be a minimum clock resolution issue when using clock_gettime() on a PandaBoard ES running Ubuntu.
> uname -r
3.1.1-8-linaro-lt-omap

> cat /proc/version
Linux version 3.1.1-8-linaro-lt-omap (buildd@diphda) (gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3) ) #8~lt~ci~20120118001257+025756-Ubuntu SMP PREEMPT Thu Jan 19 09:
I'm using clock_gettime() (and have tried gettimeofday()) to compute the elapsed time around roughly 15ms of computation (image processing). While the computed time is stable on my x86_64 machine, it is not on my PandaBoard ES. I have tried various clocks (e.g. CLOCK_REALTIME), but the issue remains. No error codes are returned by clock_gettime().
The result on my x86_64 machine looks like this:
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)               time (before)
0s           532260ns      532us         (t1: 73741s 92573265ns)    (t0: 73741s 92041005ns)
0s           544413ns      544us         (t1: 73741s 109390136ns)   (t0: 73741s 108845723ns)
0s           529328ns      529us         (t1: 73741s 126024860ns)   (t0: 73741s 125495532ns)

A: 1.7s in total. 0.536ms on average.
If I move over to my PandaBoard ES, I calculate elapsed times of 0us on some iterations.
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)                time (before)
0s           0ns           0us           (t1: 269529s 192626951ns)   (t0: 269529s 192626951ns)
0s           0ns           0us           (t1: 269529s 215606688ns)   (t0: 269529s 215606688ns)
0s           2655030ns     2655us        (t1: 269529s 252349852ns)   (t0: 269529s 249694822ns)
0s           2593994ns     2593us        (t1: 269529s 286163328ns)   (t0: 269529s 283569334ns)
0s           30518ns       30us          (t1: 269529s 317657469ns)   (t0: 269529s 317626951ns)
If I crank up the amount of work done between the time calls (timetest.c:18: inneriters = 1e7;) such that the timed loop takes around 72ms, the timing results seem accurate and none of the intermediate calculations result in a 0us elapsed time. If I reduce it to around 10-25ms (inneriters=1e6), I get occasional 0us elapsed times. Around 2ms (inneriters=1e5), most results measure an elapsed time of 0us.
I'm trying to optimize image processing functions, which take on the order of 2-15ms to process. Am I stuck with this timing resolution? I want to be careful not to omit issues like cache performance when timing, as I might if I repeatedly process an image and average the results. Currently, that seems like the best option.
Source code and makefile attached, as well as /proc/timer_list
Is this a property of the hardware, or might it be a bug?
Thanks, Andrew
On 02/07/2012 11:43 PM, Andrew Richardson wrote:
Greetings,
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
*> uname -r* 3.1.1-8-linaro-lt-omap *> cat /proc/version* Linux version 3.1.1-8-linaro-lt-omap (buildd@diphda) (gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3) ) #8~lt~ci~20120118001257+025756-Ubuntu SMP PREEMPT Thu Jan 19 09:
I'm using clock_gettime() (and have tried gettimeofday()) to compute the
Which clock_t were you using? I think CLOCK_MONOTONIC makes sense for what you are trying to do and perhaps it has different resolution/accuracy.
elapsed time around roughly 15ms of computation (image processing). While the computed time is stable on my x86_64 machine, it is not on my PandaBoard ES. I have tried various clocks (e.g. CLOCK_REALTIME), but the issue remains. No error codes are returned by clock_gettime().
The result on my x86_64 machine looks like this:
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)               time (before)
0s           532260ns      532us         (t1: 73741s 92573265ns)    (t0: 73741s 92041005ns)
0s           544413ns      544us         (t1: 73741s 109390136ns)   (t0: 73741s 108845723ns)
0s           529328ns      529us         (t1: 73741s 126024860ns)   (t0: 73741s 125495532ns)

A: 1.7s in total. 0.536ms on average.
If I move over to my PandaBoard ES, I calculate elapsed times of 0us on some iterations.
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)                time (before)
0s           0ns           0us           (t1: 269529s 192626951ns)   (t0: 269529s 192626951ns)
0s           0ns           0us           (t1: 269529s 215606688ns)   (t0: 269529s 215606688ns)
0s           2655030ns     2655us        (t1: 269529s 252349852ns)   (t0: 269529s 249694822ns)
0s           2593994ns     2593us        (t1: 269529s 286163328ns)   (t0: 269529s 283569334ns)
0s           30518ns       30us          (t1: 269529s 317657469ns)   (t0: 269529s 317626951ns)
If I crank up the amount of work done between the time calls (timetest.c:18: inneriters = 1e7;) such that the timed loop takes around 72ms, the timing results seem accurate and none of the intermediate calculations result in a 0us elapsed time. If I reduce it to around 10-25ms (inneriters=1e6), I get occasional 0us elapsed times. Around 2ms (inneriters=1e5), most results measure an elapsed time of 0us.
I'm trying to optimize image processing functions, which take on the order of 2-15ms to process. Am I stuck with this timing resolution? I want to be careful to not omit issues like cache performance when timing, as I might if I repeatedly process an image to average the results. Currently, that seems like the best option.
Source code and makefile attached, as well as /proc/timer_list
Is this a property of the hardware, or might it be a bug?
Thanks, Andrew
linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev
I was using CLOCK_MONOTONIC_RAW before. I just tried CLOCK_MONOTONIC and CLOCK_REALTIME and did not see any improvement when timing 2-3ms events.
Andrew
On 12-02-07 06:16 PM, Zygmunt Krynicki wrote:
On 02/07/2012 11:43 PM, Andrew Richardson wrote:
Greetings,
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
*> uname -r* 3.1.1-8-linaro-lt-omap *> cat /proc/version* Linux version 3.1.1-8-linaro-lt-omap (buildd@diphda) (gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3) ) #8~lt~ci~20120118001257+025756-Ubuntu SMP PREEMPT Thu Jan 19 09:
I'm using clock_gettime() (and have tried gettimeofday()) to compute the
Which clock_t were you using? I think CLOCK_MONOTONIC makes sense for what you are trying to do and perhaps it has different resolution/accuracy.
elapsed time around roughly 15ms of computation (image processing). While the computed time is stable on my x86_64 machine, it is not on my PandaBoard ES. I have tried various clocks (e.g. CLOCK_REALTIME), but the issue remains. No error codes are returned by clock_gettime().
The result on my x86_64 machine looks like this:
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)               time (before)
0s           532260ns      532us         (t1: 73741s 92573265ns)    (t0: 73741s 92041005ns)
0s           544413ns      544us         (t1: 73741s 109390136ns)   (t0: 73741s 108845723ns)
0s           529328ns      529us         (t1: 73741s 126024860ns)   (t0: 73741s 125495532ns)

A: 1.7s in total. 0.536ms on average.
If I move over to my PandaBoard ES, I calculate elapsed times of 0us on some iterations.
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)                time (before)
0s           0ns           0us           (t1: 269529s 192626951ns)   (t0: 269529s 192626951ns)
0s           0ns           0us           (t1: 269529s 215606688ns)   (t0: 269529s 215606688ns)
0s           2655030ns     2655us        (t1: 269529s 252349852ns)   (t0: 269529s 249694822ns)
0s           2593994ns     2593us        (t1: 269529s 286163328ns)   (t0: 269529s 283569334ns)
0s           30518ns       30us          (t1: 269529s 317657469ns)   (t0: 269529s 317626951ns)
If I crank up the amount of work done between the time calls (timetest.c:18: inneriters = 1e7;) such that the timed loop takes around 72ms, the timing results seem accurate and none of the intermediate calculations result in a 0us elapsed time. If I reduce it to around 10-25ms (inneriters=1e6), I get occasional 0us elapsed times. Around 2ms (inneriters=1e5), most results measure an elapsed time of 0us.
I'm trying to optimize image processing functions, which take on the order of 2-15ms to process. Am I stuck with this timing resolution? I want to be careful to not omit issues like cache performance when timing, as I might if I repeatedly process an image to average the results. Currently, that seems like the best option.
Source code and makefile attached, as well as /proc/timer_list
Is this a property of the hardware, or might it be a bug?
Thanks, Andrew
On Wed, 2012-02-08 at 00:16 +0100, Zygmunt Krynicki wrote:
On 02/07/2012 11:43 PM, Andrew Richardson wrote:
Greetings,
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
*> uname -r* 3.1.1-8-linaro-lt-omap *> cat /proc/version* Linux version 3.1.1-8-linaro-lt-omap (buildd@diphda) (gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3) ) #8~lt~ci~20120118001257+025756-Ubuntu SMP PREEMPT Thu Jan 19 09:
I'm using clock_gettime() (and have tried gettimeofday()) to compute the
Which clock_t were you using? I think CLOCK_MONOTONIC makes sense for what you are trying to do and perhaps it has different resolution/accuracy.
Hrm. No, that shouldn't be the case. CLOCK_MONOTONIC and CLOCK_REALTIME are driven by the same accumulation, and are only different by an offset.
That said, in the test case you're using CLOCK_MONOTONIC_RAW, which I don't think you really want, as it's not NTP frequency-corrected. In addition, it is driven by slightly different accumulation logic. But you said CLOCK_REALTIME showed the same issue, so it's probably not some CLOCK_MONOTONIC_RAW-specific bug.
elapsed time around roughly 15ms of computation (image processing). While the computed time is stable on my x86_64 machine, it is not on my PandaBoard ES. I have tried various clocks (e.g. CLOCK_REALTIME), but the issue remains. No error codes are returned by clock_gettime().
The result on my x86_64 machine looks like this:
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)               time (before)
0s           532260ns      532us         (t1: 73741s 92573265ns)    (t0: 73741s 92041005ns)
0s           544413ns      544us         (t1: 73741s 109390136ns)   (t0: 73741s 108845723ns)
0s           529328ns      529us         (t1: 73741s 126024860ns)   (t0: 73741s 125495532ns)

A: 1.7s in total. 0.536ms on average.
If I move over to my PandaBoard ES, I calculate elapsed times of 0us on some iterations.
elapsed (s)  elapsed (ns)  elapsed (us)  time (after)                time (before)
0s           0ns           0us           (t1: 269529s 192626951ns)   (t0: 269529s 192626951ns)
0s           0ns           0us           (t1: 269529s 215606688ns)   (t0: 269529s 215606688ns)
0s           2655030ns     2655us        (t1: 269529s 252349852ns)   (t0: 269529s 249694822ns)
0s           2593994ns     2593us        (t1: 269529s 286163328ns)   (t0: 269529s 283569334ns)
0s           30518ns       30us          (t1: 269529s 317657469ns)   (t0: 269529s 317626951ns)
If I crank up the amount of work done between the time calls (timetest.c:18: inneriters = 1e7;) such that the timed loop takes around 72ms, the timing results seem accurate and none of the intermediate calculations result in a 0us elapsed time. If I reduce it to around 10-25ms (inneriters=1e6), I get occasional 0us elapsed times. Around 2ms (inneriters=1e5), most results measure an elapsed time of 0us.
Hrm. So I'm not familiar with the clocksource on Panda. It may be so coarse-grained as to not allow for better intervals, but 2.5ms intervals are a little outrageous. In the above you do have a 30us interval, so clearly smaller intervals are possible; I doubt that is the real issue.
2.5ms is much closer to a tick length when HZ=300. Or a sched out and in w/ HZ=1000.
I'm trying to optimize image processing functions, which take on the order of 2-15ms to process. Am I stuck with this timing resolution? I want to be careful to not omit issues like cache performance when timing, as I might if I repeatedly process an image to average the results. Currently, that seems like the best option.
Might the compiler be outsmarting you, so you end up with basically two calls to clock_gettime() next to each other? Then it would be more normal to see 0ns time intervals (if the clocksource is somewhat coarse-grained), with the occasional scheduling blip hitting in between the timings?
This explanation doesn't match your image timing results though, as I assume you're doing actual work in that case.
Hmmm. I'm a little stumped. Can anyone closer to the OMAP hardware comment?
thanks -john
On Feb 7, 2012, at 7:30 PM, John Stultz wrote:
On Wed, 2012-02-08 at 00:16 +0100, Zygmunt Krynicki wrote:
On 02/07/2012 11:43 PM, Andrew Richardson wrote:
Which clock_t were you using? I think CLOCK_MONOTONIC makes sense for what you are trying to do and perhaps it has different resolution/accuracy.
Hrm. No, that shouldn't be the case. CLOCK_MONOTONIC and CLOCK_REALTIME are driven by the same accumulation, and are only different by an offset.
That said, in the test case you're using CLOCK_MONOTONIC_RAW, which I don't think you really want, as it's not NTP frequency-corrected. In addition, it is driven by slightly different accumulation logic. But you said CLOCK_REALTIME showed the same issue, so it's probably not some CLOCK_MONOTONIC_RAW-specific bug.
In general, I don't want the time value moving around on me (in case something weird is going on and it's changing too much). This seems to be what most people advise when it comes to profiling something with sub-second execution, but I might be misunderstanding you slightly.
If I crank up the amount of work done between the time calls (timetest.c:18: inneriters = 1e7;) such that the timed loop takes around 72ms, the timing results seem accurate and none of the intermediate calculations result in a 0us elapsed time. If I reduce it to around 10-25ms (inneriters=1e6), I get occasional 0us elapsed times. Around 2ms (inneriters=1e5), most results measure an elapsed time of 0us.
Hrm. So I'm not familiar with the clocksource on Panda. It may be so coarse-grained as to not allow for better intervals, but 2.5ms intervals are a little outrageous. In the above you do have a 30us interval, so clearly smaller intervals are possible; I doubt that is the real issue.
2.5ms is much closer to a tick length when HZ=300. Or a sched out and in w/ HZ=1000.
Seems a bit too high, right? I did get some low values, such as a 500-nanosecond difference, once. I was expecting a harsh lower bound (e.g. a few ms), but a measurement of 500ns elapsed makes that theory unlikely.
I'm trying to optimize image processing functions, which take on the order of 2-15ms to process. Am I stuck with this timing resolution? I want to be careful to not omit issues like cache performance when timing, as I might if I repeatedly process an image to average the results. Currently, that seems like the best option.
Might the compiler be outsmarting you, so you end up with basically two calls to clock_gettime() next to each other? Then it would be more normal to see 0ns time intervals (if the clocksource is somewhat coarse-grained), with the occasional scheduling blip hitting in between the timings?
This explanation doesn't match your image timing results though, as I assume you're doing actual work in that case.
Hmmm. I'm a little stumped. Can anyone closer to the OMAP hardware comment?
I don't think that's the case. I used -O0, which shouldn't do such things, AFAIK. And the assembly shows a call instruction (bl) to the function being timed between two call instructions to the timestamp function.
Additionally, the times stabilize to something reasonable when cranking up the loop parameter to 1e7. My x86 machine takes around 50ms, whereas the PandaBoard takes 75ms (once what appears to be power management turns off). This seems very reasonable. I also changed the program that I'm actually interested in timing, and the times are reasonable when averaging 100 iterations (at 18ms each).
Andrew
On Tue, 2012-02-07 at 20:21 -0500, Andrew Richardson wrote:
On Feb 7, 2012, at 7:30 PM, John Stultz wrote:
On Wed, 2012-02-08 at 00:16 +0100, Zygmunt Krynicki wrote: Hrm. No, that shouldn't be the case. CLOCK_MONOTONIC and CLOCK_REALTIME are driven by the same accumulation, and are only different by an offset.
That said, in the test case you're using CLOCK_MONOTONIC_RAW, which I don't think you really want, as it's not NTP frequency-corrected. In addition, it is driven by slightly different accumulation logic. But you said CLOCK_REALTIME showed the same issue, so it's probably not some CLOCK_MONOTONIC_RAW-specific bug.
In general, I don't want the time value moving around on me (in case something weird is going on and it's changing too much). This seems to be what most people advise when it comes to profiling something with sub-second execution, but I might be misunderstanding you slightly.
The difference is "hardware constant" vs "software controlled time constant". And with all things time, its all relative. :)
CLOCK_MONOTONIC_RAW is uncorrected, so a second may not really be a second, and things like thermal changes can cause fluctuations in your timing intervals. CLOCK_MONOTONIC is NTP-corrected, so a second should be a second and thermal drift should be corrected for, but that depends on how much you trust ntpd.
That said, CLOCK_MONOTONIC can really only be corrected to within 500ppm of CLOCK_MONOTONIC_RAW, so I suspect the difference won't really matter much unless you're measuring longer intervals (at 500ppm, a 15ms interval can be off by at most about 7.5us). It's not like CLOCK_REALTIME, which is more problematic as it may be set back and forth by any amount of time.
Seems a bit too high, right? I did get some low values, such as a 500-nanosecond difference, once. I was expecting a harsh lower bound (e.g. a few ms), but a measurement of 500ns elapsed makes that theory unlikely.
Yea. I suspect something else is at play here.
thanks -john
On 02/07/2012 02:43 PM, Andrew Richardson wrote:
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
Do you have CONFIG_OMAP_32K_TIMER enabled in your kernel? Look at 'dmesg | grep clock' and check for the following:
...
OMAP clockevent source: GPTIMER1 at 32768 Hz
sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
...
Most probably this is the answer - by default, recent OMAPs are configured to use the less accurate but more energy-efficient 32kHz timer instead of the MPU timer.
Disable CONFIG_OMAP_32K_TIMER to switch to MPU timer, and check 'dmesg | grep clock' to see:
...
OMAP clockevent source: GPTIMER1 at 38400000 Hz
OMAP clocksource: GPTIMER2 at 38400000 Hz
sched_clock: 32 bits at 38MHz, resolution 26ns, wraps every 111848ms
...
BTW, I have no idea why clock_getres(CLOCK_REALTIME, ...) returns {0, 1} regardless of the underlying clock source. I expect {0, 30517} for the 32K timer and {0, 26} for the MPU timer.
Dmitry
Ah, very interesting.
> dmesg | grep clock
[ 0.000000] OMAP clockevent source: GPTIMER1 at 32768 Hz
[ 0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
[ 0.309448] omap_hwmod: l4_div_ck: missing clockdomain for l4_div_ck.
[ 0.716979] Skipping twl internal clock init and using bootloader value (unknown osc rate)
[ 1.001129] Switching to clocksource 32k_counter
[ 6.907501] twl_rtc twl_rtc: setting system clock to 2000-01-01 00:00:01 UTC (946684801)
Do you recommend using "Get linaro image tools: method 2 (source code)" ( http://releases.linaro.org/12.01/ubuntu/leb-panda/ ) and building the kernel myself? I think we're, for the most part, unconcerned with power usage for our application (we're robotics researchers and, I believe, computation is only a small fraction of the power draw when compared to the motors).
It seems to me that we would want to disable some of the power-saving changes that have been made, such as this timer, and possibly configure other settings like cache behavior, though I have no idea how they're currently set. I have a bunch of docs from ARM on power and cache config, but I haven't messed around with them as I'm not sure where to start. My best guess is that I would have to rebuild the kernel to start handling that configuration myself. Is that true?
Some people ( http://groups.google.com/group/pandaboard/browse_thread/thread/a18fa3514d130... ) have mentioned enabling line fill and prefetching to speed up memcpy operations, which also seems useful. Is this also a kernel-level setting?
If you think that's the right route, I would appreciate advice on where, within the build process, I need to start changing things to get the settings I want.
Many thanks, Andrew
On Feb 8, 2012, at 12:21 AM, Dmitry Antipov wrote:
On 02/07/2012 02:43 PM, Andrew Richardson wrote:
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
Do you have CONFIG_OMAP_32K_TIMER enabled in your kernel? Look at 'dmesg | grep clock' and check for the following:
...
OMAP clockevent source: GPTIMER1 at 32768 Hz
sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
...
Most probably this is the answer - by default, recent OMAPs are configured to use less-accurate, but more energy-saving timer (32KHz) in favor of MPU timer.
Disable CONFIG_OMAP_32K_TIMER to switch to MPU timer, and check 'dmesg | grep clock' to see:
...
OMAP clockevent source: GPTIMER1 at 38400000 Hz
OMAP clocksource: GPTIMER2 at 38400000 Hz
sched_clock: 32 bits at 38MHz, resolution 26ns, wraps every 111848ms
...
BTW, I have no idea why clock_getres(CLOCK_REALTIME, ...) returns {0, 1} regardless of the underlying clock source. I expect {0, 30517} for the 32K timer and {0, 26} for the MPU timer.
Dmitry
On 02/08/2012 01:32 AM, Andrew Richardson wrote:
Do you recommend using "Get linaro image tools: method 2 (source code)" ( http://releases.linaro.org/12.01/ubuntu/leb-panda/ ) and building the kernel myself?
Unfortunately this is the only way. In theory, there is a clocksource= boot option and a sysfs interface under /sys/devices/system/clocksource, but, IIUC, there is no way to compile the kernel with support for both the 32K and MPU timers and then select one of them as the default clock source at boot time or while the system is running.
It seems to me that we would want to disable some of the power-saving changes that have been made, such as this timer, and possibly configure other settings like cache behavior, though I have no idea how they're currently set. I have a bunch of docs from ARM on power and cache config, but I haven't messed around with them as I'm not sure where to start. My best guess is that I would have to rebuild the kernel to start handling that configuration myself. Is that true?
If you're seriously concerned about optimizing for a particular workload, you definitely should.
Some people ( http://groups.google.com/group/pandaboard/browse_thread/thread/a18fa3514d130... ) have mentioned enabling line fill and prefetching to speed up memcpy operations, which also seems useful. Is this also a kernel-level setting?
Sure. Caching (and its relationship to real memory speed) is a hard topic. As a starting point, try:
1. Read http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.htm... (this is for A8, but should be more or less applicable to A9);
2. Run 'dmesg | grep -i cache' and check for something similar to:
L310 cache controller enabled
l2x0: 16 ways, CACHE_ID 0x410000c4, AUX_CTRL 0x7e470000, Cache size: 1048576 B
3. Read http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/DDI0246F_l2c310_r3... and realize the meaning of these AUX_CTRL bits;
4. Read arch/arm/mach-omap2/omap4-common.c and arch/arm/mm/cache-l2x0.c, try to play with the 'aux_ctrl' bits within omap_l2_cache_init(), and check whether it affects your workload. Note this may cause a kernel crash and/or prevent the system from booting at all.
Dmitry
On Feb 8, 2012, at 8:55 AM, Dmitry Antipov wrote:
On 02/08/2012 01:32 AM, Andrew Richardson wrote:
Do you recommend using "Get linaro image tools: method 2 (source code)" ( http://releases.linaro.org/12.01/ubuntu/leb-panda/ ) and building the kernel myself?
Unfortunately this is the only way. In theory, there is a clocksource= boot option and a sysfs interface under /sys/devices/system/clocksource, but, IIUC, there is no way to compile the kernel with support for both the 32K and MPU timers and then select one of them as the default clock source at boot time or while the system is running.
It seems to me that we would want to disable some of the power-saving changes that have been made, such as this timer, and possibly configure other settings like cache behavior, though I have no idea how they're currently set. I have a bunch of docs from ARM on power and cache config, but I haven't messed around with them as I'm not sure where to start. My best guess is that I would have to rebuild the kernel to start handling that configuration myself. Is that true?
If you're seriously concerned about optimizing for a particular workload, you definitely should.
Some people ( http://groups.google.com/group/pandaboard/browse_thread/thread/a18fa3514d130... ) have mentioned enabling line fill and prefetching to speed up memcpy operations, which also seems useful. Is this also a kernel-level setting?
Sure. Caching (and its relationship to real memory speed) is a hard topic. As a starting point, try:
(this is for A8, but should be more or less applicable to A9);
- Run 'dmesg | grep -i cache' and check for something similar to:
L310 cache controller enabled
l2x0: 16 ways, CACHE_ID 0x410000c4, AUX_CTRL 0x7e470000, Cache size: 1048576 B
and realize the meaning of these AUX_CTRL bits;
- Read arch/arm/mach-omap2/omap4-common.c and arch/arm/mm/cache-l2x0.c, try to play with the 'aux_ctrl' bits within omap_l2_cache_init(), and check whether it affects your workload. Note this may cause a kernel crash and/or prevent the system from booting at all.
Very interesting. Looks like I have some reading in my future. Thanks much for the pointers
Andrew
On Wed, 2012-02-08 at 04:32 -0500, Andrew Richardson wrote:
Ah, very interesting.
> dmesg | grep clock
[ 0.000000] OMAP clockevent source: GPTIMER1 at 32768 Hz
[ 0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
Hrm. So 30us is still much smaller than the 2.5ms you were seeing. So that doesn't fully explain the behavior.
thanks -john
Hi,
I've never had any issue with the 32K timer and gettimeofday() on Panda (but I'm just starting to use clock_gettime()). It was used to timestamp events happening every few ms or 100s of us.
I would advise as a check:
- Read clock_gettime()/gettimeofday() and, in parallel, the 32K register (map and read physical address 0x4A304010) to check the behaviour.
- There is a potential issue (that we have never seen) when reading the 32K register, worked around by calling clock_gettime()/gettimeofday() twice (we never do that and still it works, so ...)
We have done tests in the past like while(1) { gettimeofday(); printf("time ..."); } and it worked correctly, exhibiting the 30.5us accuracy.
Regards Fred
-----Original Message-----
From: linaro-dev-bounces@lists.linaro.org [mailto:linaro-dev-bounces@lists.linaro.org] On Behalf Of John Stultz
Sent: Wednesday, February 08, 2012 6:09 PM
To: Andrew Richardson
Cc: linux-omap@vger.kernel.org; linaro-dev@lists.linaro.org
Subject: Re: Minimum timing resolution in Ubuntu/Linaro on the PandaBoard ES
On Wed, 2012-02-08 at 04:32 -0500, Andrew Richardson wrote:
Ah, very interesting.
> dmesg | grep clock
[ 0.000000] OMAP clockevent source: GPTIMER1 at 32768 Hz
[ 0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
Hrm. So 30us is still much smaller than the 2.5ms you were seeing. So that doesn't fully explain the behavior.
thanks -john
On Tue, 2012-02-07 at 21:21 -0800, Dmitry Antipov wrote:
BTW, I have no ideas why clock_getres(CLOCK_REALTIME,...) returns {0, 1} regardless of underlying clock source. I expect {0, 30517} for 32K timer and {0, 26} for MPU timer.
Yea. I had proposed exporting the underlying clocksource's resolution via clock_getres, but it was argued against. The concern is that applications might not expect clock_getres to change while the application is running, yet between any clock_getres() call and a time read, the clocksource could change.
But if someone has a different reading of the posix spec, it might be good to revisit this.
thanks -john
Hi,
On Wed, Feb 8, 2012 at 1:21 PM, Dmitry Antipov dmitry.antipov@linaro.org wrote:
On 02/07/2012 02:43 PM, Andrew Richardson wrote:
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
Do you have CONFIG_OMAP_32K_TIMER enabled in your kernel? Look at 'dmesg | grep clock' and check for the following:
...
OMAP clockevent source: GPTIMER1 at 32768 Hz
sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
...
Most probably this is the answer - by default, recent OMAPs are configured to use the less accurate but more energy-efficient 32kHz timer instead of the MPU timer.
Sorry, I have a question about the two kinds of timers. Whether or not CONFIG_OMAP_32K_TIMER is defined, the 'twd' interrupt count always increases in '/proc/interrupts' and the 'gp timer' interrupt count stays unchanged, so it looks like the MPU timer is still enabled even when CONFIG_OMAP_32K_TIMER is disabled, isn't it?
After some investigation, I found the change[1] can remove the 'twd' local timer, so the tick is driven by the 'gp timer' interrupt, but I am not sure if it is the right thing to do.
Disable CONFIG_OMAP_32K_TIMER to switch to MPU timer, and check 'dmesg | grep clock' to see:
...
OMAP clockevent source: GPTIMER1 at 38400000 Hz
OMAP clocksource: GPTIMER2 at 38400000 Hz
sched_clock: 32 bits at 38MHz, resolution 26ns, wraps every 111848ms
...
BTW, I have no idea why clock_getres(CLOCK_REALTIME, ...) returns {0, 1} regardless of the underlying clock source. I expect {0, 30517} for the 32K timer and {0, 26} for the MPU timer.
[1], 'not select LOCAL_TIMERS for OMAP4 SMP'

diff --git a/arch/arm/mach-omap2/Kconfig b/arch/arm/mach-omap2/Kconfig
index d965da4..0036218 100644
--- a/arch/arm/mach-omap2/Kconfig
+++ b/arch/arm/mach-omap2/Kconfig
@@ -46,7 +46,7 @@ config ARCH_OMAP4
 	select CPU_V7
 	select ARM_GIC
 	select HAVE_SMP
-	select LOCAL_TIMERS if SMP
+	#select LOCAL_TIMERS if SMP
 	select PL310_ERRATA_588369
 	select PL310_ERRATA_727915
 	select ARM_ERRATA_720789
thanks,
On 02/07/2012 02:43 PM, Somebody in the thread at some point said:
Hi -
I'm experiencing what appears to be a minimum clock resolution issue in using clock_gettime() on a PandaBoard ES running ubuntu.
Actually, this is a known problem:
https://bugs.launchpad.net/linaro-landing-team-ti/+bug/873453
Dave Long has been looking at it for a while; it's reported solved on the 4430 Panda now, but on the 4460 Panda he currently believes there's an issue with latency in the 32kHz timer register read, i.e., the CPU can read stale data from there under some circumstances.
He mentioned yesterday that this was getting discussed on the ARM Linux mailing list too.
-Andy