On Feb 7, 2012, at 7:30 PM, John Stultz wrote:
On Wed, 2012-02-08 at 00:16 +0100, Zygmunt Krynicki wrote:
On 02/07/2012 11:43 PM, Andrew Richardson wrote:

Which clockid_t were you using? I think CLOCK_MONOTONIC makes sense for
what you are trying to do, and perhaps it has different resolution/accuracy.

Hrm. No, that shouldn't be the case. CLOCK_MONOTONIC and CLOCK_REALTIME
are driven by the same accumulation, and differ only by an offset.

That said, in the test case you're using CLOCK_MONOTONIC_RAW, which I
don't think you really want, as it's not NTP frequency corrected. In
addition, it is driven by slightly different accumulation logic. But you
said CLOCK_REALTIME showed the same issue, so it's probably not some
CLOCK_MONOTONIC_RAW-specific bug.
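
For reference, the relationship between the three clocks is easy to see
with a small sketch (this isn't the timetest.c code, just an illustration;
on older glibc you may need to link with -lrt for clock_gettime()):

#include <stdio.h>
#include <time.h>

static long long read_ns(clockid_t id)
{
    struct timespec ts;

    clock_gettime(id, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    /* CLOCK_REALTIME and CLOCK_MONOTONIC tick from the same accumulation,
     * so their difference is essentially a fixed offset (barring clock
     * steps).  CLOCK_MONOTONIC_RAW skips the NTP frequency correction,
     * so it slowly drifts relative to the other two. */
    printf("realtime - monotonic: %lld ns\n",
           read_ns(CLOCK_REALTIME) - read_ns(CLOCK_MONOTONIC));
    printf("monotonic - raw:      %lld ns\n",
           read_ns(CLOCK_MONOTONIC) - read_ns(CLOCK_MONOTONIC_RAW));
    return 0;
}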

In general, I don't want the clock being adjusted under me while I'm measuring (in case something weird is going on and it's changing too much). Using the raw clock seems to be what most people advise for profiling something with sub-second execution, but I might be misunderstanding you slightly.



If I crank up the amount of work done between the time calls
(timetest.c:18: inneriters = 1e7;) such that the timed loop takes around
72ms, the timing results seem accurate and none of the intermediate
calculations result in a 0us elapsed time. If I reduce it to around
10-25ms (inneriters=1e6), I get occasional 0us elapsed times. Around 2ms
(inneriters=1e5), most results measure an elapsed time of 0us.
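
For concreteness, the shape of the test is roughly this (a sketch, not the
actual timetest.c; only the inneriters name and the values come from it,
everything else is made up):

#include <stdio.h>
#include <time.h>

static volatile double sink;    /* keeps the dummy work from being optimized out */

static long long time_loop(long inneriters)
{
    struct timespec t0, t1;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < inneriters; i++)
        sink += i * 0.5;        /* stand-in for the real work */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    long sizes[] = { 100000, 1000000, 10000000 };   /* 1e5, 1e6, 1e7 */
    int i;

    for (i = 0; i < 3; i++)
        printf("inneriters=%ld: %lld ns\n", sizes[i], time_loop(sizes[i]));
    return 0;
}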

Hrm. So I'm not familiar with the clocksource on the Panda. It may be so
coarse-grained as to not allow for better intervals, but 2.5ms intervals
would be a little outrageous. In the above you do have a 30us interval,
so clearly smaller intervals are possible; I doubt that's the real issue.

2.5ms is much closer to a tick length when HZ=300, or a schedule-out and
back in with HZ=1000.
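
One quick sanity check is to ask what resolution the kernel claims (a
sketch; note that clock_getres() reports the claimed granularity, which
can be finer than what the underlying clocksource actually delivers, so
the active clocksource in
/sys/devices/system/clocksource/clocksource0/current_clocksource is also
worth a look):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;

    if (clock_getres(CLOCK_MONOTONIC, &res) == 0)
        printf("CLOCK_MONOTONIC resolution:     %ld s %ld ns\n",
               (long)res.tv_sec, res.tv_nsec);
    if (clock_getres(CLOCK_MONOTONIC_RAW, &res) == 0)
        printf("CLOCK_MONOTONIC_RAW resolution: %ld s %ld ns\n",
               (long)res.tv_sec, res.tv_nsec);
    return 0;
}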

Seems a bit too high, right? I did get some low values once, such as a 500 nanosecond difference. I was expecting a hard lower bound (e.g. a few ms), but a measured elapsed time of 500ns makes that theory unlikely.



I'm trying to optimize image processing functions, which take on the
order of 2-15ms to process. Am I stuck with this timing resolution? I
want to be careful not to hide effects like cache performance when
timing, as I might if I repeatedly process the same image and average
the results. Currently, that still seems like the best option.
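
For what it's worth, the batch-averaging approach looks roughly like this
(a sketch; process_image() is just a stand-in for the real routine):

#include <stdio.h>
#include <time.h>

/* Stand-in for the real 2-15ms image processing routine. */
static void process_image(void)
{
    static volatile double sink;
    long i;

    for (i = 0; i < 1000000; i++)
        sink += i * 0.5;
}

static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    enum { REPS = 100 };
    long long t0, total;
    int i;

    /* Caveat: repeatedly processing the same data keeps it warm in the
     * cache, so the average may flatter the cold-cache case. */
    t0 = now_ns();
    for (i = 0; i < REPS; i++)
        process_image();
    total = now_ns() - t0;
    printf("average per call: %lld ns over %d reps\n", total / REPS, REPS);
    return 0;
}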

Might the compiler be outsmarting you, so that you end up with basically
two calls to clock_gettime() next to each other? Then it would be more
normal to see 0 ns intervals (if the clocksource is somewhat
coarse-grained), with the occasional scheduling blip landing in between
the timings.

This explanation doesn't match your image timing results though, as I
assume you're doing actual work in that case.

Hmmm. I'm a little stumped. Can anyone closer to the OMAP hardware
comment?
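
If the work between the two timestamps has no observable side effects,
the optimizer is allowed to delete it entirely, which would leave the
stamps back to back. A common workaround (a sketch using GCC extensions;
not from the original test) is a volatile sink or an empty asm barrier:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    long acc = 0;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 1000000; i++)
        acc += i;       /* without the barrier below, acc is never used and
                           the whole loop could be deleted at -O2, leaving
                           the two timestamps back to back */

    /* GCC: treat acc as used, and keep the loop ordered before the
     * second timestamp. */
    asm volatile("" : "+r"(acc) : : "memory");

    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%lld ns\n",
           (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec));
    return 0;
}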

I don't think that's the case. I used -O0, which shouldn't do such things, AFAIK. And the assembly shows a branch-and-link (bl) to the function being timed between the two calls to the timestamp function.

Additionally, the times stabilize to something reasonable when cranking up the loop parameter to 1e7. My x86 machine takes around 50ms, whereas the PandaBoard takes 75ms (once what appears to be power management turns off). This seems very reasonable. I also changed the program that I'm actually interested in timing, and the times are reasonable when averaging 100 iterations (at 18ms each).


Andrew