On 02/09/15 01:47, Daniel Thompson wrote:
On 09/02/15 09:28, Will Deacon wrote:
On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
Currently sched_clock(), a very hot code path, is not optimized to
minimise its cache profile. In particular:

  1. cd is not ____cacheline_aligned,

  2. struct clock_data does not distinguish between hotpath and
     coldpath data, reducing locality of reference in the hotpath,

  3. Some hotpath data is missing from struct clock_data and is marked
     __read_mostly (which more or less guarantees it will not share a
     cache line with cd).

This patch corrects these problems by extracting all hotpath data into a
separate structure and using ____cacheline_aligned to ensure the hotpath
uses a single (64 byte) cache line.
Have you got any performance figures for this change, or is this just a theoretical optimisation? It would be interesting to see what effect this has on systems with 32-byte cachelines and also scenarios where there's contention on the sequence counter.
Most of my testing has focused on proving the NMI safety parts of the patch work as advertised, so it's mostly theoretical.
However there are some numbers from simple tight loop calls to sched_clock (Stephen Boyd's results are more interesting than mine because I observe pretty wild quantization effects that render the results hard to trust): http://thread.gmane.org/gmane.linux.kernel/1871157/focus=1879265
Not sure what figures would be useful for a contended sequence counter. Firstly, the counter is only taken for write at 7/8 of the wrap time, so even for the fastest timers the interval between updates is likely to be >3s, and each update is very short in duration. Additionally, the NMI safety changes make it possible to read the timer whilst it is being updated, so it is only during the very short struct-copy/write/struct-copy/write update sequence that a reader will observe the extra cache line used for writes. Benchmarks that show the effect of the update are therefore non-trivial to construct.
Here's the raw numbers for the tight loop. I noticed that if I don't use perf I get a larger number of calls per 10s, most likely because we aren't doing anything else. These are very lightly loaded systems, i.e. a busybox ramdisk with nothing going on. The kernel version is v3.19-rc4. The CPU is Krait on msm8960 and msm8974, except that on msm8974 the ARM architected timers back sched_clock() vs. our own custom timer IP on msm8960. The cache line size is 64 bytes. I also ran it on msm8660, which is a Scorpion CPU with the same timer as msm8960 (custom timer IP) and a cache line size of 32 bytes. Unfortunately nobody has ported Scorpion over to perf events, so we don't get hardware events.
msm8960 (before patch)
----------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14528449 calls in 10000000290 ns
Made 14528925 calls in 10000000142 ns
Made 14524549 calls in 10000000587 ns
Made 14528164 calls in 10000000734 ns
Made 14524468 calls in 10000000290 ns
Made 14527198 calls in 10000000438 ns
Made 14523508 calls in 10000000734 ns
Made 14527894 calls in 10000000290 ns
Made 14529609 calls in 10000000734 ns
Made 14523114 calls in 10000000142 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
      10009.635016  task-clock (msec)        #  1.000 CPUs utilized          ( +-  0.00% )
                 7  context-switches         #  0.001 K/sec                  ( +- 16.16% )
                 0  cpu-migrations           #  0.000 K/sec
                58  page-faults              #  0.006 K/sec
        4003806350  cycles                   #  0.400 GHz                    ( +-  0.00% )
                 0  stalled-cycles-frontend  #  0.00% frontend cycles idle
                 0  stalled-cycles-backend   #  0.00% backend cycles idle
         921924235  instructions             #  0.23  insns per cycle        ( +-  0.01% )
                 0  branches                 #  0.000 K/sec
          58521151  branch-misses            #  5.846 M/sec                  ( +-  0.01% )
10.011767657 seconds time elapsed ( +- 0.00% )
msm8960 (after patch)
---------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 19626366 calls in 10000000587 ns
Made 19623708 calls in 10000000142 ns
Made 19623282 calls in 10000000290 ns
Made 19625304 calls in 10000000290 ns
Made 19625151 calls in 10000000291 ns
Made 19624906 calls in 10000000290 ns
Made 19625383 calls in 10000000143 ns
Made 19625235 calls in 10000000290 ns
Made 19624969 calls in 10000000290 ns
Made 19625209 calls in 10000000438 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
      10009.883401  task-clock (msec)        #  1.000 CPUs utilized          ( +-  0.00% )
                 7  context-switches         #  0.001 K/sec                  ( +- 15.88% )
                 0  cpu-migrations           #  0.000 K/sec
                58  page-faults              #  0.006 K/sec
        4003901511  cycles                   #  0.400 GHz                    ( +-  0.00% )
                 0  stalled-cycles-frontend  #  0.00% frontend cycles idle
                 0  stalled-cycles-backend   #  0.00% backend cycles idle
        1164635790  instructions             #  0.29  insns per cycle        ( +-  0.00% )
                 0  branches                 #  0.000 K/sec
          20039814  branch-misses            #  2.002 M/sec                  ( +-  0.00% )
10.012092383 seconds time elapsed ( +- 0.00% )
msm8974 (before patch)
----------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289694 calls in 10000000083 ns
Made 21289072 calls in 10000000082 ns
Made 21289550 calls in 10000000395 ns
Made 21288892 calls in 10000000291 ns
Made 21288987 calls in 10000000135 ns
Made 21289140 calls in 10000000395 ns
Made 21289161 calls in 10000000395 ns
Made 21288911 calls in 10000000239 ns
Made 21289204 calls in 10000000135 ns
Made 21288738 calls in 10000000135 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
      10003.839348  task-clock (msec)        #  1.000 CPUs utilized          ( +-  0.00% )
                 4  context-switches         #  0.000 K/sec                  ( +-  3.70% )
                 0  cpu-migrations           #  0.000 K/sec
                58  page-faults              #  0.006 K/sec
        6146323757  cycles                   #  0.614 GHz                    ( +-  0.00% )
                 0  stalled-cycles-frontend  #  0.00% frontend cycles idle
                 0  stalled-cycles-backend   #  0.00% backend cycles idle
        1155527762  instructions             #  0.19  insns per cycle        ( +-  0.00% )
         107186099  branches                 # 10.714 M/sec                  ( +-  0.00% )
          35548359  branch-misses            # 33.17% of all branches        ( +-  0.00% )
10.004769053 seconds time elapsed ( +- 0.00% )
msm8974 (after patch)
---------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289357 calls in 10000000239 ns
Made 21384961 calls in 10000000396 ns
Made 22105925 calls in 10000000238 ns
Made 27384126 calls in 10000000239 ns
Made 22107737 calls in 10000000134 ns
Made 21368867 calls in 10000000239 ns
Made 22106065 calls in 10000000395 ns
Made 27384196 calls in 10000000083 ns
Made 22107334 calls in 10000000291 ns
Made 21365426 calls in 10000000291 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
      10003.753333  task-clock (msec)        #  1.000 CPUs utilized          ( +-  0.00% )
                 7  context-switches         #  0.001 K/sec                  ( +- 18.18% )
                 0  cpu-migrations           #  0.000 K/sec
                58  page-faults              #  0.006 K/sec
        6837664600  cycles                   #  0.684 GHz                    ( +-  6.74% )
                 0  stalled-cycles-frontend  #  0.00% frontend cycles idle
                 0  stalled-cycles-backend   #  0.00% backend cycles idle
        1148993903  instructions             #  0.17  insns per cycle        ( +-  3.32% )
         115049358  branches                 # 11.501 M/sec                  ( +-  3.31% )
          42520513  branch-misses            # 36.96% of all branches        ( +-  5.00% )
10.004769533 seconds time elapsed ( +- 0.00% )
msm8660 (before patch)
----------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14099029 calls in 10000000586 ns
Made 14099227 calls in 10000000735 ns
Made 14098763 calls in 10000000439 ns
Made 14099042 calls in 10000000291 ns
Made 14099273 calls in 10000000290 ns
Made 14100377 calls in 10000000586 ns
Made 14100183 calls in 10000000586 ns
Made 14099220 calls in 10000000586 ns
Made 14098853 calls in 10000000587 ns
Made 14099368 calls in 10000000142 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
      10006.700528  task-clock (msec)        #  1.000 CPUs utilized          ( +-  0.00% )
                11  context-switches         #  0.001 K/sec                  ( +- 10.38% )
                 0  cpu-migrations           #  0.000 K/sec
                56  page-faults              #  0.006 K/sec
                 0  cycles                   #  0.000 GHz
                 0  stalled-cycles-frontend  #  0.00% frontend cycles idle
                 0  stalled-cycles-backend   #  0.00% backend cycles idle
                 0  instructions
                 0  branches                 #  0.000 K/sec
                 0  branch-misses            #  0.000 K/sec
10.008796161 seconds time elapsed ( +- 0.00% )
msm8660 (after patch)
---------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 20555901 calls in 10000000438 ns
Made 15510019 calls in 10000000142 ns
Made 15510371 calls in 10000000587 ns
Made 15509184 calls in 10000000439 ns
Made 15509068 calls in 10000000291 ns
Made 15510719 calls in 10000000439 ns
Made 15508899 calls in 10000000291 ns
Made 15509206 calls in 10000000587 ns
Made 15509057 calls in 10000000290 ns
Made 15509178 calls in 10000000735 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
      10009.491416  task-clock (msec)        #  1.000 CPUs utilized          ( +-  0.00% )
                13  context-switches         #  0.001 K/sec                  ( +- 10.82% )
                 0  cpu-migrations           #  0.000 K/sec
                58  page-faults              #  0.006 K/sec
                 0  cycles                   #  0.000 GHz
                 0  stalled-cycles-frontend  #  0.00% frontend cycles idle
                 0  stalled-cycles-backend   #  0.00% backend cycles idle
                 0  instructions
                 0  branches                 #  0.000 K/sec
                 0  branch-misses            #  0.000 K/sec
10.011834087 seconds time elapsed ( +- 0.00% )