On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
On Wed, 22 Apr 2015, Eric Dumazet wrote:
Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers") for a specific example of the problems that can be raised.
If you have a problem with the core timer code then it should be fixed there and not worked around in some place where it will ruin stuff for users who care about power saving. I'm so tired of this 'I fix it in my sandbox' attitude, really. If the core code has a shortcoming we fix it there right away because you are probably not the only one who runs into that shortcoming. So if we don't fix it in the core we end up with a metric ton of slightly different (or broken) workarounds which affect the workload/system characteristics of other people.
Just for the record. Even the changelog of this commit is blatantly wrong:
"We can see that timers get migrated into a single cpu, presumably idle at the time timers are set up."
Spare me the obvious typo. A 'not' is missing.
The timer migration moves timers to non-idle CPUs to leave the idle ones alone for the sake of power saving.
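To make the migration policy concrete, here is a rough userspace model of the decision: a timer stays on the arming CPU if it is pinned, if migration is disabled, or if that CPU is busy anyway; otherwise the CPUs are scanned for a non-idle one. This is only a sketch under stated assumptions. `struct cpu_model` and `pick_timer_target()` are hypothetical names, not kernel APIs, and the flat array scan stands in for whatever topology walk the real get_nohz_timer_target() does. Note that the scan reads per-CPU state for every candidate, which is where the per-CPU cache line cost comes from.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct cpu_model {
    bool idle;  /* true if this CPU currently has nothing to run */
};

/*
 * Return the CPU a newly armed timer should be queued on.
 *
 * The timer stays on this_cpu when it is pinned, when migration is
 * switched off, or when this_cpu is already busy. Otherwise the first
 * non-idle CPU is chosen so that idle CPUs can stay asleep.
 */
int pick_timer_target(const struct cpu_model *cpus, size_t ncpus,
                      int this_cpu, bool pinned, bool migration_enabled)
{
    if (pinned || !migration_enabled || !cpus[this_cpu].idle)
        return this_cpu;

    /* Each iteration touches state belonging to another CPU. */
    for (size_t i = 0; i < ncpus; i++) {
        if (!cpus[i].idle)
            return (int)i;
    }

    /* Everything is idle: keep the timer local. */
    return this_cpu;
}
```

With CPUs 0, 1 and 3 idle and CPU 2 busy, a non-pinned timer armed on CPU 0 lands on CPU 2 in this model, while a pinned one stays on CPU 0.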
I can see and understand the reason why you want to avoid that, but I have to ask the question whether this pinning is the correct behaviour under all workloads and system characteristics. If yes, then the patch is the right answer, if no, then it is simply the wrong approach.
but /proc/sys/kernel/timer_migration adds a fair overhead in many workloads.
get_nohz_timer_target() has to touch 3 cache lines per cpu...
And this is something we can fix and completely avoid if we think about it. Looking at the code I have to admit that the out-of-line call and the sysctl variable lookup are silly. But it's not rocket science to fix this.
Awesome.