Hi Ingo/Thomas/Peter,
While queuing a timer, we try to migrate it to a non-idle core if the local core is idle, but we don't try that if the timer is re-armed from its handler.
There were a few unsolved problems because of which this was avoided until now, but there are cases where solving them can be useful. When a timer is always re-armed from its handler, it never migrates to other cores, and it often ends up waking an idle core just to service the timer, which a non-idle core could have handled.
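For illustration, the pattern in question looks roughly like this (a hypothetical self-rearming timer, not code from any particular driver):

	#include <linux/timer.h>
	#include <linux/jiffies.h>

	static struct timer_list poll_timer;	/* hypothetical example */

	static void poll_timer_fn(unsigned long data)
	{
		/* ... do the periodic work ... */

		/*
		 * Re-arm from within the handler: today __mod_timer() keeps
		 * the timer on the same base in this case, so it keeps firing
		 * on the same core even when that core is otherwise idle.
		 */
		mod_timer(&poll_timer, jiffies + HZ);
	}

	static void poll_timer_start(void)
	{
		setup_timer(&poll_timer, poll_timer_fn, 0);
		mod_timer(&poll_timer, jiffies + HZ);
	}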
Peter suggested [1] a few changes which can make that work, and the first patch does exactly that. The second one is a minor improvement that replaces the 'running_timer' pointer with a 'busy' flag. That variable is only required for a sanity check during the CPU hot-unplug operation; I was not sure whether we should drop this extra variable ('running_timer' or 'busy') and the sanity check altogether.
Because we use another bit of the base pointer to keep track of the running status of a timer, we get a build error on blackfin, as it doesn't respect ____cacheline_aligned [2]:
kernel/time/timer.c: In function 'init_timers':
kernel/time/timer.c:1731:2: error: call to '__compiletime_assert_1731' declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct tvec_base) & TIMER_FLAG_MASK
-- viresh
[1] https://lkml.org/lkml/2015/3/28/32
[2] https://lkml.org/lkml/2015/3/29/178
Cc: Steven Miao <realmz6@gmail.com>
Viresh Kumar (2):
  timer: Avoid waking up an idle core by migrating a running timer
  timer: Replace base->'running_timer' with 'busy'

 include/linux/timer.h |   3 +-
 kernel/time/timer.c   | 102 ++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 81 insertions(+), 24 deletions(-)
While queuing a timer, we try to migrate it to a non-idle core if the local core is idle, but we don't try that if the timer is re-armed from its handler.
There were a few unsolved problems because of which this was avoided until now, but there are cases where solving them can be useful. When a timer is always re-armed from its handler, it never migrates to other cores, and it often ends up waking an idle core just to service the timer, which a non-idle core could have handled.
Problems to solve, if we allow such migrations:
- Serialization of the timer with itself: it is possible that the timer fires on the new base before its handler has finished on the old base.
- del_timer_sync() can't detect that the timer's handler has not finished.
__mod_timer() migrates the timer with the following code:
	spin_lock(&old_base->lock);
	timer_set_base(timer, NULL);
	spin_unlock(&old_base->lock);

	spin_lock(&new_base->lock);
	timer_set_base(timer, new_base);
	spin_unlock(&new_base->lock);
Once new_base->lock is released, del_timer_sync() can take new_base->lock and will see new_base->running_timer != timer. del_timer_sync() will then remove the timer and return while the timer's handler hasn't finished yet on the old base.
To fix these issues, let's use another bit of the base pointer to mark whether a timer's handler is currently running. This can be used to verify the timer's state in del_timer_sync().

Before running a timer's handler, we must ensure it isn't already running on any other CPU. If it is, process the other expired timers first, if any, and then wait until the handler finishes.
Cc: Steven Miao <realmz6@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/timer.h |  3 +-
 kernel/time/timer.c   | 93 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 76 insertions(+), 20 deletions(-)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..68bf09d69352 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE		0x1LU
 #define TIMER_IRQSAFE			0x2LU
+#define TIMER_RUNNING			0x4LU
 
-#define TIMER_FLAG_MASK			0x3LU
+#define TIMER_FLAG_MASK			0x7LU
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
 		.entry = { .prev = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..364644811485 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -105,6 +105,21 @@ static inline unsigned int tbase_get_irqsafe(struct tvec_base *base)
 	return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
 }
 
+static inline unsigned int timer_running(struct timer_list *timer)
+{
+	return ((unsigned int)(unsigned long)timer->base & TIMER_RUNNING);
+}
+
+static inline void timer_set_running(struct timer_list *timer)
+{
+	timer->base = (struct tvec_base *)((unsigned long)timer->base | TIMER_RUNNING);
+}
+
+static inline void timer_clear_running(struct timer_list *timer)
+{
+	timer->base = (struct tvec_base *)((unsigned long)timer->base & ~TIMER_RUNNING);
+}
+
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
 	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
@@ -781,21 +796,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);
 
 	if (base != new_base) {
-		/*
-		 * We are trying to schedule the timer on the local CPU.
-		 * However we can't change timer's base while it is running,
-		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
-		 */
-		if (likely(base->running_timer != timer)) {
-			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
-			spin_unlock(&base->lock);
-			base = new_base;
-			spin_lock(&base->lock);
-			timer_set_base(timer, base);
-		}
+		/* See the comment in lock_timer_base() */
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
 	}
 
 	timer->expires = expires;
@@ -1016,7 +1022,7 @@ int try_to_del_timer_sync(struct timer_list *timer)
 
 	base = lock_timer_base(timer, &flags);
 
-	if (base->running_timer != timer) {
+	if (!timer_running(timer)) {
 		timer_stats_timer_clear_start_info(timer);
 		ret = detach_if_pending(timer, base, true);
 	}
@@ -1050,12 +1056,12 @@ EXPORT_SYMBOL(try_to_del_timer_sync);
  *    ----                             ----
  *                                     <SOFTIRQ>
  *                                       call_timer_fn();
- *                                       base->running_timer = mytimer;
+ *                                       timer_set_running(mytimer);
  *  spin_lock_irq(somelock);
  *                                     <IRQ>
  *                                        spin_lock(somelock);
  *  del_timer_sync(mytimer);
- *   while (base->running_timer == mytimer);
+ *   while (timer_running(mytimer));
  *
  * Now del_timer_sync() will never return and never release somelock.
  * The interrupt on the other CPU is waiting to grab somelock but
@@ -1189,12 +1195,41 @@ static inline void __run_timers(struct tvec_base *base)
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
 		list_replace_init(base->tv1.vec + index, head);
+
+again:
 		while (!list_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
 			bool irqsafe;
 
-			timer = list_first_entry(head, struct timer_list,entry);
+			timer = list_first_entry(head, struct timer_list, entry);
+
+			if (unlikely(timer_running(timer))) {
+				/*
+				 * The only way to get here is if the handler,
+				 * running on another base, re-queued itself on
+				 * this base, and the handler hasn't finished
+				 * yet.
+				 */
+
+				if (list_is_last(&timer->entry, head)) {
+					/*
+					 * Drop lock, so that TIMER_RUNNING can
+					 * be cleared on another base.
+					 */
+					spin_unlock(&base->lock);
+
+					while (timer_running(timer))
+						cpu_relax();
+
+					spin_lock(&base->lock);
+				} else {
+					/* Rotate the list and try someone else */
+					list_move_tail(&timer->entry, head);
+				}
+				goto again;
+			}
+
 			fn = timer->function;
 			data = timer->data;
 			irqsafe = tbase_get_irqsafe(timer->base);
@@ -1202,6 +1237,7 @@ static inline void __run_timers(struct tvec_base *base)
 			timer_stats_account_timer(timer);
 
 			base->running_timer = timer;
+			timer_set_running(timer);
 			detach_expired_timer(timer, base);
 
 			if (irqsafe) {
@@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
 				call_timer_fn(timer, fn, data);
 				spin_lock_irq(&base->lock);
 			}
+
+			/*
+			 * Handler running on this base, re-queued itself on
+			 * another base ?
+			 */
+			if (unlikely(timer->base != base)) {
+				unsigned long flags;
+				struct tvec_base *tbase;
+
+				spin_unlock(&base->lock);
+
+				tbase = lock_timer_base(timer, &flags);
+				timer_clear_running(timer);
+				spin_unlock(&tbase->lock);
+
+				spin_lock(&base->lock);
+			} else {
+				timer_clear_running(timer);
+			}
 		}
 	}
 	base->running_timer = NULL;
On Tue, Mar 31, 2015 at 12:25:16PM +0530, Viresh Kumar wrote:
Cc: Steven Miao realmz6@gmail.com
+#define TIMER_FLAG_MASK 0x7LU
So Steven, this will break compilation on blackfin, because it makes ____cacheline_aligned a NOP, while we assume it will generate __attribute__((__aligned__(SMP_CACHE_BYTES))), with the further assumption that SMP_CACHE_BYTES >= 8.

Can you explain why blackfin chooses to break ____cacheline_aligned for UP? We have the *_in_smp variants to distinguish these cases.
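To spell out the assumption (a plain userspace sketch, not the kernel code itself): the timer flag bits live in the low bits of the base pointer, which only works when the pointed-to structure is aligned to at least 8 bytes, so that those low bits are guaranteed to be zero.

	#include <assert.h>

	#define TIMER_DEFERRABLE	0x1LU
	#define TIMER_IRQSAFE		0x2LU
	#define TIMER_RUNNING		0x4LU	/* new in this series */
	#define TIMER_FLAG_MASK		0x7LU

	/* stand-in for struct tvec_base; the real one is ____cacheline_aligned */
	struct base { long dummy; } __attribute__((aligned(8)));

	int main(void)
	{
		static struct base b;
		unsigned long word = (unsigned long)&b | TIMER_IRQSAFE | TIMER_RUNNING;

		/* works only because the low three bits of &b are zero */
		assert(((unsigned long)&b & TIMER_FLAG_MASK) == 0);
		assert((struct base *)(word & ~TIMER_FLAG_MASK) == &b);
		assert((word & TIMER_FLAG_MASK) == (TIMER_IRQSAFE | TIMER_RUNNING));
		return 0;
	}

With ____cacheline_aligned degraded to nothing, __alignof__(struct tvec_base) can end up smaller than 8 and the low bits of the pointer are no longer free, which is exactly what the BUILD_BUG_ON catches.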
On Tue, 31 Mar 2015, Viresh Kumar wrote:
@@ -1189,12 +1195,41 @@ static inline void __run_timers(struct tvec_base *base)
cascade(base, &base->tv5, INDEX(3));
++base->timer_jiffies;
list_replace_init(base->tv1.vec + index, head);

again:
while (!list_empty(head)) {
void (*fn)(unsigned long);
unsigned long data;
bool irqsafe;
timer = list_first_entry(head, struct timer_list,entry);
timer = list_first_entry(head, struct timer_list, entry);
if (unlikely(timer_running(timer))) {
/*
* The only way to get here is if the handler,
* running on another base, re-queued itself on
* this base, and the handler hasn't finished
* yet.
*/
if (list_is_last(&timer->entry, head)) {
/*
* Drop lock, so that TIMER_RUNNING can
* be cleared on another base.
*/
spin_unlock(&base->lock);
while (timer_running(timer))
cpu_relax();
spin_lock(&base->lock);
} else {
/* Rotate the list and try someone else */
list_move_tail(&timer->entry, head);
}
Can we please stick that timer into the next bucket and be done with it?
goto again;
"continue;" is old school, right?
}
fn = timer->function;
data = timer->data;
irqsafe = tbase_get_irqsafe(timer->base);
@@ -1202,6 +1237,7 @@ static inline void __run_timers(struct tvec_base *base)
timer_stats_account_timer(timer);

base->running_timer = timer;
timer_set_running(timer);
detach_expired_timer(timer, base);

if (irqsafe) {
@@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
call_timer_fn(timer, fn, data);
spin_lock_irq(&base->lock);
}
/*
* Handler running on this base, re-queued itself on
* another base ?
*/
if (unlikely(timer->base != base)) {
unsigned long flags;
struct tvec_base *tbase;
spin_unlock(&base->lock);
tbase = lock_timer_base(timer, &flags);
timer_clear_running(timer);
spin_unlock(&tbase->lock);
spin_lock(&base->lock);
} else {
timer_clear_running(timer);
}
Aside from the above, this is really horrible. Why not do the obvious:
__mod_timer()
	if (base != newbase) {
		if (timer_running()) {
			list_add(&base->migration_list);
			goto out_unlock;
		}
		.....
__run_timers()
	while (!list_empty(head)) {
		....
	}

	if (unlikely(!list_empty(&base->migration_list))) {
		/* dequeue and requeue again */
	}
Simple, isn't it?
No new flags in the timer base, no concurrent expiry, no extra conditionals in the expiry path. Just a single extra check at the end of the softirq handler for this rare condition instead of imposing all that nonsense to the hotpath.
We might even optimize that:
	if (timer_running()) {
		list_add(&base->migration_list);
		base->preferred_target = newbase;
		goto out_unlock;
	}

	if (unlikely(!list_empty(&base->migration_list))) {
		/* dequeue and requeue again */
		while (!list_empty(&base->migration_list)) {
			dequeue_timer();
			newbase = base->preferred_target;
			unlock(base);
			lock(newbase);
			enqueue(newbase);
			unlock(newbase);
			lock(base);
		}
	}
Thanks,
tglx
Hi Thomas,
On Wednesday 15 April 2015 04:43 AM, Thomas Gleixner wrote:
No new flags in the timer base, no concurrent expiry, no extra conditionals in the expiry path. Just a single extra check at the end of the softirq handler for this rare condition instead of imposing all that nonsense to the hotpath.
Here is what I could get to based on your suggestions:
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..c412511d34b8 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -78,6 +78,8 @@ struct tvec_root {
 struct tvec_base {
 	spinlock_t lock;
 	struct timer_list *running_timer;
+	struct list_head migration_list;
+	struct tvec_base *preferred_target;
 	unsigned long timer_jiffies;
 	unsigned long next_timer;
 	unsigned long active_timers;
@@ -785,10 +787,18 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 		 * We are trying to schedule the timer on the local CPU.
 		 * However we can't change timer's base while it is running,
 		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
+		 * handler yet has not finished.
+		 *
+		 * Move timer to migration_list which can be processed after all
+		 * timers in current cycle are serviced. This also guarantees
+		 * that the timer is serialized wrt itself.
 		 */
-		if (likely(base->running_timer != timer)) {
+		if (unlikely(base->running_timer == timer)) {
+			timer->expires = expires;
+			base->preferred_target = new_base;
+			list_add_tail(&timer->entry, &base->migration_list);
+			goto out_unlock;
+		} else {
 			/* See the comment in lock_timer_base() */
 			timer_set_base(timer, NULL);
 			spin_unlock(&base->lock);
@@ -1214,6 +1224,33 @@ static inline void __run_timers(struct tvec_base *base)
 				spin_lock_irq(&base->lock);
 			}
 		}
+
+		/*
+		 * Timer handler re-armed timer and we didn't wanted to add that
+		 * on a idle local CPU. Its handler has finished now, lets
+		 * enqueue it again.
+		 */
+		if (unlikely(!list_empty(&base->migration_list))) {
+			struct tvec_base *new_base = base->preferred_target;
+
+			do {
+				timer = list_first_entry(&base->migration_list,
+						struct timer_list, entry);
+
+				__list_del(timer->entry.prev, timer->entry.next);
+
+				/* See the comment in lock_timer_base() */
+				timer_set_base(timer, NULL);
+				spin_unlock(&base->lock);
+
+				spin_lock(&new_base->lock);
+				timer_set_base(timer, new_base);
+				internal_add_timer(new_base, timer);
+				spin_unlock(&new_base->lock);
+
+				spin_lock(&base->lock);
+			} while (!list_empty(&base->migration_list));
+		}
 	}
 	base->running_timer = NULL;
 	spin_unlock_irq(&base->lock);
@@ -1583,6 +1620,7 @@ static int init_timers_cpu(int cpu)
 	for (j = 0; j < TVR_SIZE; j++)
 		INIT_LIST_HEAD(base->tv1.vec + j);
 
+	INIT_LIST_HEAD(&base->migration_list);
 	base->timer_jiffies = jiffies;
 	base->next_timer = base->timer_jiffies;
 	base->active_timers = 0;
Does this look any better ?
I tested this on an Exynos (dual ARM Cortex-A9) board running Ubuntu. The system was fairly idle, and I ran 'dmesg > /dev/null' in an infinite loop on one of the CPUs to get the CPUs out of idle.

I have a few more concerns/diffs on top of this to further optimize things, but I kept them separate so that I can drop them if they aren't correct.
1.) Should the above list_empty(migration_list) block be moved outside the
while (time_after_eq(jiffies, base->timer_jiffies))
block, so that we check it only once per timer interrupt?
Also, ->running_timer is still set to the last serviced timer, and a del_timer_sync() might be waiting to remove it, but we continue with the migration list first, without clearing it. Not sure if this is important at all.
2.) By the time we finish serving all pending timers, the local CPU might not be idle anymore, or the target CPU may have gone idle.
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c412511d34b8..d75d31e9dc49 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1233,6 +1233,14 @@ static inline void __run_timers(struct tvec_base *base)
 		if (unlikely(!list_empty(&base->migration_list))) {
 			struct tvec_base *new_base = base->preferred_target;
 
+			if (!idle_cpu(base->cpu)) {
+				/* Local CPU not idle anymore */
+				new_base = base;
+			} else if (idle_cpu(new_base->cpu)) {
+				/* Re-evaluate base, target CPU has gone idle */
+				new_base = per_cpu(tvec_bases, get_nohz_timer_target(false));
+			}
+
 			do {
 				timer = list_first_entry(&base->migration_list,
 						struct timer_list, entry);
The first if block gets hit a lot in my tests, but the second one has never been hit (probably because there are only two CPUs, not sure).
3.) It might be better to update preferred_target every time we choose a different base. This may help us avoid calling get_nohz_timer_target() in the second if block above.
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d75d31e9dc49..558cd98abd87 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -783,6 +783,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);
 
 	if (base != new_base) {
+		base->preferred_target = new_base;
+
 		/*
 		 * We are trying to schedule the timer on the local CPU.
 		 * However we can't change timer's base while it is running,
@@ -795,7 +797,6 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 		 */
 		if (unlikely(base->running_timer == timer)) {
 			timer->expires = expires;
-			base->preferred_target = new_base;
 			list_add_tail(&timer->entry, &base->migration_list);
 			goto out_unlock;
 		} else {
-- viresh
* viresh kumar viresh.kumar@linaro.org wrote:
/*
* Timer handler re-armed timer and we didn't wanted to add that
* on a idle local CPU. Its handler has finished now, lets
* enqueue it again.
*/
That single comment has 5 grammatical errors!
Thanks,
Ingo
On Fri, 17 Apr 2015, viresh kumar wrote:
Does this look any better ?
Yes, but:
1.) Should the above list_empty(migration_list) block be moved outside the
while (time_after_eq(jiffies, base->timer_jiffies))
block, so that we check it only once per timer interrupt?
That's what I suggested.
Also, ->running_timer is still set to the last serviced timer, and a del_timer_sync() might be waiting to remove it, but we continue with the migration list first, without clearing it. Not sure if this is important at all.
Of course it is and doing it outside of the loop solves that issue.
2.) By the time we finish serving all pending timers, the local CPU might not be idle anymore, or the target CPU may have gone idle.
That's another good reason to move that migration list outside of the while loop.
3.) It might be better to update preferred_target every time we choose a different base. This may help us avoid calling get_nohz_timer_target() in the second if block above.
Yuck no. You can do that in that migration code and not add more pointless crap to the normal case.
Do you realize that __mod_timer() is a massive hotpath for network-heavy workloads?
This stuff needs to be debloated and not mindlessly packed with more corner case features.
Thanks,
tglx
On Tue, 2015-04-21 at 23:32 +0200, Thomas Gleixner wrote:
Do you realize that __mod_timer() is a massive hotpath for network-heavy workloads?
BTW I was considering using mod_timer_pinned() from these networking timers (ie sk_reset_timer())
get_nohz_timer_target() sounds cool for laptop users, but is one cause for bad responses to DDOS, when the selected cpu gets stressed.
This is the reason I used mod_timer_pinned() in commit 789f558cfb3680ae ("tcp/dccp: get rid of central timewait timer")
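For reference, the change I have in mind would look roughly like this (a sketch against sk_reset_timer() from memory, not a tested patch):

	--- a/net/core/sock.c
	+++ b/net/core/sock.c
	 void sk_reset_timer(struct sock *sk, struct timer_list *timer,
	 		    unsigned long expires)
	 {
	-	if (!mod_timer(timer, expires))
	+	/* stay on the arming cpu; skips get_nohz_timer_target() */
	+	if (!mod_timer_pinned(timer, expires))
	 		sock_hold(sk);
	 }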
On Tue, Apr 21, 2015 at 02:54:55PM -0700, Eric Dumazet wrote:
On Tue, 2015-04-21 at 23:32 +0200, Thomas Gleixner wrote:
Do you realize that __mod_timer() is a massive hotpath for network-heavy workloads?
BTW I was considering using mod_timer_pinned() from these networking timers (ie sk_reset_timer())
get_nohz_timer_target() sounds cool for laptop users, but is one cause for bad responses to DDOS, when the selected cpu gets stressed.
This is the reason I used mod_timer_pinned() in commit 789f558cfb3680ae ("tcp/dccp: get rid of central timewait timer")
Hmm, that sounds unfortunate, this would wreck life for the power aware laptop/tablet etc.. people.
There is already a sysctl to manage this, is that not enough to mitigate this problem on the server side of things?
On Wed, 2015-04-22 at 17:29 +0200, Peter Zijlstra wrote:
Hmm, that sounds unfortunate, this would wreck life for the power aware laptop/tablet etc.. people.
There is already a sysctl to manage this, is that not enough to mitigate this problem on the server side of things?
The thing is: 99% of networking timers never fire.

But when they _do_ fire, and the host is under attack, they all fire on an unrelated cpu, and that one cannot keep up.

The added latencies trigger monitoring alerts.
Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers") for a specific example of the problems that can be raised.
When we set a timer to fire in 10 seconds, knowing the _current_ idle state for cpus is of no help.
Add to this that softirq processing is not considered as making the current cpu non-idle.

Networking tried hard to use cpu affinities (and all the techniques described in Documentation/networking/scaling.txt), but /proc/sys/kernel/timer_migration adds a fair overhead in many workloads.
get_nohz_timer_target() has to touch 3 cache lines per cpu...
It's in the top 10 in "perf top" profiles on servers with 72 threads.

This /proc/sys/kernel/timer_migration should instead have been:
/proc/sys/kernel/timer_on_a_single_cpu_for_laptop_sake
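For those following along, the current get_nohz_timer_target() looks roughly like this (paraphrased from kernel/sched/core.c of this era; treat the details as approximate). The idle_cpu() check on every candidate cpu's runqueue while walking the sched domains is where the extra cache-line touches come from:

	int get_nohz_timer_target(int pinned)
	{
		int cpu = smp_processor_id();
		int i;
		struct sched_domain *sd;

		if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
			return cpu;

		rcu_read_lock();
		for_each_domain(cpu, sd) {
			/* touches remote runqueue state for every candidate cpu */
			for_each_cpu(i, sched_domain_span(sd)) {
				if (!idle_cpu(i)) {
					cpu = i;
					goto unlock;
				}
			}
		}
	unlock:
		rcu_read_unlock();
		return cpu;
	}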
On Wed, 22 Apr 2015, Eric Dumazet wrote:
Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers") for a specific example of the problems that can be raised.
If you have a problem with the core timer code then it should be fixed there and not worked around in some place which will ruin stuff for power saving interested users. I'm so tired of this 'I fix it in my sandbox' attitude, really. If the core code has a shortcoming we fix it there right away because you are probably not the only one who runs into that shortcoming. So if we don't fix it in the core we end up with a metric ton of slightly different (or broken) workarounds which affect the workload/system characteristics of other people.
Just for the record. Even the changelog of this commit is blatantly wrong:
"We can see that timers get migrated into a single cpu, presumably idle at the time timers are set up."
The timer migration moves timers to non idle cpus to leave the idle ones alone for power saving sake.
I can see and understand the reason why you want to avoid that, but I have to ask the question whether this pinning is the correct behaviour under all workloads and system characteristics. If yes, then the patch is the right answer, if no, then it is simply the wrong approach.
but /proc/sys/kernel/timer_migration adds a fair overhead in many workloads.
get_nohz_timer_target() has to touch 3 cache lines per cpu...
And this is something we can fix and completely avoid if we think about it. Looking at the code, I have to admit that the out-of-line call and the sysctl variable lookup are silly. But it's not rocket science to fix this.
Thanks,
tglx
On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
On Wed, 22 Apr 2015, Eric Dumazet wrote:
Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers") for a specific example of the problems that can be raised.
If you have a problem with the core timer code then it should be fixed there and not worked around in some place which will ruin stuff for power saving interested users. I'm so tired of this 'I fix it in my sandbox' attitude, really. If the core code has a shortcoming we fix it there right away because you are probably not the only one who runs into that shortcoming. So if we don't fix it in the core we end up with a metric ton of slightly different (or broken) workarounds which affect the workload/system characteristics of other people.
Just for the record. Even the changelog of this commit is blatantly wrong:
"We can see that timers get migrated into a single cpu, presumably idle at the time timers are set up."
Spare me the obvious typo. A 'not' is missing.
The timer migration moves timers to non idle cpus to leave the idle ones alone for power saving sake.
I can see and understand the reason why you want to avoid that, but I have to ask the question whether this pinning is the correct behaviour under all workloads and system characteristics. If yes, then the patch is the right answer, if no, then it is simply the wrong approach.
but /proc/sys/kernel/timer_migration adds a fair overhead in many workloads.
get_nohz_timer_target() has to touch 3 cache lines per cpu...
And this is something we can fix and completely avoid if we think about it. Looking at the code, I have to admit that the out-of-line call and the sysctl variable lookup are silly. But it's not rocket science to fix this.
Awesome.
On Wed, 22 Apr 2015, Eric Dumazet wrote:
On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
On Wed, 22 Apr 2015, Eric Dumazet wrote:
Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers") for a specific example of the problems that can be raised.
If you have a problem with the core timer code then it should be fixed there and not worked around in some place which will ruin stuff for power saving interested users. I'm so tired of this 'I fix it in my sandbox' attitude, really. If the core code has a shortcoming we fix it there right away because you are probably not the only one who runs into that shortcoming. So if we don't fix it in the core we end up with a metric ton of slightly different (or broken) workarounds which affect the workload/system characteristics of other people.
Just for the record. Even the changelog of this commit is blatantly wrong:
"We can see that timers get migrated into a single cpu, presumably idle at the time timers are set up."
Spare me the obvious typo. A 'not' is missing.
:)
The timer migration moves timers to non idle cpus to leave the idle ones alone for power saving sake.
I can see and understand the reason why you want to avoid that, but I have to ask the question whether this pinning is the correct behaviour under all workloads and system characteristics. If yes, then the patch is the right answer, if no, then it is simply the wrong approach.
I take your 'Awesome' below as a no then.
but /proc/sys/kernel/timer_migration adds a fair overhead in many workloads.
get_nohz_timer_target() has to touch 3 cache lines per cpu...
And this is something we can fix and completely avoid if we think about it. Looking at the code, I have to admit that the out-of-line call and the sysctl variable lookup are silly. But it's not rocket science to fix this.
Awesome.
Here you go. Completely untested, at least it compiles.
Thanks,
tglx ---
Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -343,14 +343,10 @@ extern int runqueue_is_locked(int cpu); #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) extern void nohz_balance_enter_idle(int cpu); extern void set_cpu_sd_state_idle(void); -extern int get_nohz_timer_target(int pinned); +extern int get_nohz_timer_target(void); #else static inline void nohz_balance_enter_idle(int cpu) { } static inline void set_cpu_sd_state_idle(void) { } -static inline int get_nohz_timer_target(int pinned) -{ - return smp_processor_id(); -} #endif
/* Index: linux/include/linux/sched/sysctl.h =================================================================== --- linux.orig/include/linux/sched/sysctl.h +++ linux/include/linux/sched/sysctl.h @@ -57,24 +57,12 @@ extern unsigned int sysctl_numa_balancin extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; extern unsigned int sysctl_sched_time_avg; -extern unsigned int sysctl_timer_migration; extern unsigned int sysctl_sched_shares_window;
int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos); #endif -#ifdef CONFIG_SCHED_DEBUG -static inline unsigned int get_sysctl_timer_migration(void) -{ - return sysctl_timer_migration; -} -#else -static inline unsigned int get_sysctl_timer_migration(void) -{ - return 1; -} -#endif
/* * control realtime throttling: Index: linux/include/linux/timer.h =================================================================== --- linux.orig/include/linux/timer.h +++ linux/include/linux/timer.h @@ -254,6 +254,16 @@ extern void run_local_timers(void); struct hrtimer; extern enum hrtimer_restart it_real_fn(struct hrtimer *);
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) +#include <linux/sysctl.h> + +extern unsigned int sysctl_timer_migration; +int timer_migration_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); +extern struct static_key timer_migration_enabled; +#endif + unsigned long __round_jiffies(unsigned long j, int cpu); unsigned long __round_jiffies_relative(unsigned long j, int cpu); unsigned long round_jiffies(unsigned long j); Index: linux/kernel/sched/core.c =================================================================== --- linux.orig/kernel/sched/core.c +++ linux/kernel/sched/core.c @@ -593,13 +593,12 @@ void resched_cpu(int cpu) * selecting an idle cpu will add more delays to the timers than intended * (as that cpu's timer base may not be uptodate wrt jiffies etc). */ -int get_nohz_timer_target(int pinned) +int get_nohz_timer_target(void) { - int cpu = smp_processor_id(); - int i; + int i, cpu = smp_processor_id(); struct sched_domain *sd;
- if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu)) + if (!idle_cpu(cpu)) return cpu;
rcu_read_lock(); @@ -7088,8 +7087,6 @@ void __init sched_init_smp(void) } #endif /* CONFIG_SMP */
-const_debug unsigned int sysctl_timer_migration = 1; - int in_sched_functions(unsigned long addr) { return in_lock_functions(addr) || Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -349,15 +349,6 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "timer_migration", - .data = &sysctl_timer_migration, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec_minmax, - .extra1 = &zero, - .extra2 = &one, - }, #endif /* CONFIG_SMP */ #ifdef CONFIG_NUMA_BALANCING { @@ -1132,6 +1123,15 @@ static struct ctl_table kern_table[] = { .extra1 = &zero, .extra2 = &one, }, +#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) + { + .procname = "timer_migration", + .data = &sysctl_timer_migration, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = timer_migration_handler, + }, +#endif { } };
Index: linux/kernel/time/hrtimer.c =================================================================== --- linux.orig/kernel/time/hrtimer.c +++ linux/kernel/time/hrtimer.c @@ -190,6 +190,23 @@ hrtimer_check_target(struct hrtimer *tim #endif }
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) +static inline struct hrtimer_cpu_base *get_target_base(int pinned) +{ + if (pinned) + return this_cpu_ptr(&hrtimer_bases); + if (static_key_true(&timer_migration_enabled)) + return &per_cpu(hrtimer_bases, get_nohz_timer_target()); + return this_cpu_ptr(&hrtimer_bases); +} +#else + +static inline struct hrtimer_cpu_base *get_target_base(int pinned) +{ + return this_cpu_ptr(&hrtimer_bases); +} +#endif + /* * Switch the timer base to the current CPU when possible. */ @@ -197,14 +214,13 @@ static inline struct hrtimer_clock_base switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, int pinned) { + struct hrtimer_cpu_base *new_cpu_base, *this_base; struct hrtimer_clock_base *new_base; - struct hrtimer_cpu_base *new_cpu_base; - int this_cpu = smp_processor_id(); - int cpu = get_nohz_timer_target(pinned); int basenum = base->index;
+ this_base = this_cpu_ptr(&hrtimer_bases); + new_cpu_base = get_target_base(pinned); again: - new_cpu_base = &per_cpu(hrtimer_bases, cpu); new_base = &new_cpu_base->clock_base[basenum];
if (base != new_base) { @@ -225,17 +241,19 @@ again: raw_spin_unlock(&base->cpu_base->lock); raw_spin_lock(&new_base->cpu_base->lock);
- if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) { - cpu = this_cpu; + if (new_cpu_base != this_base && + hrtimer_check_target(timer, new_base)) { raw_spin_unlock(&new_base->cpu_base->lock); raw_spin_lock(&base->cpu_base->lock); + new_cpu_base = this_base; timer->base = base; goto again; } timer->base = new_base; } else { - if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) { - cpu = this_cpu; + if (new_cpu_base != this_base && + hrtimer_check_target(timer, new_base)) { + new_cpu_base = this_base; goto again; } } Index: linux/kernel/time/timer.c =================================================================== --- linux.orig/kernel/time/timer.c +++ linux/kernel/time/timer.c @@ -104,6 +104,49 @@ EXPORT_SYMBOL(boot_tvec_bases);
static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) +unsigned int sysctl_timer_migration = 1; +struct static_key timer_migration_enabled = STATIC_KEY_INIT_TRUE; + +int timer_migration_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + static DEFINE_MUTEX(mutex); + bool keyon; + int ret; + + mutex_lock(&mutex); + ret = proc_dointvec(table, write, buffer, lenp, ppos); + if (ret || !write) + goto unlock; + + keyon = static_key_enabled(&timer_migration_enabled); + if (sysctl_timer_migration && !keyon ) + static_key_slow_inc(&timer_migration_enabled); + else if (!sysctl_timer_migration && keyon) + static_key_slow_dec(&timer_migration_enabled); + +unlock: + mutex_unlock(&mutex); + return ret; +} + +static inline struct tvec_base *get_target_base(int pinned) +{ + if (pinned) + return __this_cpu_read(tvec_bases); + if (static_key_true(&timer_migration_enabled)) + return per_cpu(tvec_bases, get_nohz_timer_target()); + return __this_cpu_read(tvec_bases); +} +#else +static inline struct tvec_base *get_target_base(int pinned) +{ + return __this_cpu_read(tvec_bases); +} +#endif + /* Functions below help us manage 'deferrable' flag */ static inline unsigned int tbase_get_deferrable(struct tvec_base *base) { @@ -774,7 +817,7 @@ __mod_timer(struct timer_list *timer, un { struct tvec_base *base, *new_base; unsigned long flags; - int ret = 0 , cpu; + int ret = 0;
timer_stats_timer_set_start_info(timer); BUG_ON(!timer->function); @@ -787,8 +830,7 @@ __mod_timer(struct timer_list *timer, un
debug_activate(timer, expires);
- cpu = get_nohz_timer_target(pinned); - new_base = per_cpu(tvec_bases, cpu); + new_base = get_target_base(pinned);
if (base != new_base) { /*
On Wed, 2015-04-22 at 23:56 +0200, Thomas Gleixner wrote:
-int get_nohz_timer_target(int pinned)
+int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
-	int i;
+	int i, cpu = smp_processor_id();
 	struct sched_domain *sd;

-	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+	if (!idle_cpu(cpu))
 		return cpu;
Maybe also test in_serving_softirq() ?
	if (in_serving_softirq() || !idle_cpu(cpu))
		return cpu;
There is a fundamental problem with networking load : Many cpus appear to be idle from scheduler perspective because no user/kernel task is running.
CPUs servicing NIC queues can be very busy handling thousands of packets per second, yet have no user/kernel task running.
idle_cpu() return code is : this cpu is idle. hmmmm, really ?
cpus are busy, *and* have to access alien data/locks to activate timers that hardly fire anyway.
When idle_cpu() finally gives the right indication, it is too late : ksoftirqd might be running on the wrong cpu. Innocent cpus, overwhelmed by a sudden timer load and locked into a service loop.
This cannot withstand a DoS, and even with non-malicious traffic the overhead is high.
Thanks.
On Wed, 22 Apr 2015, Eric Dumazet wrote:
On Wed, 2015-04-22 at 23:56 +0200, Thomas Gleixner wrote:
-int get_nohz_timer_target(int pinned)
+int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
-	int i;
+	int i, cpu = smp_processor_id();
 	struct sched_domain *sd;

-	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+	if (!idle_cpu(cpu))
 		return cpu;
Maybe also test in_serving_softirq() ?
	if (in_serving_softirq() || !idle_cpu(cpu))
		return cpu;
There is a fundamental problem with networking load : Many cpus appear to be idle from scheduler perspective because no user/kernel task is running.
CPUs servicing NIC queues can be very busy handling thousands of packets per second, yet have no user/kernel task running.
idle_cpu() return code is : this cpu is idle. hmmmm, really ?
cpus are busy, *and* have to access alien data/locks to activate timers that hardly fire anyway.
When idle_cpu() finally gives the right indication, it is too late : ksoftirqd might be running on the wrong cpu. Innocent cpus, overwhelmed by a sudden timer load and locked into a service loop.
This cannot withstand a DoS, and even with non-malicious traffic the overhead is high.
You definitely have a point from the high throughput networking perspective.
Though in a power optimizing scenario with minimal network traffic this might be the wrong decision. We have to gather data from the power maniacs whether this matters or not. The FULL_NO_HZ camp might be pretty unhappy about the above.
Thanks,
tglx
On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:
You definitely have a point from the high throughput networking perspective.
Though in a power optimizing scenario with minimal network traffic this might be the wrong decision. We have to gather data from the power maniacs whether this matters or not. The FULL_NO_HZ camp might be pretty unhappy about the above.
Sure, I understand.
To make this clear, here is the profile on a moderately loaded TCP server pushing ~20Gbit/s of data. Most TCP output is ACK-clock driven (thus from softirq context).

(Using regular sendmsg() system calls; that's why get_nohz_timer_target() is 'only' second in the profile, but add find_next_bit() to it and it is very close to being in first position.)
   PerfTop:    4712 irqs/sec  kernel:96.7%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
-----------------------------------------------------------------------------------------

    10.16%  [kernel]  [k] copy_user_enhanced_fast_string
     5.66%  [kernel]  [k] get_nohz_timer_target
     5.59%  [kernel]  [k] _raw_spin_lock
     2.53%  [kernel]  [k] __netif_receive_skb_core
     2.27%  [kernel]  [k] find_next_bit
     1.90%  [kernel]  [k] tcp_ack
Maybe a reasonable heuristic would be to change /proc/sys/kernel/timer_migration default to 0 on hosts with more than 32 cpus.
profile with timer_migration = 0
   PerfTop:    3656 irqs/sec  kernel:94.3%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
-----------------------------------------------------------------------------------------

    13.95%  [kernel]  [k] copy_user_enhanced_fast_string
     4.65%  [kernel]  [k] _raw_spin_lock
     2.57%  [kernel]  [k] __netif_receive_skb_core
     2.33%  [kernel]  [k] tcp_ack
Thanks.
On Sat, 25 Apr 2015, Eric Dumazet wrote:
On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:
You definitely have a point from the high throughput networking perspective.
Though in a power optimizing scenario with minimal network traffic this might be the wrong decision. We have to gather data from the power maniacs whether this matters or not. The FULL_NO_HZ camp might be pretty unhappy about the above.
Sure, I understand.
To make this clear, here is the profile on a moderately loaded TCP server pushing ~20Gbit/s of data. Most TCP output is ACK-clock driven (thus from softirq context).

(Using regular sendmsg() system calls; that's why get_nohz_timer_target() is 'only' second in the profile, but add find_next_bit() to it and it is very close to being in first position.)
PerfTop: 4712 irqs/sec kernel:96.7% exact: 0.0% [4000Hz cycles], (all, 72 CPUs)
    10.16%  [kernel]  [k] copy_user_enhanced_fast_string
     5.66%  [kernel]  [k] get_nohz_timer_target
     5.59%  [kernel]  [k] _raw_spin_lock
     2.53%  [kernel]  [k] __netif_receive_skb_core
     2.27%  [kernel]  [k] find_next_bit
     1.90%  [kernel]  [k] tcp_ack
Maybe a reasonable heuristic would be to change /proc/sys/kernel/timer_migration default to 0 on hosts with more than 32 cpus.
profile with timer_migration = 0
PerfTop: 3656 irqs/sec kernel:94.3% exact: 0.0% [4000Hz cycles], (all, 72 CPUs)
    13.95%  [kernel]  [k] copy_user_enhanced_fast_string
     4.65%  [kernel]  [k] _raw_spin_lock
     2.57%  [kernel]  [k] __netif_receive_skb_core
     2.33%  [kernel]  [k] tcp_ack
Is that with the static key patch applied?
Thanks,
tglx
On Tue, 2015-05-05 at 15:00 +0200, Thomas Gleixner wrote:
On Sat, 25 Apr 2015, Eric Dumazet wrote:
On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:
You definitely have a point from the high throughput networking perspective.
Though in a power optimizing scenario with minimal network traffic this might be the wrong decision. We have to gather data from the power maniacs whether this matters or not. The FULL_NO_HZ camp might be pretty unhappy about the above.
Sure, I understand.
To make this clear, here is the profile on a moderately loaded TCP server pushing ~20Gbit/s of data. Most TCP output is ACK-clock driven (thus from softirq context).

(Using regular sendmsg() system calls; that's why get_nohz_timer_target() is 'only' second in the profile, but add find_next_bit() to it and it is very close to being in first position.)
PerfTop: 4712 irqs/sec kernel:96.7% exact: 0.0% [4000Hz cycles], (all, 72 CPUs)
    10.16%  [kernel]  [k] copy_user_enhanced_fast_string
     5.66%  [kernel]  [k] get_nohz_timer_target
     5.59%  [kernel]  [k] _raw_spin_lock
     2.53%  [kernel]  [k] __netif_receive_skb_core
     2.27%  [kernel]  [k] find_next_bit
     1.90%  [kernel]  [k] tcp_ack
Maybe a reasonable heuristic would be to change /proc/sys/kernel/timer_migration default to 0 on hosts with more than 32 cpus.
profile with timer_migration = 0
PerfTop: 3656 irqs/sec kernel:94.3% exact: 0.0% [4000Hz cycles], (all, 72 CPUs)
    13.95%  [kernel]  [k] copy_user_enhanced_fast_string
     4.65%  [kernel]  [k] _raw_spin_lock
     2.57%  [kernel]  [k] __netif_receive_skb_core
     2.33%  [kernel]  [k] tcp_ack
Is that with the static key patch applied?
This was without.
I applied your patch on the current Linus tree, but for some reason my 72-cpu host is not liking the resulting kernel. I had to ask for a repair, and this might take a while.

Note that your kernel works correctly on other hosts, but those have 48 or 32 cpus, so this must be something unrelated.
I'll let you know when I get more interesting data.
Thanks
On Tue, 31 Mar 2015, Viresh Kumar wrote:
@@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
call_timer_fn(timer, fn, data);
spin_lock_irq(&base->lock);
}
/*
* Handler running on this base, re-queued itself on
* another base ?
*/
if (unlikely(timer->base != base)) {
unsigned long flags;
struct tvec_base *tbase;
spin_unlock(&base->lock);
tbase = lock_timer_base(timer, &flags);
timer_clear_running(timer);
spin_unlock(&tbase->lock);
spin_lock(&base->lock);
} else {
timer_clear_running(timer);
}
And just for the record:
Dereferencing timer _AFTER_ the callback function is a big NONO. The callback function is allowed to free the timer. See the comment in call_timer_fn()
Oh well,
tglx
We don't need to track the running timer for a cpu base anymore, but we still need to know whether the base is busy, for a sanity check during CPU hot-unplug.

Let's replace 'running_timer' with 'busy' to handle that efficiently.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/time/timer.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 364644811485..2db05206594b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -77,7 +77,7 @@ struct tvec_root {
 
 struct tvec_base {
 	spinlock_t lock;
-	struct timer_list *running_timer;
+	bool busy;
 	unsigned long timer_jiffies;
 	unsigned long next_timer;
 	unsigned long active_timers;
@@ -1180,6 +1180,8 @@ static inline void __run_timers(struct tvec_base *base)
 		spin_unlock_irq(&base->lock);
 		return;
 	}
+
+	base->busy = true;
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
 		struct list_head work_list;
 		struct list_head *head = &work_list;
@@ -1236,7 +1238,6 @@ static inline void __run_timers(struct tvec_base *base)
 
 			timer_stats_account_timer(timer);
 
-			base->running_timer = timer;
 			timer_set_running(timer);
 			detach_expired_timer(timer, base);
 
@@ -1270,7 +1271,7 @@ static inline void __run_timers(struct tvec_base *base)
 			}
 		}
 	}
-	base->running_timer = NULL;
+	base->busy = false;
 	spin_unlock_irq(&base->lock);
 }
 
@@ -1675,7 +1676,7 @@ static void migrate_timers(int cpu)
 	spin_lock_irq(&new_base->lock);
 	spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
-	BUG_ON(old_base->running_timer);
+	BUG_ON(old_base->busy);
 
 	for (i = 0; i < TVR_SIZE; i++)
 		migrate_timer_list(new_base, old_base->tv1.vec + i);
On 31 March 2015 at 12:25, Viresh Kumar viresh.kumar@linaro.org wrote:
While queuing a timer, we try to migrate it to a non-idle core if the local core is idle, but we don't try that if the timer is re-armed from its handler.
There were a few unsolved problems because of which this was avoided until now, but there are cases where solving them can be useful. When a timer is always re-armed from its handler, it never migrates to other cores, and it often ends up waking an idle core just to service the timer, which a non-idle core could have handled.
Peter suggested [1] a few changes which can make that work, and the first patch does exactly that. The second one is a minor improvement that replaces the 'running_timer' pointer with a 'busy' flag. That variable is only required for a sanity check during the CPU hot-unplug operation; I was not sure whether we should drop this extra variable ('running_timer' or 'busy') and the sanity check altogether.
Peter reminded me that I failed to tell if it really worked or not :)
So yes, it worked. I tested this on a dual-core ARM Cortex-A15 board with 5 timers getting re-armed 100 times each from their handlers.
Most of the time the remote CPU was idle (along with the local one) and so migration didn't happen.
But whenever the local CPU was idle and the remote one wasn't, timers were successfully migrated.

My branches are also tested by Fengguang's build bot, so in case of any wreckage on other machines we will be informed :)
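For reference, the test was roughly along these lines (a simplified module sketch, not the exact code I used):

	#include <linux/module.h>
	#include <linux/timer.h>
	#include <linux/jiffies.h>
	#include <linux/smp.h>

	#define NR_TIMERS	5
	#define NR_REARMS	100

	static struct timer_list test_timers[NR_TIMERS];
	static int rearms_left[NR_TIMERS];

	static void test_timer_fn(unsigned long idx)
	{
		pr_debug("timer %lu ran on cpu %d\n", idx, smp_processor_id());

		/* re-arm from the handler: the path this series changes */
		if (--rearms_left[idx] > 0)
			mod_timer(&test_timers[idx], jiffies + HZ / 100);
	}

	static int __init rearm_test_init(void)
	{
		int i;

		for (i = 0; i < NR_TIMERS; i++) {
			rearms_left[i] = NR_REARMS;
			setup_timer(&test_timers[i], test_timer_fn, i);
			mod_timer(&test_timers[i], jiffies + HZ / 100);
		}
		return 0;
	}

	static void __exit rearm_test_exit(void)
	{
		int i;

		for (i = 0; i < NR_TIMERS; i++)
			del_timer_sync(&test_timers[i]);
	}

	module_init(rearm_test_init);
	module_exit(rearm_test_exit);
	MODULE_LICENSE("GPL");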
-- viresh