Hi,
'active_bases' indicates which clock-bases have active timers. While it is updated (almost) correctly, it is hardly used anywhere. This series is an attempt to improve the code that iterates over all clock-bases.

The first patch fixes an issue that would result in a bug after the second commit, and the second commit creates a macro, for_each_active_base(), and uses it at multiple places.
V1->V2:
- Dropped ffs() and wrote own routine __next_bit().
Viresh Kumar (2):
  hrtimer: update '->active_bases' before calling hrtimer_force_reprogram()
  hrtimer: create for_each_active_base() to iterate over active clock-bases

 kernel/time/hrtimer.c | 70 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 26 deletions(-)
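For the record, v1 built the iterator around ffs(); a rough sketch of that dropped approach is below (a hypothetical reconstruction, not the actual v1 code -- ffs() returns the 1-based index of the least significant set bit, or 0 when no bit is set):

	/*
	 * Hypothetical sketch of the dropped ffs() variant: pick the lowest
	 * set bit, visit that clock base, then clear the bit and repeat.
	 */
	#define for_each_active_base_v1(_bit, _base, _cpu_base, _active_bases)	\
		for ((_active_bases) = (_cpu_base)->active_bases;		\
		     (_active_bases) &&						\
		     ((_bit) = ffs(_active_bases) - 1,				\
		      (_base) = (_cpu_base)->clock_base + (_bit));		\
		     (_active_bases) &= ~(1 << (_bit)))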
'active_bases' indicates which clock-bases have active timers. While it is updated correctly, it is hardly used. The next commit will start using it to make the code more efficient, but before that we need to fix a problem.

While removing hrtimers, in __remove_hrtimer():
- We first remove the hrtimer from the queue.
- Then we reprogram the clockevent device if required (hrtimer_force_reprogram()).
- And then we finally clear 'active_bases', if no more timers are pending on the current clock base (from which we are removing the hrtimer).

hrtimer_force_reprogram() needs to loop over all active clock bases to find the next expiry event, and while doing so it will (after the next commit) consult 'active_bases'. It will find the current base still marked active, as we haven't cleared its bit yet, even though the current clock base has no more hrtimers queued.

The next commit will skip validating what timerqueue_getnext() returns, as that is guaranteed to be valid for an active base, and the above problem will then result in a crash (because timerqueue_getnext() will return NULL for the current clock base).
So, fix this issue by clearing active_bases before calling hrtimer_force_reprogram().
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/time/hrtimer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index bee0c1f78091..3152f327c988 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -879,6 +879,9 @@ static void __remove_hrtimer(struct hrtimer *timer,

 	next_timer = timerqueue_getnext(&base->active);
 	timerqueue_del(&base->active, &timer->node);
+	if (!timerqueue_getnext(&base->active))
+		base->cpu_base->active_bases &= ~(1 << base->index);
+
 	if (&timer->node == next_timer) {
 #ifdef CONFIG_HIGH_RES_TIMERS
 		/* Reprogram the clock event device. if enabled */
@@ -892,8 +895,6 @@ static void __remove_hrtimer(struct hrtimer *timer,
 		}
 #endif
 	}
-	if (!timerqueue_getnext(&base->active))
-		base->cpu_base->active_bases &= ~(1 << base->index);
 out:
 	timer->state = newstate;
 }
At several places we iterate over all possible clock-bases for a particular cpu-base, whereas we only need to iterate over the active ones.
We already have per cpu-base 'active_bases' field, which is updated on addition/removal of hrtimers.
This patch creates for_each_active_base(), which uses 'active_bases' to iterate only over active bases.
This also updates code which iterates over clock-bases.
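For reference, the bookkeeping this relies on boils down to a single set and a single clear, condensed here from enqueue_hrtimer() and (after the previous patch) __remove_hrtimer():

	/* on enqueue: mark this clock base active */
	base->cpu_base->active_bases |= 1 << base->index;

	/* on removing the last queued timer: clear the bit again */
	if (!timerqueue_getnext(&base->active))
		base->cpu_base->active_bases &= ~(1 << base->index);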
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/time/hrtimer.c | 65 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 24 deletions(-)
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3152f327c988..9da63e9ee63b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -110,6 +110,31 @@ static inline int hrtimer_clockid_to_base(clockid_t clock_id)
 }

+static inline int __next_bit(unsigned int active_bases, int bit)
+{
+	do {
+		if (active_bases & (1 << bit))
+			return bit;
+	} while (++bit < HRTIMER_MAX_CLOCK_BASES);
+
+	/* We should never reach here */
+	return 0;
+}
+
+/*
+ * for_each_active_base: iterate over all active clock bases
+ * @_bit: 'int' variable for internal purpose
+ * @_base: holds pointer to an active clock base
+ * @_cpu_base: cpu base to iterate on
+ * @_active_bases: 'unsigned int' variable for internal purpose
+ */
+#define for_each_active_base(_bit, _base, _cpu_base, _active_bases)	\
+	for ((_active_bases) = (_cpu_base)->active_bases, (_bit) = -1;	\
+	     (_active_bases) &&						\
+	     ((_bit) = __next_bit(_active_bases, ++_bit),		\
+	      (_base) = (_cpu_base)->clock_base + _bit);		\
+	     (_active_bases) &= ~(1 << (_bit)))
+
 /*
  * Get the coarse grained time at the softirq based on xtime and
  * wall_to_monotonic.
@@ -443,19 +468,15 @@ static inline void debug_deactivate(struct hrtimer *timer)
 #if defined(CONFIG_NO_HZ_COMMON) || defined(CONFIG_HIGH_RES_TIMERS)
 static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base)
 {
-	struct hrtimer_clock_base *base = cpu_base->clock_base;
+	struct hrtimer_clock_base *base;
 	ktime_t expires, expires_next = { .tv64 = KTIME_MAX };
+	struct hrtimer *timer;
+	unsigned int active_bases;
 	int i;
-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
-		struct timerqueue_node *next;
-		struct hrtimer *timer;
-
-		next = timerqueue_getnext(&base->active);
-		if (!next)
-			continue;
-
-		timer = container_of(next, struct hrtimer, node);
+	for_each_active_base(i, base, cpu_base, active_bases) {
+		timer = container_of(timerqueue_getnext(&base->active),
+				     struct hrtimer, node);
 		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
 		if (expires.tv64 < expires_next.tv64)
 			expires_next = expires;
@@ -1245,6 +1266,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 {
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	ktime_t expires_next, now, entry_time, delta;
+	struct hrtimer_clock_base *base;
+	unsigned int active_bases;
 	int i, retries = 0;

 	BUG_ON(!cpu_base->hres_active);
@@ -1264,15 +1287,10 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	 */
 	cpu_base->expires_next.tv64 = KTIME_MAX;

-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		struct hrtimer_clock_base *base;
+	for_each_active_base(i, base, cpu_base, active_bases) {
 		struct timerqueue_node *node;
 		ktime_t basenow;

-		if (!(cpu_base->active_bases & (1 << i)))
-			continue;
-
-		base = cpu_base->clock_base + i;
 		basenow = ktime_add(now, base->offset);

 		while ((node = timerqueue_getnext(&base->active))) {
@@ -1435,16 +1453,13 @@ void hrtimer_run_queues(void)
 	struct timerqueue_node *node;
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *base;
+	unsigned int active_bases;
 	int index, gettime = 1;

 	if (hrtimer_hres_active())
 		return;

-	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
-		base = &cpu_base->clock_base[index];
-		if (!timerqueue_getnext(&base->active))
-			continue;
-
+	for_each_active_base(index, base, cpu_base, active_bases) {
 		if (gettime) {
 			hrtimer_get_softirq_time(cpu_base);
 			gettime = 0;
@@ -1665,6 +1680,8 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 static void migrate_hrtimers(int scpu)
 {
 	struct hrtimer_cpu_base *old_base, *new_base;
+	struct hrtimer_clock_base *clock_base;
+	unsigned int active_bases;
 	int i;

 	BUG_ON(cpu_online(scpu));
@@ -1680,9 +1697,9 @@ static void migrate_hrtimers(int scpu)
 	raw_spin_lock(&new_base->lock);
 	raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

-	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		migrate_hrtimer_list(&old_base->clock_base[i],
-				     &new_base->clock_base[i]);
+	for_each_active_base(i, clock_base, old_base, active_bases) {
+		migrate_hrtimer_list(clock_base,
+				     &new_base->clock_base[clock_base->index]);
 	}
raw_spin_unlock(&old_base->lock);
On Tue, Apr 07, 2015 at 07:40:53AM +0530, Viresh Kumar wrote:
At several places we iterate over all possible clock-bases for a particular cpu-base, whereas we only need to iterate over the active ones.
We already have per cpu-base 'active_bases' field, which is updated on addition/removal of hrtimers.
This patch creates for_each_active_base(), which uses 'active_bases' to iterate only over active bases.
This also updates code which iterates over clock-bases.
This Changelog is very thin on compelling reasons to do this. Not to mention you did no analysis on the code generated.
On Tue, 7 Apr 2015, Viresh Kumar wrote:
At several places we iterate over all possible clock-bases for a particular cpu-base, whereas we only need to iterate over the active ones.
We already have per cpu-base 'active_bases' field, which is updated on addition/removal of hrtimers.
This patch creates for_each_active_base(), which uses 'active_bases' to iterate only over active bases.
I'm really not too excited about this incomprehensible macro mess and especially not about the code it generates.
           x86_64    i386     ARM    power
Mainline     7668    6942    8077    10253
+ Patch      8068    7294    8313    10861
             +400    +352    +236     +608
That's insane.
What's wrong with just adding
	if (!(cpu_base->active_bases & (1 << i)))
		continue;
to the iterating sites?
Index: linux/kernel/time/hrtimer.c
===================================================================
--- linux.orig/kernel/time/hrtimer.c
+++ linux/kernel/time/hrtimer.c
@@ -451,6 +451,9 @@ static ktime_t __hrtimer_get_next_event(
 		struct timerqueue_node *next;
 		struct hrtimer *timer;

+		if (!(cpu_base->active_bases & (1 << i)))
+			continue;
+
 		next = timerqueue_getnext(&base->active);
 		if (!next)
 			continue;
@@ -1441,6 +1444,9 @@ void hrtimer_run_queues(void)
 		return;

 	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
+		if (!(cpu_base->active_bases & (1 << index)))
+			continue;
+
 		base = &cpu_base->clock_base[index];
 		if (!timerqueue_getnext(&base->active))
 			continue;
@@ -1681,6 +1687,8 @@ static void migrate_hrtimers(int scpu)
 	raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+		if (!(old_base->active_bases & (1 << i)))
+			continue;
 		migrate_hrtimer_list(&old_base->clock_base[i],
 				     &new_base->clock_base[i]);
 	}
Now the code size increase is in a sane range:
           x86_64    i386     ARM    power
Mainline     7668    6942    8077    10253
+ patch      7748    6990    8113    10365
              +80     +48     +36     +112
So your variant is at least 5 times larger than the simple and obvious solution.
I told you before to look at the resulting binary code changes and sanity check whether they are in the right ball park.
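For anyone wanting to reproduce such numbers: they are text sizes of hrtimer.o as reported by size(1), along the lines of the transcript below (the text column matches the mainline x86_64 figure above; the data/bss/dec values shown are illustrative from one build, not quoted measurements):

	$ make kernel/time/hrtimer.o
	$ size kernel/time/hrtimer.o
	   text    data     bss     dec     hex filename
	   7668     427       0    8095    1f9f kernel/time/hrtimer.o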
Thanks,
tglx
On 9 April 2015 at 01:41, Thomas Gleixner tglx@linutronix.de wrote:
I'm really not too excited about this incomprehensible macro mess and especially not about the code it generates.
           x86_64    i386     ARM    power
Mainline     7668    6942    8077    10253
Patch        8068    7294    8313    10861
             +400    +352    +236     +608
That's insane.
After Peter's mail yesterday, I did check it on x86_64 and it surely looked a lot bigger.
What's wrong with just adding
	if (!(cpu_base->active_bases & (1 << i)))
		continue;
to the iterating sites?
Index: linux/kernel/time/hrtimer.c
--- linux.orig/kernel/time/hrtimer.c
+++ linux/kernel/time/hrtimer.c
@@ -451,6 +451,9 @@ static ktime_t __hrtimer_get_next_event(
 		struct timerqueue_node *next;
 		struct hrtimer *timer;

+		if (!(cpu_base->active_bases & (1 << i)))
+			continue;
+
 		next = timerqueue_getnext(&base->active);
 		if (!next)
 			continue;
Isn't the check we already have here lightweight enough for this? timerqueue_getnext() just returns head->next.

What benefit are we getting from this extra check?
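For reference, timerqueue_getnext() at the time was a trivial inline in include/linux/timerqueue.h, roughly:

	/* returns the cached leftmost (earliest expiring) node, or NULL */
	static inline
	struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
	{
		return head->next;
	}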
Maybe we can drop 'active_bases' from struct hrtimer_cpu_base then?

'active_bases' can be used effectively if we can quit early from this loop, i.e. by checking for !active_bases on every iteration. But that generates a lot more code and is probably not that helpful for the small loop size we have here.
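To illustrate what quitting early would mean, a sketch (illustrative only, not proposed code):

	unsigned int active_bases = cpu_base->active_bases;
	int i;

	/* the loop ends as soon as every active base has been visited */
	for (i = 0; active_bases; i++) {
		if (!(active_bases & (1 << i)))
			continue;

		active_bases &= ~(1 << i);
		/* ... process cpu_base->clock_base[i] ... */
	}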
-- viresh
* Thomas Gleixner tglx@linutronix.de wrote:
On Tue, 7 Apr 2015, Viresh Kumar wrote:
At several places we iterate over all possible clock-bases for a particular cpu-base, whereas we only need to iterate over the active ones.
We already have per cpu-base 'active_bases' field, which is updated on addition/removal of hrtimers.
This patch creates for_each_active_base(), which uses 'active_bases' to iterate only over active bases.
I'm really not too excited about this incomprehensible macro mess and especially not about the code it generates.
           x86_64    i386     ARM    power
Mainline     7668    6942    8077    10253
Patch        8068    7294    8313    10861
             +400    +352    +236     +608
That's insane.
What's wrong with just adding
	if (!(cpu_base->active_bases & (1 << i)))
		continue;
to the iterating sites?
Index: linux/kernel/time/hrtimer.c
--- linux.orig/kernel/time/hrtimer.c
+++ linux/kernel/time/hrtimer.c
@@ -451,6 +451,9 @@ static ktime_t __hrtimer_get_next_event(
 		struct timerqueue_node *next;
 		struct hrtimer *timer;

+		if (!(cpu_base->active_bases & (1 << i)))
+			continue;
+
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
So I'd rather suggest taking a direct look at the head, instead of calculating bit positions, masks, etc.
Furthermore, we never actually use cpu_base->active_bases as a 'summary' value (which is the main point of bitmasks in general), so I'd remove that complication altogether.
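(A 'summary' use would be a cheap early-out along these lines, which nothing in the current code does:)

	/* hypothetical summary check, shown for illustration only */
	if (!cpu_base->active_bases)
		return;		/* no base on this CPU has any timer queued */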
This would speed up various hrtimer primitives like hrtimer_remove()/add and simplify the code. It would be a net code shrink as well.
Totally untested patch below. It gives:
   text    data     bss     dec     hex filename
   7502     427       0    7929    1ef9 hrtimer.o.before
   7422     427       0    7849    1ea9 hrtimer.o.after
and half of that code removal is from hot paths.
This would simplify the followup step of skipping over inactive bases as well.
Thanks,
Ingo
 include/linux/hrtimer.h | 2 --
 kernel/time/hrtimer.c   | 7 ++-----
 2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 05f6df1fdf5b..d65b858a94c1 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -166,7 +166,6 @@ enum hrtimer_base_type {
  * @lock:		lock protecting the base and associated clock bases
  *			and timers
  * @cpu:		cpu number
- * @active_bases:	Bitfield to mark bases with active timers
  * @clock_was_set:	Indicates that clock was set from irq context.
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
@@ -182,7 +181,6 @@ enum hrtimer_base_type {
 struct hrtimer_cpu_base {
 	raw_spinlock_t			lock;
 	unsigned int			cpu;
-	unsigned int			active_bases;
 	unsigned int			clock_was_set;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	ktime_t				expires_next;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 76d4bd962b19..63e21df6c086 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -848,7 +848,6 @@ static int enqueue_hrtimer(struct hrtimer *timer,
 	debug_activate(timer);

 	timerqueue_add(&base->active, &timer->node);
-	base->cpu_base->active_bases |= 1 << base->index;

 	/*
 	 * HRTIMER_STATE_ENQUEUED is or'ed to the current state to preserve the
@@ -892,8 +891,6 @@ static void __remove_hrtimer(struct hrtimer *timer,
 		}
 #endif
 	}
-	if (!timerqueue_getnext(&base->active))
-		base->cpu_base->active_bases &= ~(1 << base->index);
 out:
 	timer->state = newstate;
 }
@@ -1268,10 +1265,10 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 		struct timerqueue_node *node;
 		ktime_t basenow;

-		if (!(cpu_base->active_bases & (1 << i)))
+		base = cpu_base->clock_base + i;
+		if (!base->active.next)
 			continue;

-		base = cpu_base->clock_base + i;
 		basenow = ktime_add(now, base->offset);

 		while ((node = timerqueue_getnext(&base->active))) {
* Ingo Molnar mingo@kernel.org wrote:
This would speed up various hrtimer primitives like hrtimer_remove()/add and simplify the code. It would be a net code shrink as well.
Totally untested patch below. It gives:
   text    data     bss     dec     hex filename
   7502     427       0    7929    1ef9 hrtimer.o.before
   7422     427       0    7849    1ea9 hrtimer.o.after
and half of that code removal is from hot paths.
This would simplify the followup step of skipping over inactive bases as well.
The followup step is attached below (untested as well).
Note that all other iterations already had a check for active.next, so the patch doesn't even bloat anything:
   text    data     bss     dec     hex filename
   7422     427       0    7849    1ea9 hrtimer.o.before
   7422     427       0    7849    1ea9 hrtimer.o.after
(I did a rename within migrate_hrtimers() because it used 'cpu_base' vs 'clock_base' inconsistently in a confusing (to me) manner.)
I'd also suggest the removal of the timerqueue_getnext() obfuscation: it 'sounds' complex but in reality it's a simple dereference to active.next. I think this is what triggered this rather pointless maintenance of active_bases.
Thanks,
Ingo
---
 kernel/time/hrtimer.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)
Index: tip/kernel/time/hrtimer.c
===================================================================
--- tip.orig/kernel/time/hrtimer.c
+++ tip/kernel/time/hrtimer.c
@@ -1660,29 +1660,35 @@ static void migrate_hrtimer_list(struct

 static void migrate_hrtimers(int scpu)
 {
-	struct hrtimer_cpu_base *old_base, *new_base;
+	struct hrtimer_cpu_base *old_cpu_base, *new_cpu_base;
 	int i;

 	BUG_ON(cpu_online(scpu));
 	tick_cancel_sched_timer(scpu);

 	local_irq_disable();
-	old_base = &per_cpu(hrtimer_bases, scpu);
-	new_base = this_cpu_ptr(&hrtimer_bases);
+	old_cpu_base = &per_cpu(hrtimer_bases, scpu);
+	new_cpu_base = this_cpu_ptr(&hrtimer_bases);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
 	 */
-	raw_spin_lock(&new_base->lock);
-	raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
+	raw_spin_lock(&new_cpu_base->lock);
+	raw_spin_lock_nested(&old_cpu_base->lock, SINGLE_DEPTH_NESTING);

 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		migrate_hrtimer_list(&old_base->clock_base[i],
-				     &new_base->clock_base[i]);
+		struct hrtimer_clock_base *old_base = old_cpu_base->clock_base + i;
+		struct hrtimer_clock_base *new_base;
+
+		if (!old_base->active.next)
+			continue;
+
+		new_base = new_cpu_base->clock_base + i;
+		migrate_hrtimer_list(old_base, new_base);
 	}

-	raw_spin_unlock(&old_base->lock);
-	raw_spin_unlock(&new_base->lock);
+	raw_spin_unlock(&old_cpu_base->lock);
+	raw_spin_unlock(&new_cpu_base->lock);

 	/* Check, if we got expired work to do */
 	__hrtimer_peek_ahead_timers();
* Ingo Molnar mingo@kernel.org wrote:
I'd also suggest the removal of the timerqueue_getnext() obfuscation: it 'sounds' complex but in reality it's a simple dereference to active.next. I think this is what triggered this rather pointless maintenance of active_bases.
The patch below does this - it's more readable to me in this fashion: the 'next' field is already very clearly named, so there's no need for a (longer!) helper function there.
I lightly tested the 3 patches with hrtimers enabled under a virtual machine.
Thanks,
Ingo
---
 kernel/time/hrtimer.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
Index: tip/kernel/time/hrtimer.c
===================================================================
--- tip.orig/kernel/time/hrtimer.c
+++ tip/kernel/time/hrtimer.c
@@ -451,7 +451,7 @@ static ktime_t __hrtimer_get_next_event(
 		struct timerqueue_node *next;
 		struct hrtimer *timer;

-		next = timerqueue_getnext(&base->active);
+		next = base->active.next;
 		if (!next)
 			continue;

@@ -876,7 +876,7 @@ static void __remove_hrtimer(struct hrti
 	if (!(timer->state & HRTIMER_STATE_ENQUEUED))
 		goto out;

-	next_timer = timerqueue_getnext(&base->active);
+	next_timer = base->active.next;
 	timerqueue_del(&base->active, &timer->node);
 	if (&timer->node == next_timer) {
 #ifdef CONFIG_HIGH_RES_TIMERS
@@ -1271,7 +1271,7 @@ retry:

 		basenow = ktime_add(now, base->offset);

-		while ((node = timerqueue_getnext(&base->active))) {
+		while ((node = base->active.next)) {
 			struct hrtimer *timer;

 			timer = container_of(node, struct hrtimer, node);
@@ -1438,7 +1438,7 @@ void hrtimer_run_queues(void)

 	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
 		base = &cpu_base->clock_base[index];
-		if (!timerqueue_getnext(&base->active))
+		if (!base->active.next)
 			continue;

 		if (gettime) {
@@ -1448,7 +1448,7 @@ void hrtimer_run_queues(void)

 		raw_spin_lock(&cpu_base->lock);

-		while ((node = timerqueue_getnext(&base->active))) {
+		while ((node = base->active.next)) {
 			struct hrtimer *timer;

 			timer = container_of(node, struct hrtimer, node);
@@ -1631,7 +1631,7 @@ static void migrate_hrtimer_list(struct
 	struct hrtimer *timer;
 	struct timerqueue_node *node;

-	while ((node = timerqueue_getnext(&old_base->active))) {
+	while ((node = old_base->active.next)) {
 		timer = container_of(node, struct hrtimer, node);
 		BUG_ON(hrtimer_callback_running(timer));
 		debug_deactivate(timer);
We iterate over the timerqueue_node list of base->active.next frequently, but the iterator variables are a colorful and inconsistent mix: 'next', 'next_timer' (!), 'node' and 'node_next'.
For the 5 iterations we've invented 4 separate names for the same thing ... sigh.
So standardize the naming to 'node_next': this both tells us that it's an rbtree node, and that it's also a next element in a list.
No code changed:
   text    data     bss     dec     hex filename
   7754     427       0    8181    1ff5 hrtimer.o.before
   7754     427       0    8181    1ff5 hrtimer.o.after

md5:
   de9af3b01d50c72af2d8b026ed22704a  hrtimer.o.before.asm
   de9af3b01d50c72af2d8b026ed22704a  hrtimer.o.after.asm
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/time/hrtimer.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)
Index: tip/kernel/time/hrtimer.c
===================================================================
--- tip.orig/kernel/time/hrtimer.c
+++ tip/kernel/time/hrtimer.c
@@ -448,14 +448,14 @@ static ktime_t __hrtimer_get_next_event(
 	int i;

 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
-		struct timerqueue_node *next;
+		struct timerqueue_node *node_next;
 		struct hrtimer *timer;

-		next = base->active.next;
-		if (!next)
+		node_next = base->active.next;
+		if (!node_next)
 			continue;

-		timer = container_of(next, struct hrtimer, node);
+		timer = container_of(node_next, struct hrtimer, node);
 		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
 		if (expires.tv64 < expires_next.tv64)
 			expires_next = expires;
@@ -872,13 +872,13 @@ static void __remove_hrtimer(struct hrti
 			     struct hrtimer_clock_base *base,
 			     unsigned long newstate, int reprogram)
 {
-	struct timerqueue_node *next_timer;
+	struct timerqueue_node *node_next;
 	if (!(timer->state & HRTIMER_STATE_ENQUEUED))
 		goto out;

-	next_timer = base->active.next;
+	node_next = base->active.next;
 	timerqueue_del(&base->active, &timer->node);
-	if (&timer->node == next_timer) {
+	if (&timer->node == node_next) {
 #ifdef CONFIG_HIGH_RES_TIMERS
 		/* Reprogram the clock event device. if enabled */
 		if (reprogram && hrtimer_hres_active()) {
@@ -1262,7 +1262,7 @@ retry:

 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		struct hrtimer_clock_base *base;
-		struct timerqueue_node *node;
+		struct timerqueue_node *node_next;
 		ktime_t basenow;

 		base = cpu_base->clock_base + i;
@@ -1271,10 +1271,10 @@ retry:

 		basenow = ktime_add(now, base->offset);

-		while ((node = base->active.next)) {
+		while ((node_next = base->active.next)) {
 			struct hrtimer *timer;

-			timer = container_of(node, struct hrtimer, node);
+			timer = container_of(node_next, struct hrtimer, node);

 			/*
 			 * The immediate goal for using the softexpires is
@@ -1428,7 +1428,7 @@ void hrtimer_run_pending(void)
  */
 void hrtimer_run_queues(void)
 {
-	struct timerqueue_node *node;
+	struct timerqueue_node *node_next;
 	struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
 	struct hrtimer_clock_base *base;
 	int index, gettime = 1;
@@ -1448,10 +1448,10 @@ void hrtimer_run_queues(void)

 		raw_spin_lock(&cpu_base->lock);

-		while ((node = base->active.next)) {
+		while ((node_next = base->active.next)) {
 			struct hrtimer *timer;

-			timer = container_of(node, struct hrtimer, node);
+			timer = container_of(node_next, struct hrtimer, node);
 			if (base->softirq_time.tv64 <=
 					hrtimer_get_expires_tv64(timer))
 				break;
@@ -1629,10 +1629,10 @@ static void migrate_hrtimer_list(struct
 					  struct hrtimer_clock_base *new_base)
 {
 	struct hrtimer *timer;
-	struct timerqueue_node *node;
+	struct timerqueue_node *node_next;

-	while ((node = old_base->active.next)) {
-		timer = container_of(node, struct hrtimer, node);
+	while ((node_next = old_base->active.next)) {
+		timer = container_of(node_next, struct hrtimer, node);
 		BUG_ON(hrtimer_callback_running(timer));
 		debug_deactivate(timer);
On Thu, Apr 09, 2015 at 08:28:41AM +0200, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
Yeah, that's 3 pointer dereferences from cpu_base, iow you traded a single bit test on an already loaded word for 3 potential cacheline misses.
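Spelled out, the two checks being weighed are roughly:

	/* bit test: one word at the head of hrtimer_cpu_base */
	if (cpu_base->active_bases & (1 << i))
		; /* base i has active timers */

	/* direct test: reads into the i-th clock base itself */
	if (cpu_base->clock_base[i].active.next)
		; /* base i has active timers */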
* Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 09, 2015 at 08:28:41AM +0200, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
Yeah, that's 3 pointer dereferences from cpu_base, iow you traded a single bit test on an already loaded word for 3 potential cacheline misses.
But the clock bases are not aligned to cachelines, and we have 4 of them. So in practice when we access one, we'll load the next one anyway.
Furthermore the simplification is measurable, and a fair bit of it is in various fast paths. I'd rather trade a bit of a cacheline footprint for less overall complexity and faster code.
Thanks,
Ingo
* Ingo Molnar mingo@kernel.org wrote:
- Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 09, 2015 at 08:28:41AM +0200, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
Yeah, that's 3 pointer dereferences from cpu_base, iow you traded a single bit test on an already loaded word for 3 potential cacheline misses.
But the clock bases are not aligned to cachelines, and we have 4 of them. So in practice when we access one, we'll load the next one anyway.
Furthermore the simplification is measurable, and a fair bit of it is in various fast paths. I'd rather trade a bit of a cacheline footprint for less overall complexity and faster code.
Plus, look at this code in hrtimer_run_queues():
	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
		base = &cpu_base->clock_base[index];
		if (!base->active.next)
			continue;

		if (gettime) {
			hrtimer_get_softirq_time(cpu_base);
			gettime = 0;
		}
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
So I'd strongly argue in favor of this patch series of simplification: it makes the code simpler and faster, and won't impact cache footprint in practice.
Thanks,
Ingo
On Thu, 9 Apr 2015, Ingo Molnar wrote:
Plus, look at this code in hrtimer_run_queues():
	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
		base = &cpu_base->clock_base[index];
		if (!base->active.next)
			continue;

		if (gettime) {
			hrtimer_get_softirq_time(cpu_base);
			gettime = 0;
		}
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
This is the non-highres case, and we actually could optimize that case as well to avoid the cache line pollution.
Thanks,
tglx
On Thu, Apr 09, 2015 at 09:20:39AM +0200, Ingo Molnar wrote:
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
Urgh we should really _really_ kill that entire softirq mess.
All it needs is hrtimer_start*() returning -ENOTIME when it cannot queue the timer.

Of course, all that needs is auditing all hrtimer_start*() callsites, which is a lot of work.
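To make that concrete, a call site under such an API might look like this sketch (purely hypothetical; neither -ENOTIME nor this return convention exists in hrtimer_start()):

	/* hypothetical: hrtimer_start() refuses already-expired deadlines */
	ret = hrtimer_start(&t->timer, expires, HRTIMER_MODE_ABS);
	if (ret == -ENOTIME) {
		/* the deadline already passed: run the handler inline */
		t->timer.function(&t->timer);
	}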
On Thu, 9 Apr 2015, Peter Zijlstra wrote:
On Thu, Apr 09, 2015 at 09:20:39AM +0200, Ingo Molnar wrote:
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
Urgh we should really _really_ kill that entire softirq mess.
That's the !highres part. We cannot kill that one unless we remove all support for machines which do not provide hardware for highres support.
Now the softirq_time thing is an optimization which we added back in the days when hrtimer went into the tree and Roman complained about the base->get_time() invocation being overkill.
The reasoning behind this was that low resolution systems do not need accurate time for the expiry and the forwarding because everything happens tick aligned.
So for !HIGHRES we have:
static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
{
	return timer->base->softirq_time;
}
and for the HIGHRES case:
static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
{
	return timer->base->get_time();
}
Here are the usage sites of this:
drivers/power/reset/ltc2952-poweroff.c:	now = hrtimer_cb_get_time(timer);
kernel/sched/core.c:			now = hrtimer_cb_get_time(period_timer);
kernel/sched/deadline.c:		now = hrtimer_cb_get_time(&dl_se->dl_timer);
kernel/sched/fair.c:			now = hrtimer_cb_get_time(timer);
kernel/sched/rt.c:			now = hrtimer_cb_get_time(timer);
kernel/time/posix-timers.c:		ktime_t now = hrtimer_cb_get_time(timer);
sound/drivers/dummy.c:			dpcm->base_time = hrtimer_cb_get_time(&dpcm->timer);
sound/drivers/dummy.c:			delta = ktime_us_delta(hrtimer_cb_get_time(&dpcm->timer),
So the only place where this optimization might matter is the clock monotonic base. The few users of posix interval timers which use something other than CLOCK_MONO should not matter much.
I'd be happy to kill all of this and consolidate everything on the HIGHRES implementation.
Thanks,
tglx
On Thu, Apr 09, 2015 at 11:18:52AM +0200, Thomas Gleixner wrote:
On Thu, 9 Apr 2015, Peter Zijlstra wrote:
On Thu, Apr 09, 2015 at 09:20:39AM +0200, Ingo Molnar wrote:
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
Urgh we should really _really_ kill that entire softirq mess.
That's the !highres part. We cannot kill that one unless we remove all support for machines which do not provide hardware for highres support.
Oops, sorry I mixed up the two. Doesn't take away from the fact that I'd like to get rid of the highres softirq fallback path.
On Thu, 9 Apr 2015, Peter Zijlstra wrote:
On Thu, Apr 09, 2015 at 11:18:52AM +0200, Thomas Gleixner wrote:
On Thu, 9 Apr 2015, Peter Zijlstra wrote:
On Thu, Apr 09, 2015 at 09:20:39AM +0200, Ingo Molnar wrote:
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
Urgh we should really _really_ kill that entire softirq mess.
That's the !highres part. We cannot kill that one unless we remove all support for machines which do not provide hardware for highres support.
Oops, sorry I mixed up the two. Doesn't take away from the fact that I'd like to get rid of the highres softirq fallback path.
Indeed.
On 04/09/2015 02:48 PM, Thomas Gleixner wrote:
On Thu, 9 Apr 2015, Peter Zijlstra wrote:
On Thu, Apr 09, 2015 at 09:20:39AM +0200, Ingo Molnar wrote:
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
Urgh we should really _really_ kill that entire softirq mess.
That's the !highres part. We cannot kill that one unless we remove all support for machines which do not provide hardware for highres support.
Now the softirq_time thing is an optimization which we added back in the days when hrtimer went into the tree and Roman complained about the base->get_time() invocation being overkill.
The reasoning behind this was that low resolution systems do not need accurate time for the expiry and the forwarding because everything happens tick aligned.
So for !HIGHRES we have:
static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
{
	return timer->base->softirq_time;
}
Why is this called softirq_time when the hrtimer is being serviced in hard irq context?
Regards,
Preeti U Murthy
On Mon, 13 Apr 2015, Preeti U Murthy wrote:
On 04/09/2015 02:48 PM, Thomas Gleixner wrote:
On Thu, 9 Apr 2015, Peter Zijlstra wrote:
On Thu, Apr 09, 2015 at 09:20:39AM +0200, Ingo Molnar wrote:
if at least one base is active (on my fairly standard system all cpus have at least one active hrtimer base all the time - and many cpus have two bases active), then we run hrtimer_get_softirq_time(), which dirties the cachelines of all 4 clock bases:
	base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
	base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
	base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
	base->clock_base[HRTIMER_BASE_TAI].softirq_time = tai;
so in practice we not only touch every cacheline in every timer interrupt, but we _dirty_ them, even the inactive ones.
Urgh we should really _really_ kill that entire softirq mess.
That's the !highres part. We cannot kill that one unless we remove all support for machines which do not provide hardware for highres support.
Now the softirq_time thing is an optimization which we added back in the days when hrtimer went into the tree and Roman complained about the base->get_time() invocation being overkill.
The reasoning behind this was that low resolution systems do not need accurate time for the expiry and the forwarding because everything happens tick aligned.
So for !HIGHRES we have:
static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
{
	return timer->base->softirq_time;
}
Why is this called softirq_time when the hrtimer is being serviced in hard irq context?
For historical reasons.
Thanks,
tglx
On Thu, Apr 09, 2015 at 09:09:17AM +0200, Ingo Molnar wrote:
- Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 09, 2015 at 08:28:41AM +0200, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
Yeah, that's 3 pointer dereferences from cpu_base, iow you traded a single bit test on an already loaded word for 3 potential cacheline misses.
But the clock bases are not aligned to cachelines, and we have 4 of them. So in practice when we access one, we'll load the next one anyway.
$ pahole -C hrtimer_clock_base defconfig-build/kernel/time/timer.o
struct hrtimer_clock_base {
	struct hrtimer_cpu_base *  cpu_base;             /*     0     8 */
	int                        index;                /*     8     4 */
	clockid_t                  clockid;              /*    12     4 */
	struct timerqueue_head     active;               /*    16    16 */
	ktime_t                    resolution;           /*    32     8 */
	ktime_t                    (*get_time)(void);    /*    40     8 */
	ktime_t                    softirq_time;         /*    48     8 */
	ktime_t                    offset;               /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */

	/* size: 64, cachelines: 1, members: 8 */
};
They _should_ be aligned :-)
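Aligning them would be a one-attribute change along these lines (a sketch only; ____cacheline_aligned forces cacheline alignment of the type, trading per-cpu memory for isolation):

	/* hypothetical: give each clock base its own cacheline */
	struct hrtimer_clock_base {
		struct hrtimer_cpu_base	*cpu_base;
		int			index;
		clockid_t		clockid;
		struct timerqueue_head	active;
		ktime_t			resolution;
		ktime_t			(*get_time)(void);
		ktime_t			softirq_time;
		ktime_t			offset;
	} ____cacheline_aligned;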
Furthermore the simplification is measurable, and a fair bit of it is in various fast paths. I'd rather trade a bit of a cacheline footprint for less overall complexity and faster code.
cacheline misses hurt a lot, and the bitmask isn't really complex.
* Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 09, 2015 at 09:09:17AM +0200, Ingo Molnar wrote:
- Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 09, 2015 at 08:28:41AM +0200, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
Yeah, that's 3 pointer dereferences from cpu_base, iow you traded a single bit test on an already loaded word for 3 potential cacheline misses.
But the clock bases are not aligned to cachelines, and we have 4 of them. So in practice when we access one, we'll load the next one anyway.
$ pahole -C hrtimer_clock_base defconfig-build/kernel/time/timer.o
struct hrtimer_clock_base {
	struct hrtimer_cpu_base *  cpu_base;             /*     0     8 */
	int                        index;                /*     8     4 */
	clockid_t                  clockid;              /*    12     4 */
	struct timerqueue_head     active;               /*    16    16 */
	ktime_t                    resolution;           /*    32     8 */
	ktime_t                    (*get_time)(void);    /*    40     8 */
	ktime_t                    softirq_time;         /*    48     8 */
	ktime_t                    offset;               /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */

	/* size: 64, cachelines: 1, members: 8 */
};
They _should_ be aligned :-)
Maybe, but they aren't currently - and aligning them has costs as well.
Furthermore the simplification is measurable, and a fair bit of it is in various fast paths. I'd rather trade a bit of a cacheline footprint for less overall complexity and faster code.
cacheline misses hurt a lot, and the bitmask isn't really complex.
See my other mail: in practice we already dirty all of these cachelines in the hrtimer irq...
Thanks,
Ingo
On Thu, 9 Apr 2015, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
So I'd rather suggest taking a direct look at the head, instead of calculating bit positions, masks, etc.
Furthermore, we never actually use cpu_base->active_bases as a 'summary' value (which is the main point of bitmasks in general), so I'd remove that complication altogether.
This would speed up various hrtimer primitives like hrtimer_remove()/add and simplify the code. It would be a net code shrink as well.
Well. You trade a bit more code against touching cache lines to figure out whether the clock base has active timers or not. So for a lot of scenarios where only clock monotonic is used you touch 3 cache lines for nothing.
I'm about to send out a patch which actually makes better use of the active_bases field without creating a code size explosion.
Thanks,
tglx
* Thomas Gleixner tglx@linutronix.de wrote:
On Thu, 9 Apr 2015, Ingo Molnar wrote:
Btw., does cpu_base->active_bases even make sense? hrtimer bases are fundamentally percpu, and to check whether there are any pending timers is a very simple check:
base->active.next != NULL
So I'd rather suggest taking a direct look at the head, instead of calculating bit positions, masks, etc.
Furthermore, we never actually use cpu_base->active_bases as a 'summary' value (which is the main point of bitmasks in general), so I'd remove that complication altogether.
This would speed up various hrtimer primitives like hrtimer_remove()/add and simplify the code. It would be a net code shrink as well.
Well. You trade a bit more code against touching cache lines to figure out whether the clock base has active timers or not. So for a lot of scenarios where only clock monotonic is used you touch 3 cache lines for nothing.
In the (typical) case it will touch one extra cacheline - and it removes a fair bit of complexity plus 80 bytes of code (which itself touches two cachelines):
   7502     427       0    7929    1ef9 hrtimer.o.before
   7422     427       0    7849    1ea9 hrtimer.o.after
So even if we were to optimize for cache footprint (which isn't the only factor we optimize for), it looks like a win-win scenario to me, even if you ignore the speedup and the simpler code structure...
Ok?
I'm about to send out a patch which actually makes better use of the active_bases field without creating a code size explosion.
So please lets do this series first - it achieves the same thing, with less cache used and faster code.
Thanks,
Ingo