A delayed work to schedule vmstat_shepherd() is queued at periodic intervals for the internal working of the vmstat core. This work and its timer sometimes end up waking an idle CPU, as the work always stays on CPU0.

Because we re-queue the work from its handler, idle_cpu() returns false and so the timer (used by the delayed work) never migrates to any other CPU.

This may not always be the desired behavior, as waking up an idle CPU just to queue work on a few other CPUs isn't good from a power-consumption point of view.

In order to avoid waking up an idle core, we can replace schedule_delayed_work() with a normal work plus a separate timer. The timer handler will then queue the work after re-arming the timer. If the CPU was idle before the timer fired, idle_cpu() will mostly return true and the next timer will be migrated to a non-idle CPU.

But the timer core has a limitation: when a timer is re-armed from its own handler, the timer core disables migration of that timer to other cores. Details of that limitation are in the kernel/time/timer.c:__mod_timer() routine.
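For reference, this is the check in question, paraphrased from the __mod_timer() of that era (the same hunk shows up verbatim in a diff later in this thread):

	new_base = per_cpu(tvec_bases, cpu);

	if (base != new_base) {
		/*
		 * We are trying to schedule the timer on the local CPU.
		 * However we can't change timer's base while it is running,
		 * otherwise del_timer_sync() can't detect that the timer's
		 * handler yet has not finished. This also guarantees that
		 * the timer is serialized wrt itself.
		 */
		if (likely(base->running_timer != timer)) {
			/* See the comment in lock_timer_base() */
			timer_set_base(timer, NULL);
			spin_unlock(&base->lock);
			base = new_base;
			spin_lock(&base->lock);
			timer_set_base(timer, base);
		}
	}

A timer whose handler is still executing fails the running_timer check, so a timer re-armed from its own handler never moves off its current base.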
Another simple yet effective solution is to keep two timers with the same handler and keep toggling between them, so that the above limitation no longer applies.

This patch replaces schedule_delayed_work() with schedule_work() plus two timers. With this in place, the timer and its work were seen to migrate to other non-idle CPUs when the local CPU is idle.
Tested-by: Vinayak Menon <vinmenon@codeaurora.org>
Tested-by: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
This patch isn't sent to say it's the best way forward, but to get a discussion started on the same.
 mm/vmstat.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..d45e4243a046 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1424,8 +1424,18 @@ static bool need_update(int cpu)
  * inactivity.
  */
 static void vmstat_shepherd(struct work_struct *w);
+static DECLARE_WORK(shepherd, vmstat_shepherd);
 
-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+/*
+ * Two timers are used here to avoid waking up an idle CPU. If a single timer is
+ * kept, then re-arming the timer from its handler doesn't let us migrate it.
+ */
+static struct timer_list shepherd_timer[2];
+#define toggle_timer()	(shepherd_timer_index = !shepherd_timer_index,	\
+			 &shepherd_timer[shepherd_timer_index])
+
+static void vmstat_shepherd_timer(unsigned long data);
+static int shepherd_timer_index;
 
 static void vmstat_shepherd(struct work_struct *w)
 {
@@ -1441,15 +1451,19 @@ static void vmstat_shepherd(struct work_struct *w)
 				&per_cpu(vmstat_work, cpu), 0);
 
 	put_online_cpus();
+}
 
-	schedule_delayed_work(&shepherd,
-		round_jiffies_relative(sysctl_stat_interval));
+static void vmstat_shepherd_timer(unsigned long data)
+{
+	mod_timer(toggle_timer(),
+		  jiffies + round_jiffies_relative(sysctl_stat_interval));
+	schedule_work(&shepherd);
 }
 
 static void __init start_shepherd_timer(void)
 {
-	int cpu;
+	int cpu, i = -1;
 
 	for_each_possible_cpu(cpu)
 		INIT_DELAYED_WORK(per_cpu_ptr(&vmstat_work, cpu),
@@ -1459,8 +1473,13 @@ static void __init start_shepherd_timer(void)
 		BUG();
 	cpumask_copy(cpu_stat_off, cpu_online_mask);
 
-	schedule_delayed_work(&shepherd,
-		round_jiffies_relative(sysctl_stat_interval));
+	while (++i < 2) {
+		init_timer(&shepherd_timer[i]);
+		shepherd_timer[i].function = vmstat_shepherd_timer;
+	};
+
+	mod_timer(toggle_timer(),
+		  jiffies + round_jiffies_relative(sysctl_stat_interval));
 }
 
 static void vmstat_cpu_dead(int node)
On Thu, 26 Mar 2015 11:09:01 +0530 Viresh Kumar <viresh.kumar@linaro.org> wrote:

> A delayed work to schedule vmstat_shepherd() is queued at periodic intervals for the internal working of the vmstat core. This work and its timer sometimes end up waking an idle CPU, as the work always stays on CPU0.
> 
> Because we re-queue the work from its handler, idle_cpu() returns false and so the timer (used by the delayed work) never migrates to any other CPU.
> 
> This may not always be the desired behavior, as waking up an idle CPU just to queue work on a few other CPUs isn't good from a power-consumption point of view.
> 
> In order to avoid waking up an idle core, we can replace schedule_delayed_work() with a normal work plus a separate timer. The timer handler will then queue the work after re-arming the timer. If the CPU was idle before the timer fired, idle_cpu() will mostly return true and the next timer will be migrated to a non-idle CPU.
> 
> But the timer core has a limitation: when a timer is re-armed from its own handler, the timer core disables migration of that timer to other cores. Details of that limitation are in the kernel/time/timer.c:__mod_timer() routine.
> 
> Another simple yet effective solution is to keep two timers with the same handler and keep toggling between them, so that the above limitation no longer applies.
> 
> This patch replaces schedule_delayed_work() with schedule_work() plus two timers. With this in place, the timer and its work were seen to migrate to other non-idle CPUs when the local CPU is idle.
Shouldn't this be viewed as a shortcoming of the core timer code?
vmstat_shepherd() is merely rescheduling itself with schedule_delayed_work(). That's a dead bog simple operation and if it's producing suboptimal behaviour then we shouldn't be fixing it with elaborate workarounds in the caller?
On 27 March 2015 at 01:48, Andrew Morton <akpm@linux-foundation.org> wrote:
> Shouldn't this be viewed as a shortcoming of the core timer code?

Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, but they were rejected for obvious reasons [1].

> vmstat_shepherd() is merely rescheduling itself with schedule_delayed_work(). That's a dead bog simple operation and if it's producing suboptimal behaviour then we shouldn't be fixing it with elaborate workarounds in the caller?

I understand that, and that's why I sent it as an RFC to get the discussion started. Does anyone else have another (acceptable) idea to get this resolved?
-- viresh
[1] http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
On Fri, Mar 27, 2015 at 10:19:54AM +0530, Viresh Kumar wrote:
> On 27 March 2015 at 01:48, Andrew Morton <akpm@linux-foundation.org> wrote:
> > Shouldn't this be viewed as a shortcoming of the core timer code?
> 
> Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, but they were rejected for obvious reasons [1].
> 
> > vmstat_shepherd() is merely rescheduling itself with schedule_delayed_work(). That's a dead bog simple operation and if it's producing suboptimal behaviour then we shouldn't be fixing it with elaborate workarounds in the caller?
> 
> I understand that, and that's why I sent it as an RFC to get the discussion started. Does anyone else have another (acceptable) idea to get this resolved?
So the issue seems to be that we need base->running_timer in order to tell if a callback is running, right?
We could align the base on 8 bytes to gain an extra bit in the pointer and use that bit to indicate the running state. Then these sites can spin on that bit while we can change the actual base pointer.
Since the timer->base pointer is locked through the base->lock and hand-over is safe vs lock_timer_base, this should all work.
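To make the idea concrete, here is a minimal userspace sketch of such pointer tagging (toy names, not kernel code; it only demonstrates that 8-byte alignment frees three low bits of a pointer for flags):

	#include <assert.h>
	#include <stdio.h>

	#define RUNNING	0x4UL	/* third low bit, free once the struct is 8-byte aligned */

	struct toy_base {
		int dummy;
	} __attribute__((__aligned__(8)));

	int main(void)
	{
		static struct toy_base base;
		unsigned long p = (unsigned long)&base;

		assert((p & 7UL) == 0);	/* alignment guarantees three spare bits */
		p |= RUNNING;		/* mark "running" without a separate field */
		printf("running=%lu base=%p\n",
		       p & RUNNING, (void *)(p & ~7UL));
		return 0;
	}

Waiters would spin while the RUNNING bit is set, and the real base pointer is recovered by masking the flag bits off, exactly as the existing TIMER_DEFERRABLE/TIMER_IRQSAFE bits are handled.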
On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
> On Fri, Mar 27, 2015 at 10:19:54AM +0530, Viresh Kumar wrote:
> > On 27 March 2015 at 01:48, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > Shouldn't this be viewed as a shortcoming of the core timer code?
> > 
> > Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, but they were rejected for obvious reasons [1].
> > 
> > > vmstat_shepherd() is merely rescheduling itself with schedule_delayed_work(). That's a dead bog simple operation and if it's producing suboptimal behaviour then we shouldn't be fixing it with elaborate workarounds in the caller?
> > 
> > I understand that, and that's why I sent it as an RFC to get the discussion started. Does anyone else have another (acceptable) idea to get this resolved?
> 
> So the issue seems to be that we need base->running_timer in order to tell if a callback is running, right?
> 
> We could align the base on 8 bytes to gain an extra bit in the pointer and use that bit to indicate the running state. Then these sites can spin on that bit while we can change the actual base pointer.
Even though tvec_base has ____cacheline_aligned stuck on, most are allocated using kzalloc_node() which does not actually respect that but already guarantees a minimum u64 alignment, so I think we can use that third bit without too much magic.
On Fri, 27 Mar 2015, Peter Zijlstra wrote:
> We could align the base on 8 bytes to gain an extra bit in the pointer and use that bit to indicate the running state. Then these sites can spin on that bit while we can change the actual base pointer.
> 
> Even though tvec_base has ____cacheline_aligned stuck on, most are allocated using kzalloc_node() which does not actually respect that but already guarantees a minimum u64 alignment, so I think we can use that third bit without too much magic.
Create a new slab cache for this purpose that does the proper aligning?
On Fri, Mar 27, 2015 at 06:11:44AM -0500, Christoph Lameter wrote:
> On Fri, 27 Mar 2015, Peter Zijlstra wrote:
> > We could align the base on 8 bytes to gain an extra bit in the pointer and use that bit to indicate the running state. Then these sites can spin on that bit while we can change the actual base pointer.
> > 
> > Even though tvec_base has ____cacheline_aligned stuck on, most are allocated using kzalloc_node() which does not actually respect that but already guarantees a minimum u64 alignment, so I think we can use that third bit without too much magic.
> 
> Create a new slab cache for this purpose that does the proper aligning?
That is certainly a possibility, but we'll only ever allocate nr_cpus-1 entries from it, a whole new slab cache might be overkill.
What's not clear to me is why that thing is allocated at all, AFAICT something like:
static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
Should do the right thing and be much simpler.
On Fri, 27 Mar 2015, Peter Zijlstra wrote:
> On Fri, Mar 27, 2015 at 06:11:44AM -0500, Christoph Lameter wrote:
> > Create a new slab cache for this purpose that does the proper aligning?
> 
> That is certainly a possibility, but we'll only ever allocate nr_cpus-1 entries from it, a whole new slab cache might be overkill.
This will certainly be aliased to some other slab cache so not much overhead is created.
On 27 March 2015 at 17:32, Peter Zijlstra <peterz@infradead.org> wrote:
> What's not clear to me is why that thing is allocated at all, AFAICT something like:
> 
> static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
> 
> Should do the right thing and be much simpler.

Does this comment from timer.c answer your query?

	/*
	 * This is for the boot CPU - we use compile-time
	 * static initialisation because per-cpu memory isn't
	 * ready yet and because the memory allocators are not
	 * initialised either.
	 */
On Sat, Mar 28, 2015 at 09:58:38AM +0530, Viresh Kumar wrote:
> On 27 March 2015 at 17:32, Peter Zijlstra <peterz@infradead.org> wrote:
> > What's not clear to me is why that thing is allocated at all, AFAICT something like:
> > 
> > static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
> > 
> > Should do the right thing and be much simpler.
> 
> Does this comment from timer.c answer your query?
> 
> 	/*
> 	 * This is for the boot CPU - we use compile-time
> 	 * static initialisation because per-cpu memory isn't
> 	 * ready yet and because the memory allocators are not
> 	 * initialised either.
> 	 */
No. The reason is because __TIMER_INITIALIZER() needs to set ->base to a valid pointer and we cannot get a compile time pointer to per-cpu entries because we don't know where we'll map the section, even for the boot cpu.
This in turn means we need that boot_tvec_bases thing.
Now the reason we need to set ->base to a valid pointer is because we've made NULL special.
Arguably we could create another special pointer, but since that's init-time only, we'd be adding code for something that will 'never' be used -- more special cases.

It's all a bit of a bother, but it makes sense.
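For context, the static initializer of that era looked roughly like this (abridged from include/linux/timer.h; the slack and debug fields are omitted here). The point is that .base must be a link-time constant, which a per-cpu address is not:

	#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) {	\
			.entry = { .prev = TIMER_ENTRY_STATIC },		\
			.function = (_function),				\
			.expires = (_expires),					\
			.data = (_data),					\
			/* the flag bits ride in the low bits of ->base */	\
			.base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
		}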
On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
> > So the issue seems to be that we need base->running_timer in order to tell if a callback is running, right?
> > 
> > We could align the base on 8 bytes to gain an extra bit in the pointer and use that bit to indicate the running state. Then these sites can spin on that bit while we can change the actual base pointer.
> 
> Even though tvec_base has ____cacheline_aligned stuck on, most are allocated using kzalloc_node() which does not actually respect that but already guarantees a minimum u64 alignment, so I think we can use that third bit without too much magic.

Right. So what I tried earlier was very much similar to what you are suggesting. The only difference was that I hadn't made much of an attempt at saving memory.

But Thomas didn't like it for sure, and I believe that rant will hold true for what you are suggesting as well, won't it?
http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
On Sat, Mar 28, 2015 at 09:48:28AM +0530, Viresh Kumar wrote:
> On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
> > > So the issue seems to be that we need base->running_timer in order to tell if a callback is running, right?
> > > 
> > > We could align the base on 8 bytes to gain an extra bit in the pointer and use that bit to indicate the running state. Then these sites can spin on that bit while we can change the actual base pointer.
> > 
> > Even though tvec_base has ____cacheline_aligned stuck on, most are allocated using kzalloc_node() which does not actually respect that but already guarantees a minimum u64 alignment, so I think we can use that third bit without too much magic.
> 
> Right. So what I tried earlier was very much similar to what you are suggesting. The only difference was that I hadn't made much of an attempt at saving memory.
> 
> But Thomas didn't like it for sure, and I believe that rant will hold true for what you are suggesting as well, won't it?
> 
> http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
Well, for one your patch is indeed disgusting. But yes I'm aware Thomas wants to rewrite the timer thing. But Thomas is away for a little while and if this really needs to happen then it does.
Alternatively the thing hocko suggests is an utter fail too. You cannot stuff that into hardirq context, that's insane.
On 28 March 2015 at 15:23, Peter Zijlstra <peterz@infradead.org> wrote:
> Well, for one your patch is indeed disgusting.
Yeah, I agree :)
> But yes I'm aware Thomas wants to rewrite the timer thing. But Thomas is away for a little while and if this really needs to happen then it does.

Some time back I was trying to use another bit from the base pointer for marking a timer as PINNED:
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..e7184f57449c 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE	0x1LU
 #define TIMER_IRQSAFE		0x2LU
+#define TIMER_PINNED		0x4LU
 
-#define TIMER_FLAG_MASK	0x3LU
+#define TIMER_FLAG_MASK	0x7LU
And Fengguang's build-bot showed the problem (only) on blackfin [1].

config: make ARCH=blackfin allyesconfig

All error/warnings:

   kernel/timer.c: In function 'init_timers':
>> kernel/timer.c:1683:2: error: call to '__compiletime_assert_1683' declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct tvec_base) & TIMER_FLAG_MASK

So we probably need to make 'base' aligned to 8 bytes?
So, what you are suggesting is something like this (untested):
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..68bf09d69352 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE	0x1LU
 #define TIMER_IRQSAFE		0x2LU
+#define TIMER_RUNNING		0x4LU
 
-#define TIMER_FLAG_MASK	0x3LU
+#define TIMER_FLAG_MASK	0x7LU
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
 		.entry = { .prev = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..8f9efa64bd34 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -105,6 +105,21 @@ static inline unsigned int tbase_get_irqsafe(struct tvec_base *base)
 	return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
 }
 
+static inline unsigned int tbase_get_running(struct tvec_base *base)
+{
+	return ((unsigned int)(unsigned long)base & TIMER_RUNNING);
+}
+
+static inline unsigned int tbase_set_running(struct tvec_base *base)
+{
+	return ((unsigned int)(unsigned long)base | TIMER_RUNNING);
+}
+
+static inline unsigned int tbase_clear_running(struct tvec_base *base)
+{
+	return ((unsigned int)(unsigned long)base & ~TIMER_RUNNING);
+}
+
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
 	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
@@ -781,21 +796,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);
 
 	if (base != new_base) {
-		/*
-		 * We are trying to schedule the timer on the local CPU.
-		 * However we can't change timer's base while it is running,
-		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
-		 */
-		if (likely(base->running_timer != timer)) {
-			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
-			spin_unlock(&base->lock);
-			base = new_base;
-			spin_lock(&base->lock);
-			timer_set_base(timer, base);
-		}
+		/* See the comment in lock_timer_base() */
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
 	}
 
 	timer->expires = expires;
@@ -1016,7 +1022,7 @@ int try_to_del_timer_sync(struct timer_list *timer)
 
 	base = lock_timer_base(timer, &flags);
 
-	if (base->running_timer != timer) {
+	if (tbase_get_running(timer->base)) {
 		timer_stats_timer_clear_start_info(timer);
 		ret = detach_if_pending(timer, base, true);
 	}
@@ -1202,6 +1208,7 @@ static inline void __run_timers(struct tvec_base *base)
 			timer_stats_account_timer(timer);
 
 			base->running_timer = timer;
+			tbase_set_running(timer->base);
 			detach_expired_timer(timer, base);
 
 			if (irqsafe) {
@@ -1216,6 +1223,7 @@ static inline void __run_timers(struct tvec_base *base)
 		}
 	}
 	base->running_timer = NULL;
+	tbase_clear_running(timer->base);
 	spin_unlock_irq(&base->lock);
 }
------------x--------------------x----------------------
Right?
Now there are a few issues I see here (sorry if they are all imaginary):

- In case a timer re-arms itself from its handler and is migrated from CPU A to B, what happens if the re-armed timer fires before the first handler finishes? i.e., timer->fn() hasn't finished running on CPU A and it has fired again on CPU B. Wouldn't this expose us to a lot of other problems? It wouldn't be serialized with itself anymore?

- Because the timer has migrated to another CPU, the locking in __run_timers() needs to be fixed. And that will make it complicated..

- __run_timers() doesn't lock bases of other CPUs, and it would have to do that now..

- We probably need to take locks of both the local CPU and the one to which the timer migrated.

- It's possible now that there can be more than one running timer for a base, which wasn't true earlier. Not sure if it will break something.

Thanks for your continued support in replying to my (sometimes stupid) queries.
-- viresh
[1] https://lists.01.org/pipermail/kbuild-all/2014-April/003982.html
On 28 March 2015 at 17:27, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 28 March 2015 at 15:23, Peter Zijlstra <peterz@infradead.org> wrote:
> > Well, for one your patch is indeed disgusting.
> 
> Yeah, I agree :)
Sigh..
Sorry for the series of *nonsense* mails before the last one.
It's some Thunderbird *BUG* that did this: I was accessing my mail from both Gmail's interface and Thunderbird, and somehow this happened. I hit the send button only once.

Really sorry for the noise.

(The last mail has a few inquiries towards the end and a thanks note, just to identify it..)
On Sat, Mar 28, 2015 at 05:27:23PM +0530, Viresh Kumar wrote:
> So we probably need to make 'base' aligned to 8 bytes?
Yeah, something like the below (at the very end) should ensure the thing is cacheline aligned, that should give us a fair few bits.
> So, what you are suggesting is something like this (untested):
> 
> @@ -1202,6 +1208,7 @@ static inline void __run_timers(struct tvec_base *base)
>  			timer_stats_account_timer(timer);
> 
>  			base->running_timer = timer;
> +			tbase_set_running(timer->base);
>  			detach_expired_timer(timer, base);
> 
>  			if (irqsafe) {
> @@ -1216,6 +1223,7 @@ static inline void __run_timers(struct tvec_base *base)
>  		}
>  	}
>  	base->running_timer = NULL;
> +	tbase_clear_running(timer->base);
>  	spin_unlock_irq(&base->lock);
>  }
That's broken. You need to clear running on all the timers you set it on. Furthermore, you need to revalidate timer->base == base after call_timer_fn().
Something like so:
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..489ce182f8ec 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1213,6 +1213,21 @@ static inline void __run_timers(struct tvec_base *base)
 				call_timer_fn(timer, fn, data);
 				spin_lock_irq(&base->lock);
 			}
+
+			if (unlikely(timer->base != base)) {
+				unsigned long flags;
+				struct tvec_base *tbase;
+
+				spin_unlock(&base->lock);
+
+				tbase = lock_timer_base(timer, &flags);
+				tbase_clear_running(timer->base);
+				spin_unlock(&tbase->lock);
+
+				spin_lock(&base->lock);
+			} else {
+				tbase_clear_running(timer->base);
+			}
 		}
 	}
 	base->running_timer = NULL;
Also, once you have tbase_running, we can take base->running_timer out altogether.
> Now there are a few issues I see here (sorry if they are all imaginary):
> 
> - In case a timer re-arms itself from its handler and is migrated from CPU A to B, what happens if the re-armed timer fires before the first handler finishes? i.e., timer->fn() hasn't finished running on CPU A and it has fired again on CPU B. Wouldn't this expose us to a lot of other problems? It wouldn't be serialized with itself anymore?

What I said above.

> - Because the timer has migrated to another CPU, the locking in __run_timers() needs to be fixed. And that will make it complicated..

Hardly.

> - __run_timers() doesn't lock bases of other CPUs, and it would have to do that now..

Yep, but rarely.

> - We probably need to take locks of both the local CPU and the one to which the timer migrated.

Nope, or rather, not at the same time. That's what the NULL magic buys us.

> - It's possible now that there can be more than one running timer for a base, which wasn't true earlier. Not sure if it will break something.

Only if you messed it up real bad :-)
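For reference, the 'NULL magic' lives in lock_timer_base(), which in kernels of that era looks roughly like this:

	static struct tvec_base *lock_timer_base(struct timer_list *timer,
						 unsigned long *flags)
	{
		struct tvec_base *base;

		for (;;) {
			struct tvec_base *prelock_base = timer->base;

			base = tbase_get_base(prelock_base);
			if (likely(base != NULL)) {
				spin_lock_irqsave(&base->lock, *flags);
				if (likely(prelock_base == timer->base))
					return base;
				/* The timer has migrated to another CPU */
				spin_unlock_irqrestore(&base->lock, *flags);
			}
			/* base is NULL: a migration is in flight, spin */
			cpu_relax();
		}
	}

A migrating timer has its ->base set to NULL first, so anyone wanting the base lock spins until the hand-over completes; no two base locks ever need to be held at once.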
---
 kernel/time/timer.c | 36 ++++++++----------------------------
 1 file changed, 8 insertions(+), 28 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..c8c45bf50b2e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -93,6 +93,7 @@ struct tvec_base {
 struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -1534,46 +1535,25 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
 static int init_timers_cpu(int cpu)
 {
-	int j;
-	struct tvec_base *base;
+	struct tvec_base *base = per_cpu(tvec_bases, cpu);
 	static char tvec_base_done[NR_CPUS];
+	int j;
 
 	if (!tvec_base_done[cpu]) {
 		static char boot_done;
 
-		if (boot_done) {
-			/*
-			 * The APs use this path later in boot
-			 */
-			base = kzalloc_node(sizeof(*base), GFP_KERNEL,
-					    cpu_to_node(cpu));
-			if (!base)
-				return -ENOMEM;
-
-			/* Make sure tvec_base has TIMER_FLAG_MASK bits free */
-			if (WARN_ON(base != tbase_get_base(base))) {
-				kfree(base);
-				return -ENOMEM;
-			}
-			per_cpu(tvec_bases, cpu) = base;
+		if (!boot_done) {
+			boot_done = 1; /* skip the boot cpu */
 		} else {
-			/*
-			 * This is for the boot CPU - we use compile-time
-			 * static initialisation because per-cpu memory isn't
-			 * ready yet and because the memory allocators are not
-			 * initialised either.
-			 */
-			boot_done = 1;
-			base = &boot_tvec_bases;
+			base = per_cpu_ptr(&__tvec_bases, cpu);
+			per_cpu(tvec_bases, cpu) = base;
 		}
+
 		spin_lock_init(&base->lock);
 		tvec_base_done[cpu] = 1;
 		base->cpu = cpu;
-	} else {
-		base = per_cpu(tvec_bases, cpu);
 	}
 
-
 	for (j = 0; j < TVN_SIZE; j++) {
 		INIT_LIST_HEAD(base->tv5.vec + j);
 		INIT_LIST_HEAD(base->tv4.vec + j);
On Sat, Mar 28, 2015 at 02:44:57PM +0100, Peter Zijlstra wrote:
> > Now there are a few issues I see here (sorry if they are all imaginary):
> > 
> > - In case a timer re-arms itself from its handler and is migrated from CPU A to B, what happens if the re-armed timer fires before the first handler finishes? i.e., timer->fn() hasn't finished running on CPU A and it has fired again on CPU B. Wouldn't this expose us to a lot of other problems? It wouldn't be serialized with itself anymore?
> 
> What I said above.

What I didn't say, but had thought of, is that __run_timers() should skip any timer that has RUNNING set -- for obvious reasons :-)
On 29 March 2015 at 15:54, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Mar 28, 2015 at 02:44:57PM +0100, Peter Zijlstra wrote:
> > > Now there are a few issues I see here (sorry if they are all imaginary):
> > > 
> > > - In case a timer re-arms itself from its handler and is migrated from CPU A to B, what happens if the re-armed timer fires before the first handler finishes? i.e., timer->fn() hasn't finished running on CPU A and it has fired again on CPU B. Wouldn't this expose us to a lot of other problems? It wouldn't be serialized with itself anymore?
> > 
> > What I said above.
> 
> What I didn't say, but had thought of, is that __run_timers() should skip any timer that has RUNNING set -- for obvious reasons :-)

Below is copied from your first reply, so you probably already said that? :)

> Also, once you have tbase_running, we can take base->running_timer out altogether.

I wanted to clarify if I understood it correctly..

Are you saying that:

Case 1.) if we find tbase_running on cpuY (because the timer was re-armed from its handler on cpuX and got migrated to cpuY), then we should drop the timer from the list without calling its handler (as that is already running in parallel)?

Or

Case 2.) we keep retrying until the other handler finishes?

I have a few queries for both cases.

Case 1.) Will that be fair to the timer user, as the timer may get lost completely? If we skip the timer on cpuY here, it won't be enqueued again and so will be lost.

Case 2.) We keep waiting for the first handler to finish:
- cpuY may waste some cycles as it keeps waiting for the handler to finish on cpuX.
- We may need to perform base unlock/lock on cpuY, so that cpuX can take cpuY's lock to reset tbase_running. And that might be racy, not sure.
On Mon, Mar 30, 2015 at 05:32:16PM +0530, Viresh Kumar wrote:
> On 29 March 2015 at 15:54, Peter Zijlstra <peterz@infradead.org> wrote:
> > What I didn't say, but had thought of, is that __run_timers() should skip any timer that has RUNNING set -- for obvious reasons :-)
> 
> Below is copied from your first reply, so you probably already said that? :)
> 
> > Also, once you have tbase_running, we can take base->running_timer out altogether.

No, I meant something else by that. We can remove the tvec_base::running_timer field. Everything that uses it can use tbase_running() AFAICT.

> I wanted to clarify if I understood it correctly..
> 
> Are you saying that:
> 
> Case 2.) we keep retrying until the other handler finishes?
That.
We remove it from the list before we call ->fn(). Therefore, even if a migration happens, it will not see a RUNNING timer entry, seeing how it's not actually on any list.

The only way to get on a list while running is if ->fn() requeues itself _on_another_base_. When that happens, we need to wait for it to complete running.

> Case 2.) We keep waiting for the first handler to finish:
> 
> - cpuY may waste some cycles as it keeps waiting for the handler to finish on cpuX.

True, rather silly to requeue a timer on the same jiffy as it's already running through, but yes, an unlikely possibility.

You can run another timer while we wait -- if there is another of course.

> - We may need to perform base unlock/lock on cpuY, so that cpuX can take cpuY's lock to reset tbase_running. And that might be racy, not sure.

Drop yes, racy not so much I think.
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..1394f9540348 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1189,12 +1189,39 @@ static inline void __run_timers(struct tvec_base *base)
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
 		list_replace_init(base->tv1.vec + index, head);
+
+again:
 		while (!list_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
 			bool irqsafe;
 
-			timer = list_first_entry(head, struct timer_list,entry);
+			timer = list_first_entry(head, struct timer_list, entry);
+			if (unlikely(tbase_running(timer))) {
+				/* Only one timer on the list, force wait. */
+				if (unlikely(head->next == head->prev)) {
+					spin_unlock(&base->lock);
+
+					/*
+					 * The only way to get here is if the
+					 * handler requeued itself on another
+					 * base, this guarantees the timer will
+					 * not go away.
+					 */
+					while (tbase_running(timer))
+						cpu_relax();
+
+					spin_lock(&base->lock);
+				} else {
+					/*
+					 * Otherwise, rotate the list and try
+					 * someone else.
+					 */
+					list_move_tail(&timer->entry, head);
+				}
+				goto again;
+			}
+
 			fn = timer->function;
 			data = timer->data;
 			irqsafe = tbase_get_irqsafe(timer->base);
On 30 March 2015 at 18:17, Peter Zijlstra <peterz@infradead.org> wrote:
> No, I meant something else by that. We can remove the tvec_base::running_timer field. Everything that uses it can use tbase_running() AFAICT.
Okay, there is one instance which still needs it.
migrate_timers():
BUG_ON(old_base->running_timer);
What I wasn't sure about is whether we can drop this statement or not. If we decide not to drop it, then we can convert running_timer into a bool.

> Drop yes, racy not so much I think.
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 2d3f5c504939..1394f9540348 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1189,12 +1189,39 @@ static inline void __run_timers(struct tvec_base *base)
>  			cascade(base, &base->tv5, INDEX(3));
>  		++base->timer_jiffies;
>  		list_replace_init(base->tv1.vec + index, head);
> +
> +again:
>  		while (!list_empty(head)) {
>  			void (*fn)(unsigned long);
>  			unsigned long data;
>  			bool irqsafe;
> 
> -			timer = list_first_entry(head, struct timer_list,entry);
> +			timer = list_first_entry(head, struct timer_list, entry);
> +			if (unlikely(tbase_running(timer))) {
> +				/* Only one timer on the list, force wait. */
> +				if (unlikely(head->next == head->prev)) {
> +					spin_unlock(&base->lock);
> +
> +					/*
> +					 * The only way to get here is if the
> +					 * handler requeued itself on another
> +					 * base, this guarantees the timer will
> +					 * not go away.
> +					 */
> +					while (tbase_running(timer))
> +						cpu_relax();
> +
> +					spin_lock(&base->lock);
> +				} else {
> +					/*
> +					 * Otherwise, rotate the list and try
> +					 * someone else.
> +					 */
> +					list_move_tail(&timer->entry, head);
> +				}
> +				goto again;
> +			}
> +
>  			fn = timer->function;
>  			data = timer->data;
>  			irqsafe = tbase_get_irqsafe(timer->base);
Yeah, I have written something similar. I wasn't sure about what you wrote earlier. Thanks for the clarification.
On Mon, Mar 30, 2015 at 06:44:22PM +0530, Viresh Kumar wrote:
> On 30 March 2015 at 18:17, Peter Zijlstra <peterz@infradead.org> wrote:
> > No, I meant something else by that. We can remove the tvec_base::running_timer field. Everything that uses it can use tbase_running() AFAICT.
> 
> Okay, there is one instance which still needs it.
> 
> migrate_timers():
> 	BUG_ON(old_base->running_timer);
> 
> What I wasn't sure about is whether we can drop this statement or not. If we decide not to drop it, then we can convert running_timer into a bool.

Yeah, so that _should_ not trigger (obviously), and while I agree with the sentiment of sanity checks, I'm not sure it's worth keeping that variable around just for that.
Anyway, while I'm looking at struct tvec_base I notice the cpu member should be second after the lock, that'll save 8 bytes on the structure on 64bit machines.
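A toy userspace illustration of the packing effect being described (stand-in types, trimmed to four members; the real spinlock_t is 2 bytes here, per the pahole dumps below). Note the full tvec_base still rounds up to a whole cacheline, which is why the total size doesn't change in the measurement that follows:

	#include <stdio.h>

	struct before {			/* cpu at the end */
		short lock;		/* 2 bytes, then a 6-byte hole */
		void *running_timer;
		unsigned long timer_jiffies;
		int cpu;		/* trailing int pads back up to 8 */
	};

	struct after {			/* cpu second, right after the lock */
		short lock;		/* 2 bytes + only a 2-byte hole */
		int cpu;		/* fills most of the old hole */
		void *running_timer;
		unsigned long timer_jiffies;
	};

	int main(void)
	{
		/* Expect 32 vs 24 on LP64: the move saves 8 bytes of holes. */
		printf("before=%zu after=%zu\n",
		       sizeof(struct before), sizeof(struct after));
		return 0;
	}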
On 30 March 2015 at 19:29, Peter Zijlstra <peterz@infradead.org> wrote:
> Yeah, so that _should_ not trigger (obviously), and while I agree with the sentiment of sanity checks, I'm not sure it's worth keeping that variable around just for that.

I read that as: I can remove it then? :)

> Anyway, while I'm looking at struct tvec_base I notice the cpu member should be second after the lock, that'll save 8 bytes on the structure on 64bit machines.
Hmm, I tried it on my macbook-pro.
$ uname -a
Linux vireshk 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

config: arch/x86/configs/x86_64_defconfig
And all I get is 8256 bytes, with or without the change.
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..afc5d74678df 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -77,12 +77,12 @@ struct tvec_root {
 
 struct tvec_base {
 	spinlock_t lock;
+	int cpu;
 	struct timer_list *running_timer;
 	unsigned long timer_jiffies;
 	unsigned long next_timer;
 	unsigned long active_timers;
 	unsigned long all_timers;
-	int cpu;
 	struct tvec_root tv1;
 	struct tvec tv2;
 	struct tvec tv3;
On Mon, Mar 30, 2015 at 09:47:01PM +0530, Viresh Kumar wrote:
> And all I get is 8256 bytes, with or without the change.
Duh, rounded up to cacheline boundary ;-)
Trades two 4 byte holes at the start for a bigger 'hole' at the end.
struct tvec_base {
	spinlock_t                 lock;                 /*     0     2 */

	/* XXX 6 bytes hole, try to pack */

	struct timer_list *        running_timer;        /*     8     8 */
	long unsigned int          timer_jiffies;        /*    16     8 */
	long unsigned int          next_timer;           /*    24     8 */
	long unsigned int          active_timers;        /*    32     8 */
	long unsigned int          all_timers;           /*    40     8 */
	int                        cpu;                  /*    48     4 */

	/* XXX 4 bytes hole, try to pack */

	struct tvec_root           tv1;                  /*    56  4096 */
	/* --- cacheline 64 boundary (4096 bytes) was 56 bytes ago --- */
	struct tvec                tv2;                  /*  4152  1024 */
	/* --- cacheline 80 boundary (5120 bytes) was 56 bytes ago --- */
	struct tvec                tv3;                  /*  5176  1024 */
	/* --- cacheline 96 boundary (6144 bytes) was 56 bytes ago --- */
	struct tvec                tv4;                  /*  6200  1024 */
	/* --- cacheline 112 boundary (7168 bytes) was 56 bytes ago --- */
	struct tvec                tv5;                  /*  7224  1024 */
	/* --- cacheline 128 boundary (8192 bytes) was 56 bytes ago --- */

	/* size: 8256, cachelines: 129, members: 12 */
	/* sum members: 8238, holes: 2, sum holes: 10 */
	/* padding: 8 */
};

vs

struct tvec_base {
	spinlock_t                 lock;                 /*     0     2 */

	/* XXX 2 bytes hole, try to pack */

	int                        cpu;                  /*     4     4 */
	struct timer_list *        running_timer;        /*     8     8 */
	long unsigned int          timer_jiffies;        /*    16     8 */
	long unsigned int          next_timer;           /*    24     8 */
	long unsigned int          active_timers;        /*    32     8 */
	long unsigned int          all_timers;           /*    40     8 */
	struct tvec_root           tv1;                  /*    48  4096 */
	/* --- cacheline 64 boundary (4096 bytes) was 48 bytes ago --- */
	struct tvec                tv2;                  /*  4144  1024 */
	/* --- cacheline 80 boundary (5120 bytes) was 48 bytes ago --- */
	struct tvec                tv3;                  /*  5168  1024 */
	/* --- cacheline 96 boundary (6144 bytes) was 48 bytes ago --- */
	struct tvec                tv4;                  /*  6192  1024 */
	/* --- cacheline 112 boundary (7168 bytes) was 48 bytes ago --- */
	struct tvec                tv5;                  /*  7216  1024 */
	/* --- cacheline 128 boundary (8192 bytes) was 48 bytes ago --- */

	/* size: 8256, cachelines: 129, members: 12 */
	/* sum members: 8238, holes: 1, sum holes: 2 */
	/* padding: 16 */
};
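For reference, dumps like the two above come from pahole; with a debug-info build they can be reproduced with something like:

	pahole -C tvec_base vmlinux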
On 28 March 2015 at 19:14, Peter Zijlstra <peterz@infradead.org> wrote:
> Yeah, something like the below (at the very end) should ensure the thing is cacheline aligned, that should give us a fair few bits.
>  kernel/time/timer.c | 36 ++++++++----------------------------
>  1 file changed, 8 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 2d3f5c504939..c8c45bf50b2e 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -93,6 +93,7 @@ struct tvec_base {
>  struct tvec_base boot_tvec_bases;
>  EXPORT_SYMBOL(boot_tvec_bases);
>  static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
> +static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
> 
>  /* Functions below help us manage 'deferrable' flag */
>  static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
> @@ -1534,46 +1535,25 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);
> 
>  static int init_timers_cpu(int cpu)
>  {
> -	int j;
> -	struct tvec_base *base;
> +	struct tvec_base *base = per_cpu(tvec_bases, cpu);
>  	static char tvec_base_done[NR_CPUS];
> +	int j;
> 
>  	if (!tvec_base_done[cpu]) {
>  		static char boot_done;
> 
> -		if (boot_done) {
> -			/*
> -			 * The APs use this path later in boot
> -			 */
> -			base = kzalloc_node(sizeof(*base), GFP_KERNEL,
> -					    cpu_to_node(cpu));
> -			if (!base)
> -				return -ENOMEM;
> -
> -			/* Make sure tvec_base has TIMER_FLAG_MASK bits free */
> -			if (WARN_ON(base != tbase_get_base(base))) {
> -				kfree(base);
> -				return -ENOMEM;
> -			}
> -			per_cpu(tvec_bases, cpu) = base;
> +		if (!boot_done) {
> +			boot_done = 1; /* skip the boot cpu */
>  		} else {
> -			/*
> -			 * This is for the boot CPU - we use compile-time
> -			 * static initialisation because per-cpu memory isn't
> -			 * ready yet and because the memory allocators are not
> -			 * initialised either.
> -			 */
> -			boot_done = 1;
> -			base = &boot_tvec_bases;
> +			base = per_cpu_ptr(&__tvec_bases, cpu);
> +			per_cpu(tvec_bases, cpu) = base;
>  		}
> +
>  		spin_lock_init(&base->lock);
>  		tvec_base_done[cpu] = 1;
>  		base->cpu = cpu;
> -	} else {
> -		base = per_cpu(tvec_bases, cpu);
>  	}
> 
> -
>  	for (j = 0; j < TVN_SIZE; j++) {
>  		INIT_LIST_HEAD(base->tv5.vec + j);
>  		INIT_LIST_HEAD(base->tv4.vec + j);
Even after this, with the following diff, I get the same warning on blackfin:
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..58bc28d9cef2 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -68,7 +68,7 @@ extern struct tvec_base boot_tvec_bases;
 #define TIMER_DEFERRABLE	0x1LU
 #define TIMER_IRQSAFE		0x2LU
 
-#define TIMER_FLAG_MASK	0x3LU
+#define TIMER_FLAG_MASK	0x7LU
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
 		.entry = { .prev = TIMER_ENTRY_STATIC },	\
---------x--------------------x----------------------
Warning:
config: blackfin-allyesconfig (attached as .config)
reproduce:
  wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/ma... -O ~/bin/make.cross
  chmod +x ~/bin/make.cross
  git checkout ca713e393c6eceb54e803df204772a3d6e6c7981
  # save the attached .config to linux build tree
  make.cross ARCH=blackfin
All error/warnings:
kernel/time/timer.c: In function 'init_timers':
kernel/time/timer.c:1648:2: error: call to '__compiletime_assert_1648' declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct tvec_base) & TIMER_FLAG_MASK
On Sun, Mar 29, 2015 at 05:31:32PM +0530, Viresh Kumar wrote:
> Warning:
> 
> config: blackfin-allyesconfig (attached as .config)
> reproduce:
>   wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/ma... -O ~/bin/make.cross
>   chmod +x ~/bin/make.cross
>   git checkout ca713e393c6eceb54e803df204772a3d6e6c7981
>   # save the attached .config to linux build tree
>   make.cross ARCH=blackfin
> 
> All error/warnings:
> 
> kernel/time/timer.c: In function 'init_timers':
> kernel/time/timer.c:1648:2: error: call to '__compiletime_assert_1648' declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct tvec_base) & TIMER_FLAG_MASK
Ha, this is because blackfin is broken.
blackfin doesn't respect ____cacheline_aligned and NOPs it for UP builds. Why it thinks {__,}__cacheline_aligned semantics should differ between SMP/UP is a mystery to me; we have the *_in_smp primitives for that.
So just ignore this, let the blackfin people deal with it.
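For reference, the *_in_smp variants mentioned are defined along these lines in include/linux/cache.h (abridged), so only the _in_smp flavor is supposed to compile away on UP:

	#define ____cacheline_aligned	__attribute__((__aligned__(SMP_CACHE_BYTES)))

	#ifdef CONFIG_SMP
	#define ____cacheline_aligned_in_smp	____cacheline_aligned
	#else
	#define ____cacheline_aligned_in_smp
	#endif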
On Sat 28-03-15 10:53:22, Peter Zijlstra wrote:
[...]
> Alternatively the thing hocko suggests is an utter fail too. You cannot stuff that into hardirq context, that's insane.
I guess you are referring to http://article.gmane.org/gmane.linux.kernel.mm/127569, right?
Why can't we do something like refresh_cpu_vm_stats from the IRQ context? Especially the first zone stat part. The per-cpu pagesets are more costly and would need special treatment, alright. A simple way would be to splice the lists from the per-cpu context and then free those pages from the kthread context.
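A rough sketch of that split (hypothetical helper names; the list/field names follow the v3.x struct per_cpu_pages, the deferred list is an assumption for illustration, and a real version would have to preserve the migratetype, which is glossed over here):

	static LIST_HEAD(deferred_pcp_pages);
	static DEFINE_SPINLOCK(deferred_pcp_lock);

	/* IRQ context: only the cheap detach happens here. */
	static void splice_pcp_pages(struct per_cpu_pages *pcp)
	{
		unsigned long flags;
		int t;

		spin_lock_irqsave(&deferred_pcp_lock, flags);
		for (t = 0; t < MIGRATE_PCPTYPES; t++)
			list_splice_init(&pcp->lists[t], &deferred_pcp_pages);
		pcp->count = 0;
		spin_unlock_irqrestore(&deferred_pcp_lock, flags);
	}

	/* kthread context: the expensive freeing happens here. */
	static void free_deferred_pcp_pages(void)
	{
		struct page *page, *next;
		unsigned long flags;
		LIST_HEAD(tmp);

		spin_lock_irqsave(&deferred_pcp_lock, flags);
		list_splice_init(&deferred_pcp_pages, &tmp);
		spin_unlock_irqrestore(&deferred_pcp_lock, flags);

		list_for_each_entry_safe(page, next, &tmp, lru) {
			list_del(&page->lru);
			/* hand the page back to the buddy allocator here */
		}
	}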
I am still wondering why those two things were squashed into a single place. Why is kswapd not doing the pcp cleanup?
On Mon, Mar 30, 2015 at 05:08:18PM +0200, Michal Hocko wrote:
> On Sat 28-03-15 10:53:22, Peter Zijlstra wrote:
> [...]
> > Alternatively the thing hocko suggests is an utter fail too. You cannot stuff that into hardirq context, that's insane.
> 
> I guess you are referring to http://article.gmane.org/gmane.linux.kernel.mm/127569, right?
> 
> Why can't we do something like refresh_cpu_vm_stats from the IRQ context? Especially the first zone stat part.
Big machines have big zone counts. There are machines with >200 nodes. Although with the current trend of bigger nodes, the number of nodes seems to come down as well. Still.
> The per-cpu pagesets are more costly and would need special treatment, alright. A simple way would be to splice the lists from the per-cpu context and then free those pages from the kthread context.
> 
> I am still wondering why those two things were squashed into a single place. Why is kswapd not doing the pcp cleanup?
Probably because they could be. The problem with kswapd is that it's per node, not per cpu.
On Mon, 30 Mar 2015, Michal Hocko wrote:
> Why can't we do something like refresh_cpu_vm_stats from the IRQ context? Especially the first zone stat part. The per-cpu pagesets are more costly and would need special treatment, alright. A simple way would be to splice the lists from the per-cpu context and then free those pages from the kthread context.
That would work.
> I am still wondering why those two things were squashed into a single place. Why is kswapd not doing the pcp cleanup?
They were squashed together by me for convenience's sake. They could be separated. AFAICT the pcp cleanup could be done only on demand, and we already have logic for that when flushing via IPI.
On Thu 26-03-15 11:09:01, Viresh Kumar wrote:
> A delayed work to schedule vmstat_shepherd() is queued at periodic intervals for the internal working of the vmstat core. This work and its timer sometimes end up waking an idle CPU, as the work always stays on CPU0.
> 
> Because we re-queue the work from its handler, idle_cpu() returns false and so the timer (used by the delayed work) never migrates to any other CPU.
> 
> This may not always be the desired behavior, as waking up an idle CPU just to queue work on a few other CPUs isn't good from a power-consumption point of view.
Wouldn't something like what I was suggesting a few months back (http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this problem as well? The scheduler should be idle-aware, no? I mean it shouldn't wake up an idle CPU if the task might run on another one.
On 27 March 2015 at 19:49, Michal Hocko <mhocko@suse.cz> wrote:
> Wouldn't something like what I was suggesting a few months back (http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this problem as well? The scheduler should be idle-aware, no? I mean it shouldn't wake up an idle CPU if the task might run on another one.
Probably, yes. Let's see what others have to say about it..