Hi All,
This is a resend of V2 of my sched_select_cpu() work. Resending because V2 didn't get much attention. Including more people in cc now :)
In order to save power, it would be useful to schedule work onto non-IDLE cpus instead of waking up an IDLE one.
To achieve this, we need the scheduler to guide kernel frameworks (like timers & workqueues) towards the most preferred CPU to be used for these tasks.
This patchset is about implementing this concept.
- The first patch adds the sched_select_cpu() routine, which returns the preferred non-idle cpu.
- The second patch removes idle_cpu() calls from timer & hrtimer.
- The third patch adapts the workqueue framework to this change.
- The fourth patch adds migration capability for a running timer.
Earlier discussions over v1 can be found here: http://www.mail-archive.com/linaro-dev@lists.linaro.org/msg13342.html
Earlier discussions over this concept were done at last LPC: http://summit.linuxplumbersconf.org/lpc-2012/meeting/90/lpc2012-sched-timer-...
Module created for testing this behavior is present here: http://git.linaro.org/gitweb?p=people/vireshk/module.git%3Ba=summary
Following are the steps followed in the test module:
1. Run a single work on each cpu.
2. This work starts a timer after x (tested with 10) jiffies of delay.
3. The timer routine queues a work (this may be called from an idle or non-idle cpu) and starts the same timer again. Step 3 is repeated n times (i.e. queuing n works, one after the other).
4. All works call a single routine, which counts the following per cpu:
   - Total works processed by a CPU
   - Total works processed by a CPU which were queued from it
   - Total works processed by a CPU which weren't queued from it
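For illustration, here is a minimal sketch of what the counting part of such a test work handler could look like. All names here (wq_test_work, wq_test_stats, wq_test_work_fn) are made up for this sketch and are not taken from the actual module:

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

struct wq_test_work {
	struct work_struct work;
	int queuing_cpu;		/* cpu that queued this work */
};

struct wq_test_stats {
	unsigned long total;		/* works processed by this CPU */
	unsigned long own;		/* of those, queued from this CPU */
	unsigned long migrated;		/* of those, queued from another CPU */
};
static DEFINE_PER_CPU(struct wq_test_stats, wq_test_stats);

static void wq_test_work_fn(struct work_struct *work)
{
	struct wq_test_work *tw = container_of(work, struct wq_test_work, work);
	struct wq_test_stats *stats = &get_cpu_var(wq_test_stats);

	stats->total++;
	if (tw->queuing_cpu == smp_processor_id())
		stats->own++;
	else
		stats->migrated++;
	put_cpu_var(wq_test_stats);
}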
Setup:
------
- ARM Vexpress TC2 - big.LITTLE CPU
- Core 0-1: A15, 2-4: A7
- rootfs: linaro-ubuntu-nano
Results:
--------
Without Workqueue Modification, i.e. PATCH 3/3:
[ 2493.022335] Workqueue Analyser: works processsed by CPU0, Total: 1000, Own: 0, migrated: 0
[ 2493.047789] Workqueue Analyser: works processsed by CPU1, Total: 1000, Own: 0, migrated: 0
[ 2493.072918] Workqueue Analyser: works processsed by CPU2, Total: 1000, Own: 0, migrated: 0
[ 2493.098576] Workqueue Analyser: works processsed by CPU3, Total: 1000, Own: 0, migrated: 0
[ 2493.123702] Workqueue Analyser: works processsed by CPU4, Total: 1000, Own: 0, migrated: 0
With Workqueue Modification, i.e. PATCH 3/3:
[ 2493.022335] Workqueue Analyser: works processsed by CPU0, Total: 1002, Own: 999, migrated: 3
[ 2493.047789] Workqueue Analyser: works processsed by CPU1, Total: 998, Own: 997, migrated: 1
[ 2493.072918] Workqueue Analyser: works processsed by CPU2, Total: 1013, Own: 996, migrated: 17
[ 2493.098576] Workqueue Analyser: works processsed by CPU3, Total: 998, Own: 993, migrated: 5
[ 2493.123702] Workqueue Analyser: works processsed by CPU4, Total: 989, Own: 987, migrated: 2
V2->V2-Resend
-------------
- Included timer migration patch in the same thread.
V1->V2
------
- New SD_* macros removed now and earlier ones used.
- sched_select_cpu() rewritten; it now includes the check on the current cpu's idleness.
- idle_cpu() calls from timer and hrtimer removed now.
- Patch 2/3 from V1 removed, as it doesn't apply to the latest workqueue branch from tejun.
- CONFIG_MIGRATE_WQ removed and so is wq_select_cpu().
- sched_select_cpu() called only from __queue_work().
- Got tejun/for-3.7 branch in my tree before making workqueue changes.
Viresh Kumar (4):
  sched: Create sched_select_cpu() to give preferred CPU for power saving
  timer: hrtimer: Don't check idle_cpu() before calling get_nohz_timer_target()
  workqueue: Schedule work on non-idle cpu instead of current one
  timer: Migrate running timer
 include/linux/sched.h | 16 ++++++++++--
 include/linux/timer.h |  2 ++
 kernel/hrtimer.c      |  2 +-
 kernel/sched/core.c   | 69 +++++++++++++++++++++++++++++++--------------------
 kernel/timer.c        | 50 ++++++++++++++++++++++---------------
 kernel/workqueue.c    |  4 +--
 6 files changed, 91 insertions(+), 52 deletions(-)
In order to save power, it would be useful to schedule work onto non-IDLE cpus instead of waking up an IDLE one.
To achieve this, we need the scheduler to guide kernel frameworks (like timers & workqueues) towards the most preferred CPU to be used for these tasks.
This routine returns the preferred non-idle cpu. It accepts a bitwise OR of the SD_* flags present in linux/sched.h. If the local CPU isn't idle, it is returned. If it is idle, then we look for another CPU in a sched domain that has all the passed flags set. Also, as this activity is part of load balancing only, SD_LOAD_BALANCE must also be set for the selected domain.
This patch reuses the code from the get_nohz_timer_target() routine, which had a similar implementation. get_nohz_timer_target() is also modified to use sched_select_cpu() now.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
---
 include/linux/sched.h | 16 ++++++++++--
 kernel/sched/core.c   | 69 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 56 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..24f546d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -232,14 +232,26 @@ extern void init_idle_bootup_task(struct task_struct *idle);
extern int runqueue_is_locked(int cpu);
-#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
+#ifdef CONFIG_SMP
+extern int sched_select_cpu(unsigned int sd_flags);
+
+#ifdef CONFIG_NO_HZ
 extern void nohz_balance_enter_idle(int cpu);
 extern void set_cpu_sd_state_idle(void);
-extern int get_nohz_timer_target(void);
+/*
+ * In the semi idle case, use the nearest busy cpu for migrating timers
+ * from an idle cpu. This is good for power-savings.
+ *
+ * We don't do similar optimization for completely idle system, as
+ * selecting an idle cpu will add more delays to the timers than intended
+ * (as that cpu's timer base may not be uptodate wrt jiffies etc).
+ */
+#define get_nohz_timer_target() sched_select_cpu(0)
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 static inline void set_cpu_sd_state_idle(void) { }
 #endif
+#endif /* CONFIG_SMP */
 /*
  * Only dump TASK_* tasks. (0 for all tasks)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..cf1a420 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -542,33 +542,6 @@ void resched_cpu(int cpu)
 #ifdef CONFIG_NO_HZ
 /*
- * In the semi idle case, use the nearest busy cpu for migrating timers
- * from an idle cpu. This is good for power-savings.
- *
- * We don't do similar optimization for completely idle system, as
- * selecting an idle cpu will add more delays to the timers than intended
- * (as that cpu's timer base may not be uptodate wrt jiffies etc).
- */
-int get_nohz_timer_target(void)
-{
-	int cpu = smp_processor_id();
-	int i;
-	struct sched_domain *sd;
-
-	rcu_read_lock();
-	for_each_domain(cpu, sd) {
-		for_each_cpu(i, sched_domain_span(sd)) {
-			if (!idle_cpu(i)) {
-				cpu = i;
-				goto unlock;
-			}
-		}
-	}
-unlock:
-	rcu_read_unlock();
-	return cpu;
-}
-/*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
  * which is scheduled to wake up that CPU. In case of a completely
@@ -639,6 +612,48 @@ void sched_avg_update(struct rq *rq)
 	}
 }
+/*
+ * This routine returns the preferred cpu which is non-idle. It accepts a
+ * bitwise OR of SD_* flags present in linux/sched.h. If the local CPU isn't
+ * idle, it is returned back. If it is idle, then we must look for another CPU
+ * which have all the flags passed as argument as set. Also, as this activity is
+ * part of load balancing only, SD_LOAD_BALANCE must also be set for selected
+ * domain.
+ */
+int sched_select_cpu(unsigned int sd_flags)
+{
+	struct sched_domain *sd;
+	int cpu = smp_processor_id();
+	int i;
+
+	/* If Current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
+
+	/* Add SD_LOAD_BALANCE to flags */
+	sd_flags |= SD_LOAD_BALANCE;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		/*
+		 * If sd doesn't have both sd_flags and SD_LOAD_BALANCE set,
+		 * skip sd.
+		 */
+		if ((sd->flags & sd_flags) != sd_flags)
+			continue;
+
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (!idle_cpu(i)) {
+				cpu = i;
+				goto unlock;
+			}
+		}
+	}
+unlock:
+	rcu_read_unlock();
+	return cpu;
+}
+
 #else /* !CONFIG_SMP */
 void resched_task(struct task_struct *p)
 {
The check for the current cpu's idleness is already done in the implementation of sched_select_cpu(), which is called by get_nohz_timer_target(). So there is no need to call idle_cpu() twice: once from sched_select_cpu() and once from timer and hrtimer before calling get_nohz_timer_target().
This patch removes calls to idle_cpu() from timer and hrtimer.
This also avoids an extra call to smp_processor_id() when get_nohz_timer_target() is called.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
---
 kernel/hrtimer.c | 2 +-
 kernel/timer.c   | 8 +++++---
 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..74bdaf6 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -159,7 +159,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+	if (!pinned && get_sysctl_timer_migration())
 		return get_nohz_timer_target();
 #endif
 	return this_cpu;
diff --git a/kernel/timer.c b/kernel/timer.c
index 367d008..1170ece 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -735,11 +735,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
debug_activate(timer, expires);
-	cpu = smp_processor_id();
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+	if (!pinned && get_sysctl_timer_migration())
 		cpu = get_nohz_timer_target();
+	else
+		cpu = smp_processor_id();
+#else
+	cpu = smp_processor_id();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
Workqueues queue work on the current cpu if the caller hasn't passed a preferred cpu. This may wake up an idle CPU, which isn't actually required.

Such work can be processed by any CPU, so we should select a non-idle CPU here. This patch adds support in the workqueue framework to get the preferred CPU from the scheduler instead of using the current CPU.
Most of the time when a work is queued, the current cpu isn't idle, so we will still choose it. There are cases, though, where a cpu is idle when it queues some work. For example, consider the following scenario:
- A cpu has programmed a timer and is IDLE now.
- The CPU gets into an interrupt handler due to the timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
---
 kernel/workqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 042d221..d32efa2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		struct global_cwq *last_gcwq;
 		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
+			cpu = sched_select_cpu(0);
 		/*
 		 * It's multi cpu. If @work was previously on a different
@@ -1383,7 +1383,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 		if (gcwq)
 			lcpu = gcwq->cpu;
 		if (lcpu == WORK_CPU_UNBOUND)
-			lcpu = raw_smp_processor_id();
+			lcpu = sched_select_cpu(0);
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
 	}
Hello, Viresh.
On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
Workqueues queues work on current cpu, if the caller haven't passed a preferred cpu. This may wake up an idle CPU, which is actually not required.
This work can be processed by any CPU and so we must select a non-idle CPU here. This patch adds in support in workqueue framework to get preferred CPU details from the scheduler, instead of using current CPU.
Most of the time when a work is queued, the current cpu isn't idle and so we will choose it only. There are cases when a cpu is idle when it queues some work. For example, consider following scenario:
- A cpu has programmed a timer and is IDLE now.
- CPU gets into interrupt handler due to timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there. Given the seemingly low rate of migration and subtlety of such bugs, things could get nasty to debug.
That said, if the obtained benefit is big enough, sure, we can definitely change the behavior, which isn't all that great to begin with, and try to shake out the bugs quickly by e.g. forcing all work items to execute on different CPUs, but you're presenting a few percent of work items being migrated to a different CPU from an already active CPU, which doesn't seem like such a big benefit to me even if the migration target CPU is somewhat more efficient. How much powersaving are we talking about?
Thanks.
Hi Tejun,
On 26 November 2012 22:45, Tejun Heo tj@kernel.org wrote:
On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there.
My bad. I completely missed/forgot that comment from your earlier mails. Will do it.
That said, if the obtained benefit is big enough, sure, we can definitely change the behavior, which isn't all that great to begin with, and try to shake out the bugs quicky by e.g. forcing all work items to execute on different CPUs, but you're presenting a few percent of work items being migrated to a different CPU from an already active CPU, which doesn't seem like such a big benefit to me even if the migration target CPU is somewhat more efficient. How much powersaving are we talking about?
Hmm.. I actually implemented the problem discussed here: (I know you have seen this earlier :) )
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sch...
Specifically slides: 12 & 19.
I haven't done many power calculations with it and have tested it more from a functionality point of view.
@Vincent: Can you add some comments here?
-- viresh
On 27 November 2012 06:19, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Tejun,
On 26 November 2012 22:45, Tejun Heo tj@kernel.org wrote:
On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there.
My bad. I completely missed/forgot that comment from your earlier mails. Will do it.
That said, if the obtained benefit is big enough, sure, we can definitely change the behavior, which isn't all that great to begin with, and try to shake out the bugs quicky by e.g. forcing all work items to execute on different CPUs, but you're presenting a few percent of work items being migrated to a different CPU from an already active CPU, which doesn't seem like such a big benefit to me even if the migration target CPU is somewhat more efficient. How much powersaving are we talking about?
Hmm.. I actually implemented the problem discussed here: (I know you have seen this earlier :) )
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sch...
Specifically slides: 12 & 19.
I haven't done much power calculations with it and have tested it more from functionality point of view.
@Vincent: Can you add some comments here?
Sorry for this late reply.
We have faced some situations on TC2 (as an example) where the tasks are running in the LITTLE cluster whereas some periodic works stay on the big cluster, so we can have one cluster that wakes up for tasks and another one that wakes up for works. We would like to consolidate the behaviour of the works with the tasks' behaviour.
Sorry, I don't have relevant figures as the patches are used with other ones which also impact the power consumption.
This series introduces the possibility of running a work on another CPU, which is necessary if we want a better correlation of task and work scheduling on the system. Most of the time queue_work is used when a driver doesn't mind which CPU the work runs on, whereas it looks like it should be used only if you want to run locally. We would like to solve this point with the new interface proposed by Viresh.
Vincent
-- viresh
Hi Tejun,
On 27 November 2012 10:49, Viresh Kumar viresh.kumar@linaro.org wrote:
On 26 November 2012 22:45, Tejun Heo tj@kernel.org wrote:
On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there.
My bad. I completely missed/forgot that comment from your earlier mails. Will do it.
And this is the first thing I did this year :)
So there are almost 1200 files which use queue_work, queue_delayed_work, schedule_work, schedule_delayed_work or schedule_on_each_cpu.

Obviously I can't try to understand all of them :), so I searched for two strings in them: "cpu" or "processor_id". This would catch every one of these 1200-odd files which uses some variable/comment/code with cpu or smp_processor_id or raw_processor_id, etc.

I got around 650 files with these two searches. Then I went through these files to see if there is anything noticeable which may break due to this patch...
I got a list of files where the cpu/processor_id strings are found, and which may break with this patch (still can't guarantee it, as I don't have knowledge of these drivers)...
- arch/powerpc/mm/numa.c
- arch/s390/appldata/appldata_base.c
- arch/s390/kernel/time.c
- arch/s390/kernel/topology.c
- arch/s390/oprofile/hwsampler.c
- arch/x86/kernel/cpu/mcheck/mce.c
- arch/x86/kernel/tsc.c
- arch/x86/kvm/x86.c
- block/blk-core.c
- block/blk-throttle.c
- block/blk-genhd.c
- crypto/cryptd.c
- drivers/base/power/domain.c
- drivers/cpufreq/cpufreq.c
- drivers/hv/vmbus_drv.c
- drivers/infiniband/hw/qib/qib_iba7322.c and drivers/infiniband/hw/qib/qib_init.c
- drivers/md/dm-crypt.c
- drivers/md/md.c
- drivers/net/ethernet/sfc/efx.c
- drivers/net/ethernet/tile/tilepro.c
- drivers/net/team/team_mode_loadbalance.c
- drivers/oprofile/cpu_buffer.c
- drivers/s390/kvm/kvm_virtio.c
- drivers/scsi/fcoe/fcoe.c
- drivers/tty/sysrq.c
- drivers/xen/pcpu.c
- fs/file.c, file_table.c, fs/fscache/object.c
- include/linux/stop_machine.h, kernel/stop_machine.c
- mm/percpu.c
- mm/slab.c
- mm/vmstat.c
- net/core/flow.c
- net/iucv/iucv.c
I am not sure what to do now :) , can you assist ?
Hello, Viresh.
On Fri, Jan 04, 2013 at 04:41:47PM +0530, Viresh Kumar wrote:
I got a list of files where cpu/processor_id strings are found, which may break with this patch (still can't guarantee as i don't have knowledge of these drivers)...
...
I am not sure what to do now :) , can you assist ?
I don't know either. Changing behavior subtly like this is hard. I usually try to spot some problem cases and try to identify patterns there. Once you identify a few of them, understanding and detecting other problem cases get a lot easier. In this case, maybe there are too many places to audit and the problems are too subtle, and, if we *have* to do it, the only thing we can do is implementing a debug option to make such problems more visible - say, always schedule to a different cpu on queue_work().
That said, at this point, the patchset doesn't seem all that convincing to me and if I'm comprehending responses from others correctly that seems to be the consensus. It might be a better approach to identify the specific offending workqueue users and make them handle the situation more intelligently than trying to impose the behavior on all workqueue users. At any rate, we need way more data showing this actually helps and if so why.
Thanks.
Hi Tejun,
On 4 January 2013 20:39, Tejun Heo tj@kernel.org wrote:
I don't know either. Changing behavior subtly like this is hard. I usually try to spot some problem cases and try to identify patterns there. Once you identify a few of them, understanding and detecting other problem cases get a lot easier. In this case, maybe there are too many places to audit and the problems are too subtle, and, if we *have* to do it, the only thing we can do is implementing a debug option to make such problems more visible - say, always schedule to a different cpu on queue_work().
That said, at this point, the patchset doesn't seem all that convincing to me and if I'm comprehending responses from others correctly that seems to be the consensus. It might be a better approach to identify the specific offending workqueue users and make them handle the situation more intelligently than trying to impose the behavior on all workqueue users. At any rate, we need way more data showing this actually helps and if so why.
I understand your concerns and believe me, even I feel the same :) I had another idea that I wanted to share.
Firstly the root cause of this patchset.
Some others at Linaro and I are working on future ARM cores: big.LITTLE systems. Here we have a few very powerful, high-power-consuming cores (big, currently A15s) and a few very efficient ones (LITTLE, currently A7s).

The ultimate goal is to save as much power as possible without compromising much on performance. For that, we want most of the stuff to run on LITTLE cores and only the performance-demanding stuff on big cores. There are multiple things going on in this direction. Now, we thought A15s or big cores shouldn't be used for running small tasks like timers/workqueues, and hence this patch is an attempt towards reaching that goal.

On top of that, we can do some load balancing of works across the multiple alive cpus, so that the work gets done quickly. Also, we shouldn't start using an idle cpu just for processing work :)
I have another idea that we can try:
queue_work_on_any_cpu().
With this we would not break any existing code and can try to migrate old users to this new infrastructure (at least the ones which re-arm works from their work handlers). What do you say?

To take care of the cache locality issue, we can pass an argument to this routine that can provide:
- the mask of cpus to schedule this work on, OR
- the sched level (SD_LEVEL) of cpus to run it on.
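As a purely illustrative sketch (not part of this series), the simplest form of the proposed helper could be little more than a wrapper that asks the scheduler for a cpu, assuming the sched_select_cpu() from patch 1/4 and the existing queue_work_on():

#include <linux/workqueue.h>

extern int sched_select_cpu(unsigned int sd_flags);

static inline bool queue_work_on_any_cpu(struct workqueue_struct *wq,
					 struct work_struct *work)
{
	/* let the scheduler pick a non-idle cpu if this one is idle */
	return queue_work_on(sched_select_cpu(0), wq, work);
}

Callers that don't care where their work runs would then opt in explicitly, while plain queue_work() keeps its current local-cpu behaviour.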
Waiting for your view on it :)
-- viresh
On Mon, 2013-01-07 at 15:28 +0530, Viresh Kumar wrote:
Hi Tejun,
On 4 January 2013 20:39, Tejun Heo tj@kernel.org wrote:
I don't know either. Changing behavior subtly like this is hard. I usually try to spot some problem cases and try to identify patterns there. Once you identify a few of them, understanding and detecting other problem cases get a lot easier. In this case, maybe there are too many places to audit and the problems are too subtle, and, if we *have* to do it, the only thing we can do is implementing a debug option to make such problems more visible - say, always schedule to a different cpu on queue_work().
That said, at this point, the patchset doesn't seem all that convincing to me and if I'm comprehending responses from others correctly that seems to be the consensus. It might be a better approach to identify the specific offending workqueue users and make them handle the situation more intelligently than trying to impose the behavior on all workqueue users. At any rate, we need way more data showing this actually helps and if so why.
I understand your concerns and believe me, even i feel the same :) I had another idea, that i wanted to share.
Firstly the root cause of this patchset.
Myself and some others in Linaro are working on ARM future cores: big.LITTLE systems. Here we have few very powerful, high power consuming cores (big, currently A15's) and few very efficient ones (LITTLE, currently A7's).
The ultimate goal is to save as much power as possible without compromising much with performance. For, that we wanted most of the stuff to run on LITTLE cores and some performance-demanding stuff on big Cores. There are multiple things going around in this direction. Now, we thought A15's or big cores shouldn't be used for running small tasks like timers/workqueues and hence this patch is an attempt towards reaching that goal.
Over that we can do some load balancing of works within multiple alive cpus, so that it can get done quickly. Also, we shouldn't start using an idle cpu just for processing work :)
I have another idea that we can try:
queue_work_on_any_cpu().
I think this is a good idea.
With this we would not break any existing code and can try to migrate old users to this new infrastructure (atleast the ones which are rearming works from their work_handlers). What do you say?
To take care of the cache locality issue, we can pass an argument to this routine, that can provide
- the mask of cpus to schedule this work on OR
- Sched Level (SD_LEVEL) of cpus to run it.
I wouldn't give a mask. If one is needed, we could have a queue_work_on_mask_cpus(), or something. I think the "any" in the name should be good enough to let developers know that this will not be on the CPU that is called. By default, I would suggest for cache locality, that we try to keep it on the same CPU. But if there's a better CPU to run on, it runs there. Also, we could still add a debug option that makes it always run on other CPUs to slap developers that don't read.
-- Steve
Waiting for your view on it :)
-- viresh
On 7 January 2013 18:58, Steven Rostedt rostedt@goodmis.org wrote:
On Mon, 2013-01-07 at 15:28 +0530, Viresh Kumar wrote:
I have another idea that we can try:
queue_work_on_any_cpu().
I think this is a good idea.
:) :)
- the mask of cpus to schedule this work on OR
- Sched Level (SD_LEVEL) of cpus to run it.
I wouldn't give a mask. If one is needed, we could have a queue_work_on_mask_cpus(), or something. I think the "any" in the name should be good enough to let developers know that this will not be on the CPU that is called.
Fair Enough.
By default, I would suggest for cache locality, that we try to keep it on the same CPU. But if there's a better CPU to run on, it runs there.
That would break our intention behind this routine. We should run it on a cpu which we think is the best one for it power/performance wise.
So, I would like to call the sched_select_cpu() routine from this routine to get the suggested cpu.

This routine would, however, see more changes later to optimize it further.
Also, we could still add a debug option that makes it always run on other CPUs to slap developers that don't read.
I don't think we need it :( It would be a new API, with zero existing users. And so, whoever uses it should know what exactly he is doing :)
Thanks for your quick feedback.
On Mon, 2013-01-07 at 23:29 +0530, Viresh Kumar wrote:
By default, I would suggest for cache locality, that we try to keep it on the same CPU. But if there's a better CPU to run on, it runs there.
That would break our intention behind this routine. We should run it on a cpu which we think is the best one for it power/performance wise.
If you are running on a big.LITTLE box, sure. But for normal (read x86 ;) systems, it should probably stay on the current CPU.
So, i would like to call the sched_select_cpu() routine from this routine to get the suggested cpu.
Does the caller need to know? Or if you have a big.LITTLE setup, it should just work automagically?
This routine would however would see more changes later to optimize it more.
Also, we could still add a debug option that makes it always run on other CPUs to slap developers that don't read.
I don't think we need it :( It would be a new API, with zero existing users. And so, whomsoever uses it, should know what exactly he is doing :)
Heh, you don't know kernel developers well do you? ;-)
Once a new API is added to the kernel tree, it quickly becomes (mis)used.
-- Steve
Hi Steven,
On 8 January 2013 03:59, Steven Rostedt rostedt@goodmis.org wrote:
On Mon, 2013-01-07 at 23:29 +0530, Viresh Kumar wrote:
By default, I would suggest for cache locality, that we try to keep it on the same CPU. But if there's a better CPU to run on, it runs there.
That would break our intention behind this routine. We should run it on a cpu which we think is the best one for it power/performance wise.
If you are running on a big.Little box sure. But for normal (read x86 ;) systems, it should probably stay on the current CPU.
But that's not the purpose of this new call. If the caller wants it to be on the local cpu, he must not use this call. It is up to the core routine (sched_select_cpu() in our case) to decide where to launch it. If we need something special for x86, we can hack this routine.
Even for non-big.LITTLE systems, we may want to run a work on a non-idle cpu, and so we can't guarantee the local cpu here.
So, i would like to call the sched_select_cpu() routine from this routine to get the suggested cpu.
Does the caller need to know? Or if you have a big.LITTLE setup, it should just work automagically?
The caller wouldn't know; he just needs to trust sched_select_cpu() to give the most optimal cpu.

Again, it is not only for big.LITTLE, but for any SMP system where we don't want an idle cpu to do this work.
I don't think we need it :( It would be a new API, with zero existing users. And so, whomsoever uses it, should know what exactly he is doing :)
Heh, you don't know kernel developers well do you? ;-)
I agree with you, but we don't need to care for foolish new code here.
Once a new API is added to the kernel tree, it quickly becomes (mis)used.
It's true of all pieces of code in the kernel, and we really don't need to take care of such users here :)
Hello, Viresh.
On Mon, Jan 07, 2013 at 03:28:33PM +0530, Viresh Kumar wrote:
Firstly the root cause of this patchset.
Myself and some others in Linaro are working on ARM future cores: big.LITTLE systems. Here we have few very powerful, high power consuming cores (big, currently A15's) and few very efficient ones (LITTLE, currently A7's).
The ultimate goal is to save as much power as possible without compromising much with performance. For, that we wanted most of the stuff to run on LITTLE cores and some performance-demanding stuff on big Cores. There are multiple things going around in this direction. Now, we thought A15's or big cores shouldn't be used for running small tasks like timers/workqueues and hence this patch is an attempt towards reaching that goal.
I see. My understanding of big.little is very superficial so there are very good chances that I'm utterly wrong, but to me it seems like more of a stop-gap solution than something which will have staying power in the sense that the complexity doesn't seem to have any inherent reason other than the failure to make the big ones power efficient enough while idle or under minor load.
Maybe this isn't specific to big.little and heterogeneous cores will be the way of future with big.little being just a forefront of the trend. And even if that's not the case, it definitely doesn't mean that we can't accomodate big.little, but, at this point, it does make me a bit more reluctant to make wholesale changes specifically for big.little.
Over that we can do some load balancing of works within multiple alive cpus, so that it can get done quickly. Also, we shouldn't start using an idle cpu just for processing work :)
The latter part "not using idle cpu just for processing work" does apply to homogeneous systems too but as I wrote earlier work items don't spontaneously happen on an idle core. Something, including timer, needs to kick it. So, by definition, a CPU already isn't idle when a work item starts execution on it. What am I missing here?
I have another idea that we can try:
queue_work_on_any_cpu().
With this we would not break any existing code and can try to migrate old users to this new infrastructure (atleast the ones which are rearming works from their work_handlers). What do you say?
Yeah, this could be a better solution, I think. Plus, it's not like finding the optimal cpu is free.
To take care of the cache locality issue, we can pass an argument to this routine, that can provide
- the mask of cpus to schedule this work on OR
- Sched Level (SD_LEVEL) of cpus to run it.
Let's start simple for now. If we really need it, we can always add more later.
Thanks.
On Mon, Jan 7, 2013 at 8:34 PM, Tejun Heo tj@kernel.org wrote:
Hello, Viresh.
On Mon, Jan 07, 2013 at 03:28:33PM +0530, Viresh Kumar wrote:
Firstly the root cause of this patchset.
Myself and some others in Linaro are working on ARM future cores: big.LITTLE systems. Here we have few very powerful, high power consuming cores (big, currently A15's) and few very efficient ones (LITTLE, currently A7's).
The ultimate goal is to save as much power as possible without compromising much with performance. For, that we wanted most of the stuff to run on LITTLE cores and some performance-demanding stuff on big Cores. There are multiple things going around in this direction. Now, we thought A15's or big cores shouldn't be used for running small tasks like timers/workqueues and hence this patch is an attempt towards reaching that goal.
I see. My understanding of big.little is very superficial so there are very good chances that I'm utterly wrong, but to me it seems like more of a stop-gap solution than something which will have staying power in the sense that the complexity doesn't seem to have any inherent reason other than the failure to make the big ones power efficient enough while idle or under minor load.
The two processors use different manufacturing processes - one optimised for performance, the other for really low power. So the reason is physics at this point. Other architectures are known to be playing with similar schemes. ARM's big.LITTLE is just the first one to the market.
Maybe this isn't specific to big.little and heterogeneous cores will be the way of future with big.little being just a forefront of the trend. And even if that's not the case, it definitely doesn't mean that we can't accomodate big.little, but, at this point, it does make me a bit more reluctant to make wholesale changes specifically for big.little.
The patches aren't targeted to benefit only b.L systems though it was definitely the trigger for our investigations. Our hope is that after presenting more analysis results from our side we'll be able to convince the community of the general usefulness of this feature.
Here are a few short introductions to big.LITTLE in case you're interested.[1][2]
[1] http://www.arm.com/files/downloads/big.LITTLE_Final.pdf [2] http://lwn.net/Articles/481055/
[Removed Suresh and Venki from the discussion; they have probably switched companies]
On 7 January 2013 20:34, Tejun Heo tj@kernel.org wrote:
The latter part "not using idle cpu just for processing work" does apply to homogeneous systems too but as I wrote earlier work items don't spontaneously happen on an idle core. Something, including timer, needs to kick it. So, by definition, a CPU already isn't idle when a work item starts execution on it. What am I missing here?
We are talking about a core being idle from the scheduler's perspective :)
I have another idea that we can try:
queue_work_on_any_cpu().
With this we would not break any existing code and can try to migrate old users to this new infrastructure (atleast the ones which are rearming works from their work_handlers). What do you say?
Yeah, this could be a better solution, I think. Plus, it's not like finding the optimal cpu is free.
Thanks for the first part (when I shared this idea with Vincent and Amit, I wasn't at all sure about the feedback I would get from you and others, but I am very happy now :) ).

I couldn't understand the second part. We still need to search for a free cpu for this new routine. And the implementation would be almost the same as the implementation of queue_work() in my initial patch.
To take care of the cache locality issue, we can pass an argument to this routine, that can provide
- the mask of cpus to schedule this work on OR
- Sched Level (SD_LEVEL) of cpus to run it.
Let's start simple for now. If we really need it, we can always add more later.
:) Agreed. But I liked the idea from Steven: we can have two routines, queue_work_on_any_cpu() and queue_work_on_cpus().
Hello,
On Mon, Jan 07, 2013 at 11:37:22PM +0530, Viresh Kumar wrote:
On 7 January 2013 20:34, Tejun Heo tj@kernel.org wrote:
The latter part "not using idle cpu just for processing work" does apply to homogeneous systems too but as I wrote earlier work items don't spontaneously happen on an idle core. Something, including timer, needs to kick it. So, by definition, a CPU already isn't idle when a work item starts execution on it. What am I missing here?
We are talking about a core being idle from schedulers perspective :)
But it's not like cpu doesn't consume power if scheduler considers it idle, right? Can you please explain in detail how this contributes to saving power? Is it primarily about routing work items to lower power CPUs? And please don't point to presentation slides. They don't seem to explain much better and patches and the code should be able to describe themselves. Here, more so, as the desired behavior change and the resulting powersave are rather subtle.
Yeah, this could be a better solution, I think. Plus, it's not like finding the optimal cpu is free.
Thanks for the first part (When i shared this idea with Vincent and Amit, i wasn't sure at all about the feedback i will get from you and others, but i am very happy now :) ).
I couldn't understand the second part. We still need to search for a free cpu for this new routine. And the implementation would almost be same as the implementation of queue_work() in my initial patch
I meant that enforcing lookup for the "optimal" cpu on queue_work() by default would add overhead to the operation, which in cases (on most homogeneous configs) would lead to overhead without any realized benefit.
Let's start simple for now. If we really need it, we can always add more later.
:) Agreed. But i liked the idea from steven, we can have two routines: queue_work_on_any_cpu() and queue_work_on_cpus()
We can talk about specifics later but please try to *justify* each new interface. Please note that "putting work items to cpus which the scheduler considers idle" can't really be justification in itself.
Thanks.
On 10 January 2013 00:19, Tejun Heo tj@kernel.org wrote:
On Mon, Jan 07, 2013 at 11:37:22PM +0530, Viresh Kumar wrote:
We are talking about a core being idle from schedulers perspective :)
But it's not like cpu doesn't consume power if scheduler considers it idle, right? Can you please explain in detail how this contributes to saving power? Is it primarily about routing work items to lower power CPUs? And please don't point to presentation slides. They don't seem to explain much better and patches and the code should be able to describe themselves. Here, more so, as the desired behavior change and the resulting powersave are rather subtle.
I got your concerns. Firstly, when a cpu is idle from the scheduler's perspective, it consumes a lot of power.

queue_work_on_any_cpu() would queue the work on any other cpu only when the current cpu is idle from the scheduler's perspective, and this can only happen when the cpu was actually idle (in a low power state), woke up due to some interrupt/timer, and is asked to queue a work.

The idea is to choose another, non-idle cpu at this point, so that the current cpu can immediately go back into a deeper idle state. With this, cpus can stay longer in deeper idle states rather than running works.

And in cases where works are re-armed from the handler, this can cause significant power loss, which could easily be saved by pushing such work to non-idle cpus.
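For illustration, this is the kind of self re-arming work being referred to (a made-up example, not code from any real driver): once it lands on a mostly idle cpu, plain queue_delayed_work() keeps waking that same cpu up for every iteration.

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static void poll_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(poll_work, poll_fn);

static void poll_fn(struct work_struct *work)
{
	/* ... do the periodic job ... */

	/* re-arm: with the current behaviour this stays on this cpu */
	queue_delayed_work(system_wq, &poll_work, HZ);
}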
The same approach is taken for deferred timers too; they already use such a routine.
On Tue, 2012-11-06 at 16:08 +0530, Viresh Kumar wrote:
Workqueues queues work on current cpu, if the caller haven't passed a preferred cpu. This may wake up an idle CPU, which is actually not required.
This work can be processed by any CPU and so we must select a non-idle CPU here. This patch adds in support in workqueue framework to get preferred CPU details from the scheduler, instead of using current CPU.
Most of the time when a work is queued, the current cpu isn't idle and so we will choose it only. There are cases when a cpu is idle when it queues some work. For example, consider following scenario:
- A cpu has programmed a timer and is IDLE now.
- CPU gets into interrupt handler due to timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
 kernel/workqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 042d221..d32efa2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		struct global_cwq *last_gcwq;

 		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
+			cpu = sched_select_cpu(0);
A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own.
__queue_work() should be fast as there is a reason that it is delaying the work and not running it itself.
Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case.
To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on wrong CPU. If it is a timer, find a way to move the timer. If it is something else, lets work to fix that. Doing searches of possibly all CPUs (unlikely, but it is there), just seems wrong to me.
-- Steve
 		/*
 		 * It's multi cpu. If @work was previously on a different
@@ -1383,7 +1383,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 		if (gcwq)
 			lcpu = gcwq->cpu;
 		if (lcpu == WORK_CPU_UNBOUND)
-			lcpu = raw_smp_processor_id();
+			lcpu = sched_select_cpu(0);
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
 	}
On 27 November 2012 18:56, Steven Rostedt rostedt@goodmis.org wrote:
A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own.
Not sure if you missed the first check in sched_select_cpu():
+int sched_select_cpu(unsigned int sd_flags)
+{
+	/* If Current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
We aren't going to search if we aren't idle.
Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case.
We tried a KCONFIG option for it, which Tejun rejected.
To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on wrong CPU. If it is a timer, find a way to move the timer. If it is something else, lets work to fix that. Doing searches of possibly all CPUs (unlikely, but it is there), just seems wrong to me.
As Vincent pointed out, on big.LITTLE systems we just don't want to serve works on big cores. That would waste too much power, especially if we are going to wake up big cores.

It would be difficult to restrict the source driver (which queues the work) to LITTLE cores. We thought that if somebody wants to queue work on the current cpu, then they must use queue_work_on().
-- viresh
On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
On 27 November 2012 18:56, Steven Rostedt rostedt@goodmis.org wrote:
A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own.
Not sure if you missed the first check sched_select_cpu()
+int sched_select_cpu(unsigned int sd_flags)
+{
+	/* If Current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
We aren't going to search if we aren't idle.
OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-)
Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case.
We tried with an KCONFIG option for it, which Tejun rejected.
Yeah, I saw that. I don't like adding KCONFIG options either. Best is to get something working that doesn't add any regressions. If you can get this to work without making *any* regressions in the normal case than I'm totally fine with that. But if this adds any issues with the normal case, then it's a show stopper.
To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on wrong CPU. If it is a timer, find a way to move the timer. If it is something else, lets work to fix that. Doing searches of possibly all CPUs (unlikely, but it is there), just seems wrong to me.
As Vincent pointed out, on big LITTLE systems we just don't want to serve works on big cores. That would be wasting too much of power. Specially if we are going to wake up big cores.
It would be difficult to control the source driver (which queues work) to little cores. We thought, if somebody wanted to queue work on current cpu then they must use queue_work_on().
As Tejun mentioned earlier, are there any assumptions anywhere that expect an unbound workqueue not to migrate? Where per-cpu variables might be used. Tejun had a good idea of forcing this to migrate the work *every* time. To not let a work item run on the same CPU that it was queued on. If it can survive that, then it is probably OK. Maybe add a config option that forces this? That way, anyone can test that this isn't an issue.
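As an illustration only, such a debug knob could look something like the helper below, which __queue_work() would use instead of raw_smp_processor_id() when a (made-up) CONFIG_DEBUG_WQ_FORCE_MIGRATE option is enabled, so that unbound queue_work() never runs work on the queueing cpu and hidden same-cpu assumptions in work handlers get flushed out quickly:

#include <linux/cpumask.h>
#include <linux/smp.h>

static int wq_debug_pick_cpu(void)
{
	int this_cpu = raw_smp_processor_id();
	int cpu = cpumask_any_but(cpu_online_mask, this_cpu);

	/* fall back to the local cpu on single-cpu systems */
	return cpu < nr_cpu_ids ? cpu : this_cpu;
}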
-- Steve
On 27 November 2012 14:59, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
On 27 November 2012 18:56, Steven Rostedt rostedt@goodmis.org wrote:
A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own.
Not sure if you missed the first check sched_select_cpu()
+int sched_select_cpu(unsigned int sd_flags)
+{
+	/* If Current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
We aren't going to search if we aren't idle.
OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-)
By migrating it now, it will create its activity and wake up on the right CPU next time.
If migrating to any CPU seems a bit risky, we could restrict the migration to a CPU on the same node. We can pass such constraints to sched_select_cpu().
Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case.
We tried with an KCONFIG option for it, which Tejun rejected.
Yeah, I saw that. I don't like adding KCONFIG options either. Best is to get something working that doesn't add any regressions. If you can get this to work without making *any* regressions in the normal case than I'm totally fine with that. But if this adds any issues with the normal case, then it's a show stopper.
To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on wrong CPU. If it is a timer, find a way to move the timer. If it is something else, lets work to fix that. Doing searches of possibly all CPUs (unlikely, but it is there), just seems wrong to me.
As Vincent pointed out, on big LITTLE systems we just don't want to serve works on big cores. That would be wasting too much of power. Specially if we are going to wake up big cores.
It would be difficult to control the source driver (which queues work) to little cores. We thought, if somebody wanted to queue work on current cpu then they must use queue_work_on().
As Tejun has mentioned earlier, is there any assumptions anywhere that expects an unbounded work queue to not migrate? Where per cpu variables might be used. Tejun had a good idea of forcing this to migrate the work *every* time. To not let a work queue run on the same CPU that it was queued on. If it can survive that, then it is probably OK. Maybe add a config option that forces this? That way, anyone can test that this isn't an issue.
-- Steve
On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote:
On 27 November 2012 14:59, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
On 27 November 2012 18:56, Steven Rostedt rostedt@goodmis.org wrote:
A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own.
Not sure if you missed the first check sched_select_cpu()
+int sched_select_cpu(unsigned int sd_flags)
+{
+	/* If Current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
We aren't going to search if we aren't idle.
OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-)
By migrating it now, it will create its activity and wake up on the right CPU next time.
If migrating on any CPUs seems a bit risky, we could restrict the migration on a CPU on the same node. We can pass such contraints on sched_select_cpu
That's assuming that the CPUs stay idle. Now if we move the work to another CPU and it goes idle, then it may move that again. It could end up being a ping pong approach.
I don't think idle is a strong enough heuristic for the general case. If interrupts are constantly going off on a CPU that happens to be idle most of the time, it will constantly be moving work onto CPUs that are currently doing real work, and by doing so, it will be slowing those CPUs down.
-- Steve
On 27 November 2012 16:04, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote:
On 27 November 2012 14:59, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
On 27 November 2012 18:56, Steven Rostedt rostedt@goodmis.org wrote:
A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own.
Not sure if you missed the first check sched_select_cpu()
+int sched_select_cpu(unsigned int sd_flags)
+{
+	/* If Current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
We aren't going to search if we aren't idle.
OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-)
By migrating it now, it will create its activity and wake up on the right CPU next time.
If migrating on any CPUs seems a bit risky, we could restrict the migration on a CPU on the same node. We can pass such contraints on sched_select_cpu
That's assuming that the CPUs stay idle. Now if we move the work to another CPU and it goes idle, then it may move that again. It could end up being a ping pong approach.
I don't think idle is a strong enough heuristic for the general case. If interrupts are constantly going off on a CPU that happens to be idle most of the time, it will constantly be moving work onto CPUs that are currently doing real work, and by doing so, it will be slowing those CPUs down.
I agree that idle is probably not enough, but it's the heuristic that is currently used for selecting a CPU for a timer, and the timer also uses sched_select_cpu in this series. So in order to go step by step, a common interface has been introduced for selecting a CPU, and this function uses the same algorithm that the timers already do. Once we agree on an interface, the heuristic can be updated.
-- Steve
Until now, we weren't migrating a running timer because, with migration, del_timer_sync() can't detect that the timer's handler hasn't finished yet.

Now, when can we actually reach the code (inside __mod_timer()) where

base->running_timer == timer? i.e. we are trying to migrate the currently running timer.
I can see only the following case:
- The timer re-armed itself, i.e. we are currently running the interrupt handler of a timer and it re-armed itself from there. At this time the user might or might not have called del_timer_sync(). If not, then there is no harm in re-arming the timer?
Now, when somebody tries to delete a timer, he obviously doesn't want it to run any more for now. So why should we ever re-arm a timer which is scheduled for deletion?
This patch tries to fix "migration of running timer", assuming the above theory is correct :)
So now, when we get a call to del_timer_sync(), we mark the timer as scheduled for deletion in an additional variable. This is checked whenever we try to modify/arm it again. If it is currently scheduled for deletion, we must not modify/arm it.

And so, whenever we reach the situation where: base->running_timer == timer
We are sure, nobody is waiting in del_timer_sync().
We will clear this flag as soon as the timer is deleted, so that it can be started again after deleting it successfully.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
---
 include/linux/timer.h |  2 ++
 kernel/timer.c        | 42 +++++++++++++++++++++++++-----------------
 2 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..6aa720f 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -22,6 +22,7 @@ struct timer_list {
 	unsigned long data;
 	int slack;
+	int sched_del;
 #ifdef CONFIG_TIMER_STATS
 	int start_pid;
@@ -77,6 +78,7 @@ extern struct tvec_base boot_tvec_bases;
 		.data = (_data),				\
 		.base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
 		.slack = -1,					\
+		.sched_del = 0,					\
 		__TIMER_LOCKDEP_MAP_INITIALIZER(		\
 			__FILE__ ":" __stringify(__LINE__))	\
 	}
diff --git a/kernel/timer.c b/kernel/timer.c
index 1170ece..14e1f76 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -622,6 +622,7 @@ static void do_init_timer(struct timer_list *timer, unsigned int flags,
 	timer->entry.next = NULL;
 	timer->base = (void *)((unsigned long)base | flags);
 	timer->slack = -1;
+	timer->sched_del = 0;
 #ifdef CONFIG_TIMER_STATS
 	timer->start_site = NULL;
 	timer->start_pid = -1;
@@ -729,6 +730,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
base = lock_timer_base(timer, &flags);
+	if (timer->sched_del) {
+		/* Don't schedule it again, as it is getting deleted */
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	ret = detach_if_pending(timer, base, false);
 	if (!ret && pending_only)
 		goto out_unlock;
@@ -746,21 +753,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);
if (base != new_base) { - /* - * We are trying to schedule the timer on the local CPU. - * However we can't change timer's base while it is running, - * otherwise del_timer_sync() can't detect that the timer's - * handler yet has not finished. This also guarantees that - * the timer is serialized wrt itself. - */ - if (likely(base->running_timer != timer)) { - /* See the comment in lock_timer_base() */ - timer_set_base(timer, NULL); - spin_unlock(&base->lock); - base = new_base; - spin_lock(&base->lock); - timer_set_base(timer, base); - } + /* See the comment in lock_timer_base() */ + timer_set_base(timer, NULL); + spin_unlock(&base->lock); + base = new_base; + spin_lock(&base->lock); + timer_set_base(timer, base); }
timer->expires = expires; @@ -1039,9 +1037,11 @@ EXPORT_SYMBOL(try_to_del_timer_sync); */ int del_timer_sync(struct timer_list *timer) { -#ifdef CONFIG_LOCKDEP + struct tvec_base *base; unsigned long flags;
+#ifdef CONFIG_LOCKDEP + /* * If lockdep gives a backtrace here, please reference * the synchronization rules above. @@ -1051,6 +1051,12 @@ int del_timer_sync(struct timer_list *timer) lock_map_release(&timer->lockdep_map); local_irq_restore(flags); #endif + + /* Timer is scheduled for deletion, don't let it re-arm itself */ + base = lock_timer_base(timer, &flags); + timer->sched_del = 1; + spin_unlock_irqrestore(&base->lock, flags); + /* * don't use it in hardirq context, because it * could lead to deadlock. @@ -1058,8 +1064,10 @@ int del_timer_sync(struct timer_list *timer) WARN_ON(in_irq() && !tbase_get_irqsafe(timer->base)); for (;;) { int ret = try_to_del_timer_sync(timer); - if (ret >= 0) + if (ret >= 0) { + timer->sched_del = 0; return ret; + } cpu_relax(); } }
[ Added John Stultz ]
On Tue, 2012-11-06 at 16:08 +0530, Viresh Kumar wrote:
Until now, we haven't been migrating a running timer because, with migration, del_timer_sync() can't detect that the timer's handler has not yet finished.
Now, when can we actually reach the code (inside __mod_timer()) where
base->running_timer == timer, i.e. where we are trying to migrate the currently running timer?
I can see only the following case:
The timer re-armed itself, i.e. we are currently running a timer's handler and it re-armed itself from there. At that point the user might or might not have called del_timer_sync(); if not, there is no harm in re-arming the timer.
Now, when somebody tries to delete a timer, they obviously don't want it to run any more. So why should we ever re-arm a timer that is scheduled for deletion?
This patch tries to fix "migration of a running timer", assuming the above theory is correct :)
That's a question for Thomas or John (hello! Thomas or John :-)
So now, when we get a call to del_timer_sync(), we mark the timer as scheduled for deletion in an additional field. This field is checked whenever we try to modify/arm the timer again; if the timer is currently scheduled for deletion, we must not modify/arm it.
And so, whenever we reach the situation where base->running_timer == timer,
we can be sure that nobody is waiting in del_timer_sync().
We clear this flag as soon as the timer is deleted, so that it can be started again after it has been deleted successfully.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
 include/linux/timer.h |  2 ++
 kernel/timer.c        | 42 +++++++++++++++++++++++++-----------------
 2 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..6aa720f 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -22,6 +22,7 @@ struct timer_list {
 	unsigned long data;

 	int slack;
+	int sched_del;
Make that a bool, as it's just a flag. Maybe gcc can optimize or something.
 #ifdef CONFIG_TIMER_STATS
 	int start_pid;
@@ -77,6 +78,7 @@ extern struct tvec_base boot_tvec_bases;
 		.data = (_data), \
 		.base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
 		.slack = -1, \
+		.sched_del = 0, \
 		__TIMER_LOCKDEP_MAP_INITIALIZER( \
 			__FILE__ ":" __stringify(__LINE__)) \
 	}

diff --git a/kernel/timer.c b/kernel/timer.c
index 1170ece..14e1f76 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -622,6 +622,7 @@ static void do_init_timer(struct timer_list *timer, unsigned int flags,
 	timer->entry.next = NULL;
 	timer->base = (void *)((unsigned long)base | flags);
 	timer->slack = -1;
+	timer->sched_del = 0;
 #ifdef CONFIG_TIMER_STATS
 	timer->start_site = NULL;
 	timer->start_pid = -1;
@@ -729,6 +730,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,

 	base = lock_timer_base(timer, &flags);

+	if (timer->sched_del) {
+		/* Don't schedule it again, as it is getting deleted */
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	ret = detach_if_pending(timer, base, false);
 	if (!ret && pending_only)
 		goto out_unlock;
@@ -746,21 +753,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);

 	if (base != new_base) {
-		/*
-		 * We are trying to schedule the timer on the local CPU.
-		 * However we can't change timer's base while it is running,
-		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
-		 */
-		if (likely(base->running_timer != timer)) {
-			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
-			spin_unlock(&base->lock);
-			base = new_base;
-			spin_lock(&base->lock);
-			timer_set_base(timer, base);
-		}
+		/* See the comment in lock_timer_base() */
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
 	}

 	timer->expires = expires;
@@ -1039,9 +1037,11 @@ EXPORT_SYMBOL(try_to_del_timer_sync);
  */
 int del_timer_sync(struct timer_list *timer)
 {
-#ifdef CONFIG_LOCKDEP
+	struct tvec_base *base;
 	unsigned long flags;

+#ifdef CONFIG_LOCKDEP
+	/*
 	 * If lockdep gives a backtrace here, please reference
 	 * the synchronization rules above.
 	 */
@@ -1051,6 +1051,12 @@ int del_timer_sync(struct timer_list *timer)
 	lock_map_release(&timer->lockdep_map);
 	local_irq_restore(flags);
 #endif
+
+	/* Timer is scheduled for deletion, don't let it re-arm itself */
+	base = lock_timer_base(timer, &flags);
+	timer->sched_del = 1;
+	spin_unlock_irqrestore(&base->lock, flags);
I don't think this is good enough. For one thing, it doesn't handle try_to_del_timer_sync(), or even del_timer_sync() for that matter, as they may return success when the timer happens to be running on another CPU.
We have this:
CPU0                            CPU1
----                            ----
timerA (running)
                                mod_timer(timerA)
                                [ migrate to CPU2 ]
                                release timer base lock
                                del_timer_sync(timerA)
                                timer->sched_del = true
                                try_to_del_timer_sync(timerA)
                                base(CPU2)->timer != timerA [TRUE!]
timerA (finishes)
Fail!
-- Steve
 	/*
 	 * don't use it in hardirq context, because it
 	 * could lead to deadlock.
@@ -1058,8 +1064,10 @@ int del_timer_sync(struct timer_list *timer)
 	WARN_ON(in_irq() && !tbase_get_irqsafe(timer->base));
 	for (;;) {
 		int ret = try_to_del_timer_sync(timer);
-		if (ret >= 0)
+		if (ret >= 0) {
+			timer->sched_del = 0;
 			return ret;
+		}
 		cpu_relax();
 	}
 }
On 27 November 2012 19:17, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2012-11-06 at 16:08 +0530, Viresh Kumar wrote:
diff --git a/kernel/timer.c b/kernel/timer.c
@@ -729,6 +730,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,

 	base = lock_timer_base(timer, &flags);

+	if (timer->sched_del) {
+		/* Don't schedule it again, as it is getting deleted */
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	ret = detach_if_pending(timer, base, false);
 	if (!ret && pending_only)
 		goto out_unlock;
@@ -746,21 +753,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);

 	if (base != new_base) {
-		/*
-		 * We are trying to schedule the timer on the local CPU.
-		 * However we can't change timer's base while it is running,
-		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
-		 */
-		if (likely(base->running_timer != timer)) {
-			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
-			spin_unlock(&base->lock);
-			base = new_base;
-			spin_lock(&base->lock);
-			timer_set_base(timer, base);
-		}
+		/* See the comment in lock_timer_base() */
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
 	}
I don't think this is good enough. For one thing, it doesn't handle try_to_del_timer_sync(), or even del_timer_sync() for that matter, as they may return success when the timer happens to be running on another CPU.
We have this:
CPU0                            CPU1
----                            ----
timerA (running)
                                mod_timer(timerA)
                                [ migrate to CPU2 ]
                                release timer base lock
                                del_timer_sync(timerA)
                                timer->sched_del = true
                                try_to_del_timer_sync(timerA)
                                base(CPU2)->timer != timerA [TRUE!]
timerA (finishes)
Fail!
Hi Steven/Thomas,
I came back to this patch after completing some other stuff and posting wq part of this patchset separately.
I got your point and understand how this would fail.
@Thomas: I need your opinion first. Do you like this concept of migrating a running timer or not? Or do you see some basic problem with the concept?
If not (i.e. if I can go ahead with another version), then I have a solution to fix the earlier problems reported by Steven:
The problem lies with del_timer_sync(), which just checks base->running_timer != timer to see whether the timer is currently running.
What if we add another field to struct timer_list that stores whether we are running the timer's callback? Before calling the callback in the timer core, we would set this field, and we would reset it after the callback finishes.
del_timer_sync() will have something like:
	if (base->running_timer != timer)
		remove timer and return;
	else if (timer->running_callback)
		go back to its loop...
So, with my existing patch plus this change, del_timer_sync() will not return until the callback has completed on CPU0.
But what can happen now is that base->running_timer == timer can be true for two CPUs simultaneously: CPU0 (still running the callback) and CPU2 (where the migrated timer is now armed). Will that cause any issues?
-- viresh
[Steven replied to a personal Ping!!, including everybody again]
On 9 April 2013 19:25, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2013-04-09 at 14:05 +0530, Viresh Kumar wrote:
Ping!!
Remind me again. What problem are you trying to solve?
I was trying to migrate a running timer which arms itself, so that we don't keep a cpu busy just for servicing this timer.
On 20 March 2013 20:43, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Steven/Thomas,
I came back to this patch after completing some other stuff and posting wq part of this patchset separately.
I got your point and understand how this would fail.
@Thomas: I need your opinion first. Do you like this concept of migrating running timer or not? Or you see some basic problem with this concept?
I'll let Thomas answer this, but to me, this sounds really racy.
Sure.
If no (i.e. i can go ahead with another version), then i have some solution to fix earlier problems reported by Steven:
The problem lies with del_timer_sync() which just checks base->running_timer != timer to check if timer is currently running or not.
What if we add another variable in struct timer_list, that will store if we are running timer callback or not. And so, before we call callback in timer core, we will set this variable and will reset it after finishing callback.
del_timer_sync() will have something like:
if (base->running_timer != timer) remove timer and return;
For example, this didn't fix the issue. You removed the timer when it was still running, because base->running_timer did not equal timer.
You are correct and I was stupid. I wanted to write this instead:
del_timer_sync() will have something like:
	if (base->running_timer != timer)
		if (timer->running_callback)
			go back to its loop...
		else
			remove timer and return;
i.e. if the timer isn't running on its base CPU, just check whether its callback is executing somewhere else due to migration.
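A minimal sketch of that proposal, assuming a hypothetical timer->running_callback field which the timer core would set just before call_timer_fn() and clear right after it returns (locking and memory-barrier details are ignored here):

int del_timer_sync(struct timer_list *timer)
{
	for (;;) {
		int ret = try_to_del_timer_sync(timer);

		/*
		 * try_to_del_timer_sync() only looks at the timer's current
		 * base, so additionally wait until the callback is no longer
		 * executing on the CPU the timer was migrated away from.
		 * timer->running_callback is the hypothetical new field.
		 */
		if (ret >= 0 && !timer->running_callback)
			return ret;

		cpu_relax();
	}
}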
On 9 April 2013 20:22, Viresh Kumar viresh.kumar@linaro.org wrote:
Ping!!
On 24 April 2013 16:52, Viresh Kumar viresh.kumar@linaro.org wrote:
Ping!!
On Mon, 13 May 2013, Viresh Kumar wrote:
On 24 April 2013 16:52, Viresh Kumar viresh.kumar@linaro.org wrote:
On 9 April 2013 20:22, Viresh Kumar viresh.kumar@linaro.org wrote:
[Steven replied to a personal Ping!!, including everybody again]
On 9 April 2013 19:25, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 2013-04-09 at 14:05 +0530, Viresh Kumar wrote:
Ping!!
Remind me again. What problem are you trying to solve?
I was trying to migrate a running timer which arms itself, so that we don't keep a cpu busy just for servicing this timer.
Which mechanism is migrating the timer away?
On 20 March 2013 20:43, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Steven/Thomas,
I came back to this patch after completing some other stuff and posting wq part of this patchset separately.
I got your point and understand how this would fail.
@Thomas: I need your opinion first. Do you like this concept of migrating running timer or not? Or you see some basic problem with this concept?
I have no objections to the functionality per se, but the proposed solution is not going to fly.
Aside of bloating the data structure you're changing the semantics of __mod_timer(). No __mod_timer() caller can deal with -EBUSY. So you'd break the world and some more.
Here is a list of questions:
- Which mechanism migrates timers?
- How is that mechanism triggered?
- How does that deal with CPU bound timers?
Thanks,
tglx
Sorry for being late in replying to your queries.
On 13 May 2013 16:05, Thomas Gleixner tglx@linutronix.de wrote:
Which mechanism is migrating the timer away?
It will be the same: get_nohz_timer_target(), which will decide the target CPU for migration.
I have no objections to the functionality per se, but the proposed solution is not going to fly.
Aside of bloating the data structure you're changing the semantics of __mod_timer(). No __mod_timer() caller can deal with -EBUSY. So you'd break the world and some more.
Ahh.. That idea was dropped already.
Here is a list of questions:
- Which mechanism migrates timers? - How is that mechanism triggered?
The mechanism remains the same as for non-rearming timers, i.e. get_nohz_timer_target().
We are just trying to provide an approach with which we can migrate running timers, i.e. those which re-arm themselves from their handlers.
- How does that deal with CPU bound timers?
We will still check the 'pinned' flag for such a timer, as is done for any other normal timer. So we don't migrate pinned timers.
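Roughly, the target-CPU selection in __mod_timer() already looks like the sketch below (paraphrased from the mainline code of that time, not from this series), so a pinned timer never asks for another base and is therefore never migrated:

	cpu = smp_processor_id();

	/*
	 * Paraphrased sketch: only non-pinned timers are ever given another
	 * CPU. With this series the idle_cpu() check moves into the common
	 * selection helper used by get_nohz_timer_target().
	 */
	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
		cpu = get_nohz_timer_target();

	new_base = per_cpu(tvec_bases, cpu);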
So, this is the clean draft for the idea I had.. (Naming is poor for now):
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..ad00ebe 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -20,6 +20,7 @@ struct timer_list {
 	void (*function)(unsigned long);
 	unsigned long data;
+	int wait_for_migration_to_complete;

 	int slack;

diff --git a/kernel/timer.c b/kernel/timer.c
index a860bba..7791f28 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -746,21 +746,15 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);

 	if (base != new_base) {
-		/*
-		 * We are trying to schedule the timer on the local CPU.
-		 * However we can't change timer's base while it is running,
-		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
-		 */
-		if (likely(base->running_timer != timer)) {
-			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
-			spin_unlock(&base->lock);
-			base = new_base;
-			spin_lock(&base->lock);
-			timer_set_base(timer, base);
-		}
+		if (base->running_timer == timer)
+			timer->wait_for_migration_to_complete = 1;
+
+		/* See the comment in lock_timer_base() */
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
 	}

 	timer->expires = expires;
@@ -990,7 +984,8 @@ int try_to_del_timer_sync(struct timer_list *timer)

 	base = lock_timer_base(timer, &flags);

-	if (base->running_timer != timer) {
+	if ((base->running_timer != timer) &&
+	    !timer->wait_for_migration_to_complete) {
 		timer_stats_timer_clear_start_info(timer);
 		ret = detach_if_pending(timer, base, true);
 	}
@@ -1183,6 +1178,8 @@ static inline void __run_timers(struct tvec_base *base)
 				call_timer_fn(timer, fn, data);
 				spin_lock_irq(&base->lock);
 			}
+			if (timer->wait_for_migration_to_complete)
+				timer->wait_for_migration_to_complete = 0;
 		}
 	}
 	base->running_timer = NULL;
Please see whether it is a junk idea or has some light of hope :)
On Wed, May 22, 2013 at 02:04:16PM +0530, Viresh Kumar wrote:
So, this is the clean draft for the idea I had.. (Naming is poor for now):
diff --git a/include/linux/timer.h b/include/linux/timer.h index 8c5a197..ad00ebe 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -20,6 +20,7 @@ struct timer_list {
void (*function)(unsigned long); unsigned long data;
int wait_for_migration_to_complete;
If you play games with the alignment constraints of struct tvec_base you can get extra low bits to play with for TIMER_FLAG_MASK. See struct pool_workqueue for similar games.
That would avoid the struct bloat and I suppose make tglx happy :-)
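A sketch of what that could look like (all names below are hypothetical, and TIMER_FLAG_MASK would have to grow to cover the new bit): struct tvec_base is aligned well enough that the low bits of timer->base are already used for the existing TIMER_DEFERRABLE/TIMER_IRQSAFE flags, so a "migrating" state could claim the next free bit instead of adding a new struct member:

#define TIMER_MIGRATING		0x4LU	/* hypothetical new flag bit */

static inline void timer_set_migrating(struct timer_list *timer)
{
	timer->base = (struct tvec_base *)
			((unsigned long)timer->base | TIMER_MIGRATING);
}

static inline bool timer_is_migrating(struct timer_list *timer)
{
	return (unsigned long)timer->base & TIMER_MIGRATING;
}

static inline void timer_clear_migrating(struct timer_list *timer)
{
	timer->base = (struct tvec_base *)
			((unsigned long)timer->base & ~TIMER_MIGRATING);
}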
On 22 May 2013 14:36, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 22, 2013 at 02:04:16PM +0530, Viresh Kumar wrote:
So, this is the clean draft for the idea I had.. (Naming is poor for now):
diff --git a/include/linux/timer.h b/include/linux/timer.h index 8c5a197..ad00ebe 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -20,6 +20,7 @@ struct timer_list {
void (*function)(unsigned long); unsigned long data;
int wait_for_migration_to_complete;
If you play games with the alignment constraints of struct tvec_base you can get extra low bits to play with for TIMER_FLAG_MASK. See struct pool_workqueue for similar games.
That would avoid the struct bloat and I suppose make tglx happy :-)
Sure, once the initial draft is approved we can get it optimized to the best of our capabilities. :)
On 22 May 2013 14:04, Viresh Kumar viresh.kumar@linaro.org wrote:
Ping!!
On 31 May 2013 16:19, Viresh Kumar viresh.kumar@linaro.org wrote:
Ping!!
On 18 June 2013 10:21, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Thomas,
Do you still see some light of hope in this patch :) ?
-- viresh
On 24 July 2013 14:47, Viresh Kumar viresh.kumar@linaro.org wrote:
On 18 June 2013 10:21, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Thomas,
Do you still see some light of hope in this patch :) ?
Ping!!
* Viresh Kumar | 2013-08-07 15:25:55 [+0530]:
On 24 July 2013 14:47, Viresh Kumar viresh.kumar@linaro.org wrote:
On 18 June 2013 10:21, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Thomas,
Do you still see some light of hope in this patch :) ?
Ping!!
tglx, could you please take a look at this patch / thread?
Sebastian
On 4 October 2013 18:09, Sebastian Andrzej Siewior bigeasy@linutronix.de wrote:
tglx, could you please take a look at this patch / thread?
Thomas, Ping!!
On 6 November 2012 16:08, Viresh Kumar viresh.kumar@linaro.org wrote:
Ping!!
-- viresh
On Mon, 2012-11-26 at 20:30 +0530, Viresh Kumar wrote:
This is a really bad time of year to post new patches :-/ A lot of people are trying to get their own work done by year end and then there's holidays and such that are also distractions. Not to mention that a new merge window will be opening soon.
That said...
As workqueues are set off by the CPU that queued it, what real benefit does this give? A CPU was active when it queued the work and the work should be done before it gets back to sleep.
OK, an interrupt happens on an idle CPU and queues some work. That work should execute before the CPU gets back to sleep, right? I fail to see the benefit of trying to move that work elsewhere. The CPU had to wake up to execute the interrupt. It's no longer in a deep sleep (or any sleep for that matter).
To me it seems best to avoid waking up an idle CPU in the first place.
I'm still working off a turkey overdose, so maybe I'm missing something obvious.
-- Steve
On Mon, Nov 26, 2012 at 11:40:27AM -0500, Steven Rostedt wrote:
This is a really bad time of year to post new patches :-/ A lot of people are trying to get their own work done by year end and then there's holidays and such that are also distractions. Not to mention that a new merge window will be opening soon.
That said...
As workqueues are set off by the CPU that queued it, what real benefit does this give? A CPU was active when it queued the work and the work should be done before it gets back to sleep.
OK, an interrupt happens on an idle CPU and queues some work. That work should execute before the CPU gets back to sleep, right? I fail to see the benefit of trying to move that work elsewhere. The CPU had to wake up to execute the interrupt. It's no longer in a deep sleep (or any sleep for that matter).
To me it seems best to avoid waking up an idle CPU in the first place.
I'm still working off a turkey overdose, so maybe I'm missing something obvious.
If I understand correctly (though also suffering turkey OD), the idea is to offload work to more energy-efficient CPUs.
Thanx, Paul
On Mon, 2012-11-26 at 09:03 -0800, Paul E. McKenney wrote:
If I understand correctly (though also suffering turkey OD), the idea is to offload work to more energy-efficient CPUs.
This is determined by a CPU that isn't running the idle task? Is it because a CPU that just woke up may be running at a lower freq, and thus not as efficient? But pushing off to another CPU may cause cache misses as well. Wouldn't that also be a factor in efficiencies, if a CPU is stalled waiting for memory to be loaded?
I should also ask the obvious: have these patches shown real-world efficiency gains, or is this just a theory? Do these patches actually improve battery life when applied?
Just asking.
-- Steve
On Mon, Nov 26, 2012 at 12:35:52PM -0500, Steven Rostedt wrote:
On Mon, 2012-11-26 at 09:03 -0800, Paul E. McKenney wrote:
If I understand correctly (though also suffering turkey OD), the idea is to offload work to more energy-efficient CPUs.
This is determined by a CPU that isn't running the idle task? Is it because a CPU that just woke up may be running at a lower freq, and thus not as efficient? But pushing off to another CPU may cause cache misses as well. Wouldn't that also be a factor in efficiencies, if a CPU is stalled waiting for memory to be loaded?
Two different microarchitectures -- same instruction set (at user level, anyway), but different power/performance characteristics. One set is optimized for performance, the other for energy efficiency. For example, ARM's big.LITTLE architecture.
I should also ask the obvious. Has these patches shown real world efficiencies or is this just a theory? Do these patches actually improve battery life when applied?
I must defer to Viresh on this one.
Thanx, Paul
Just asking.
-- Steve
On Mon, 2012-11-26 at 11:03 -0800, Paul E. McKenney wrote:
On Mon, Nov 26, 2012 at 12:35:52PM -0500, Steven Rostedt wrote:
On Mon, 2012-11-26 at 09:03 -0800, Paul E. McKenney wrote:
If I understand correctly (though also suffering turkey OD), the idea is to offload work to more energy-efficient CPUs.
This is determined by a CPU that isn't running the idle task? Is it because a CPU that just woke up may be running at a lower freq, and thus not as efficient? But pushing off to another CPU may cause cache misses as well. Wouldn't that also be a factor in efficiencies, if a CPU is stalled waiting for memory to be loaded?
Two different microarchitectures -- same instruction set (at user level, anyway), but different power/performance characteristics. One set is optimized for performance, the other for energy efficiency. For example, ARM's big.LITTLE architecture.
But I don't see anything in the patch set that guarantees that we will be moving something off to one of the "big" cores. The only heuristic that is used is if the CPU is idle or not. In fact, if the big core was idle we may be pushing work off to a "LITTLE" CPU.
Again, work is only run on the CPU it was queued on (which this patch set is trying to change). Thus, the only work that would be queued on a LITTLE CPU is work queued by something that already ran on that CPU.
I'm still having a bit of trouble understanding where the benefit comes from.
-- Steve
Hi Steven,
Thanks for sharing your opinion. :)
As this turned out to be a long discussion thread (thanks Paul), I will try to answer everything here.
On 26 November 2012 22:10, Steven Rostedt rostedt@goodmis.org wrote:
This is a really bad time of year to post new patches :-/ A lot of people are trying to get their own work done by year end and then there's holidays and such that are also distractions. Not to mention that a new merge window will be opening soon.
The patches have been around since the end of September and this was just a ping (admittedly with bad timing - the merge window).
As workqueues are set off by the CPU that queued it, what real benefit does this give? A CPU was active when it queued the work and the work should be done before it gets back to sleep.
OK, an interrupt happens on an idle CPU and queues some work. That work should execute before the CPU gets back to sleep, right? I fail to see the benefit of trying to move that work elsewhere. The CPU had to wake up to execute the interrupt. It's no longer in a deep sleep (or any sleep for that matter).
To me it seems best to avoid waking up an idle CPU in the first place.
I'm still working off a turkey overdose, so maybe I'm missing something obvious.
Ok, here is the story behind these patches. The idea was first discussed by Vincent in LPC this year:
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sch...
Specifically slides: 12 & 19.
CPU idleness here means idleness from the scheduler's perspective, i.e. for the scheduler a CPU is idle if all of the below are true (see the sketch just below):
- the current task is the idle task
- nr_running == 0
- the wake_list is empty
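In other words, something along the lines of the following check (this mirrors what idle_cpu() tests in the scheduler at that time; it is shown here only to make the criteria concrete):

static bool cpu_idle_for_selection(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* Only the idle task is current. */
	if (rq->curr != rq->idle)
		return false;

	/* No runnable tasks are queued on this CPU. */
	if (rq->nr_running)
		return false;

#ifdef CONFIG_SMP
	/* Nothing is pending on the remote wake-up list either. */
	if (!llist_empty(&rq->wake_list))
		return false;
#endif

	return true;
}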
There are two use cases of workqueue patch 3/4:
- queue_work() from a timer interrupt which re-arms itself (slide 12): whether the timer is deferrable or not, it will queue the work on the current CPU once that CPU wakes up. The CPU could have gone back to idle immediately if the work were not queued on it. So, because this CPU is idle (from the scheduler's point of view), we can move the work to another CPU (a small illustration follows below).
- delayed work which re-arms itself (slide 19): again the same thing, we could have kept the CPU in the idle state for some more time.
There might not be many users with this behavior, but a single user can actually have a significant impact.
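For illustration, the first use case boils down to a pattern like the one below (hypothetical driver code, not taken from the series): the timer keeps queueing work onto whichever CPU its interrupt happens to wake, so without PATCH 3/4 that CPU never gets a chance to stay idle.

static void sample_work_fn(struct work_struct *work)
{
	/* some periodic housekeeping */
}

static DECLARE_WORK(sample_work, sample_work_fn);
static struct timer_list sample_timer;

static void sample_timer_fn(unsigned long data)
{
	/* Queued from a (possibly idle) CPU's timer interrupt. */
	schedule_work(&sample_work);

	/* Re-arm: this is what keeps pulling the same CPU out of idle. */
	mod_timer(&sample_timer, jiffies + 10);
}

static void sample_start(void)
{
	setup_timer(&sample_timer, sample_timer_fn, 0);
	mod_timer(&sample_timer, jiffies + 10);
}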
For now, it doesn't take care of the big.LITTLE aspects that Paul pointed out, but yes, that is planned. Some work is going on in that direction too:
http://linux.kernel.narkive.com/mCyvFVUX/rfc-0-6-sched-packing-small-tasks
The very first reason for this patchset was to have a single preferred-CPU routine which can be used by all frameworks; timers are the first user (already) and workqueues are attempting to be the second.
I tested it more from a functionality point of view than with power figures :( And I understand that those numbers are very much required.
Having said that, I believe all the questions raised are about PATCH 3/4 (workqueue), and the other three patches should be fine. Can you share your opinion on those patches? I will then split this patchset and send the workqueue part later, after doing some power measurements.
-- viresh