Hi All,
This is V2 of my sched_select_cpu() work.
In order to save power, it would be useful to schedule work onto non-IDLE cpus instead of waking up an IDLE one.
To achieve this, we need the scheduler to guide kernel frameworks (like timers & workqueues) on which CPU is most preferred for running these tasks.
This patchset is about implementing this concept.
- The first patch adds sched_select_cpu() routine which returns the preferred cpu which is non-idle.
- Second patch removes idle_cpu() calls from timer & hrtimer.
- Third patch is about adapting this change in workqueue framework.
Earlier discussions over v1 can be found here: http://www.mail-archive.com/linaro-dev@lists.linaro.org/msg13342.html
Earlier discussions over this concept were done at last LPC: http://summit.linuxplumbersconf.org/lpc-2012/meeting/90/lpc2012-sched-timer-...
Module created for testing this behavior is present here: http://git.linaro.org/gitweb?p=people/vireshk/module.git%3Ba=summary
Following are the steps followed in the test module:
1. Run a single work on each cpu.
2. This work will start a timer after x (tested with 10) jiffies of delay.
3. The timer routine queues a work (it may be called from an idle or non-idle cpu) and starts the same timer again. Step 3 is repeated n times (i.e. queuing n works, one after the other).
4. All works will call a single routine, which will count the following per cpu:
   - Total works processed by a CPU
   - Total works processed by a CPU, which were queued from it
   - Total works processed by a CPU, which weren't queued from it
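The per-cpu accounting in step 4 can be sketched in plain C. This is a hypothetical userspace model of the module's counters (names like `account_work` and `wq_stats` are illustrative, not the actual test-module code):

```c
#define NR_CPUS 5

struct wq_stats {
	unsigned int total;	/* works processed by this cpu */
	unsigned int own;	/* ...that were also queued from it */
	unsigned int migrated;	/* ...that were queued from another cpu */
};

static struct wq_stats stats[NR_CPUS];

/*
 * Called from the common work handler: @queued_on is the cpu that
 * queued the work, @processed_on is the cpu actually running it.
 */
static void account_work(int queued_on, int processed_on)
{
	stats[processed_on].total++;
	if (queued_on == processed_on)
		stats[processed_on].own++;
	else
		stats[processed_on].migrated++;
}
```

With the workqueue modification applied, "Own" should stay close to "Total" and "migrated" should stay small, which is what the results below show.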
Setup:
-----
- ARM Vexpress TC2 - big.LITTLE CPU
- Core 0-1: A15, 2-4: A7
- rootfs: linaro-ubuntu-nano
Results:
-------

Without Workqueue Modification, i.e. PATCH 3/3:
[ 2493.022335] Workqueue Analyser: works processsed by CPU0, Total: 1000, Own: 0, migrated: 0
[ 2493.047789] Workqueue Analyser: works processsed by CPU1, Total: 1000, Own: 0, migrated: 0
[ 2493.072918] Workqueue Analyser: works processsed by CPU2, Total: 1000, Own: 0, migrated: 0
[ 2493.098576] Workqueue Analyser: works processsed by CPU3, Total: 1000, Own: 0, migrated: 0
[ 2493.123702] Workqueue Analyser: works processsed by CPU4, Total: 1000, Own: 0, migrated: 0
With Workqueue Modification, i.e. PATCH 3/3:
[ 2493.022335] Workqueue Analyser: works processsed by CPU0, Total: 1002, Own: 999, migrated: 3
[ 2493.047789] Workqueue Analyser: works processsed by CPU1, Total: 998, Own: 997, migrated: 1
[ 2493.072918] Workqueue Analyser: works processsed by CPU2, Total: 1013, Own: 996, migrated: 17
[ 2493.098576] Workqueue Analyser: works processsed by CPU3, Total: 998, Own: 993, migrated: 5
[ 2493.123702] Workqueue Analyser: works processsed by CPU4, Total: 989, Own: 987, migrated: 2
V1->V2
-----
- New SD_* macros removed now and earlier ones used
- sched_select_cpu() rewritten; it now includes the check on the current cpu's idleness.
- idle_cpu() calls from timer and hrtimer removed now.
- Patch 2/3 from V1 removed, as it doesn't apply to the latest workqueue branch from tejun.
- CONFIG_MIGRATE_WQ removed and so is wq_select_cpu()
- sched_select_cpu() called only from __queue_work()
- got tejun/for-3.7 branch in my tree, before making workqueue changes.
Viresh Kumar (3):
  sched: Create sched_select_cpu() to give preferred CPU for power saving
  timer: hrtimer: Don't check idle_cpu() before calling get_nohz_timer_target()
  workqueue: Schedule work on non-idle cpu instead of current one
 include/linux/sched.h |   16 ++++++++++--
 kernel/hrtimer.c      |    2 +-
 kernel/sched/core.c   |   69 +++++++++++++++++++++++++++++++--------------------
 kernel/timer.c        |    9 ++++---
 kernel/workqueue.c    |    2 +-
 5 files changed, 63 insertions(+), 35 deletions(-)
In order to save power, it would be useful to schedule work onto non-IDLE cpus instead of waking up an IDLE one.
To achieve this, we need the scheduler to guide kernel frameworks (like timers & workqueues) on which CPU is most preferred for running these tasks.
This routine returns the preferred non-idle cpu. It accepts a bitwise OR of SD_* flags present in linux/sched.h. If the local CPU isn't idle, it is returned. If it is idle, then we must look for another CPU which has all the passed flags set. Also, as this activity is part of load balancing only, SD_LOAD_BALANCE must also be set for the selected domain.
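The selection logic described above can be modelled in userspace roughly as follows. This is a simplified, hypothetical sketch: domains are flag/span pairs, idleness is a bitmap, and the flag value is illustrative; the real implementation walks sched domains under RCU (see the patch below):

```c
#define SD_LOAD_BALANCE 0x0001	/* flag value illustrative only */

struct domain {
	unsigned int flags;	/* SD_* flags set for this domain */
	int span[8];		/* cpus in the domain, -1 terminated */
};

/*
 * Simplified model of sched_select_cpu(): return the local cpu if it
 * isn't idle; otherwise walk the domains (innermost first) that have
 * all of sd_flags | SD_LOAD_BALANCE set and return the first non-idle
 * cpu found. A set bit in @idle_mask means that cpu is idle.
 */
static int sched_select_cpu_model(int this_cpu, unsigned long idle_mask,
				  const struct domain *domains,
				  int nr_domains, unsigned int sd_flags)
{
	int d, i, cpu;

	if (!(idle_mask & (1UL << this_cpu)))
		return this_cpu;

	sd_flags |= SD_LOAD_BALANCE;

	for (d = 0; d < nr_domains; d++) {
		if ((domains[d].flags & sd_flags) != sd_flags)
			continue;
		for (i = 0; (cpu = domains[d].span[i]) != -1; i++)
			if (!(idle_mask & (1UL << cpu)))
				return cpu;
	}
	return this_cpu;	/* everyone idle: fall back to local cpu */
}
```

Note the fallback: when every cpu in every matching domain is idle, the local cpu is returned, so timers aren't pushed onto some other equally-idle cpu.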
This patch reuses the code from the get_nohz_timer_target() routine, which had a similar implementation. get_nohz_timer_target() is also modified to use sched_select_cpu() now.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/sched.h |   16 ++++++++++--
 kernel/sched/core.c   |   69 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 56 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0059212..d2722cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -272,14 +272,26 @@ extern void init_idle_bootup_task(struct task_struct *idle);
extern int runqueue_is_locked(int cpu);
-#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
+#ifdef CONFIG_SMP
+extern int sched_select_cpu(unsigned int sd_flags);
+
+#ifdef CONFIG_NO_HZ
 extern void select_nohz_load_balancer(int stop_tick);
 extern void set_cpu_sd_state_idle(void);
-extern int get_nohz_timer_target(void);
+/*
+ * In the semi idle case, use the nearest busy cpu for migrating timers
+ * from an idle cpu. This is good for power-savings.
+ *
+ * We don't do similar optimization for completely idle system, as
+ * selecting an idle cpu will add more delays to the timers than intended
+ * (as that cpu's timer base may not be uptodate wrt jiffies etc).
+ */
+#define get_nohz_timer_target()	sched_select_cpu(0)
 #else
 static inline void select_nohz_load_balancer(int stop_tick) { }
 static inline void set_cpu_sd_state_idle(void) { }
 #endif
+#endif /* CONFIG_SMP */
 /*
  * Only dump TASK_* tasks. (0 for all tasks)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de97083..d573183 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -542,33 +542,6 @@ void resched_cpu(int cpu)
 #ifdef CONFIG_NO_HZ
 /*
- * In the semi idle case, use the nearest busy cpu for migrating timers
- * from an idle cpu. This is good for power-savings.
- *
- * We don't do similar optimization for completely idle system, as
- * selecting an idle cpu will add more delays to the timers than intended
- * (as that cpu's timer base may not be uptodate wrt jiffies etc).
- */
-int get_nohz_timer_target(void)
-{
-	int cpu = smp_processor_id();
-	int i;
-	struct sched_domain *sd;
-
-	rcu_read_lock();
-	for_each_domain(cpu, sd) {
-		for_each_cpu(i, sched_domain_span(sd)) {
-			if (!idle_cpu(i)) {
-				cpu = i;
-				goto unlock;
-			}
-		}
-	}
-unlock:
-	rcu_read_unlock();
-	return cpu;
-}
-/*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
  * which is scheduled to wake up that CPU. In case of a completely
@@ -639,6 +612,48 @@ void sched_avg_update(struct rq *rq)
 	}
 }
+/*
+ * This routine returns the preferred cpu which is non-idle. It accepts a
+ * bitwise OR of SD_* flags present in linux/sched.h. If the local CPU isn't
+ * idle, it is returned back. If it is idle, then we must look for another CPU
+ * which have all the flags passed as argument as set. Also, as this activity is
+ * part of load balancing only, SD_LOAD_BALANCE must also be set for selected
+ * domain.
+ */
+int sched_select_cpu(unsigned int sd_flags)
+{
+	struct sched_domain *sd;
+	int cpu = smp_processor_id();
+	int i;
+
+	/* If current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;
+
+	/* Add SD_LOAD_BALANCE to flags */
+	sd_flags |= SD_LOAD_BALANCE;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		/*
+		 * If sd doesn't have both sd_flags and SD_LOAD_BALANCE set,
+		 * skip sd.
+		 */
+		if ((sd->flags & sd_flags) != sd_flags)
+			continue;
+
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (!idle_cpu(i)) {
+				cpu = i;
+				goto unlock;
+			}
+		}
+	}
+unlock:
+	rcu_read_unlock();
+	return cpu;
+}
+
 #else	/* !CONFIG_SMP */
 void resched_task(struct task_struct *p)
 {
The check for the current cpu's idleness is already done in the implementation of sched_select_cpu(), which is called by get_nohz_timer_target(). So, there is no need to call idle_cpu() twice: once from sched_select_cpu() and once from timer and hrtimer before calling get_nohz_timer_target().
This patch removes calls to idle_cpu() from timer and hrtimer.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/hrtimer.c |    2 +-
 kernel/timer.c   |    9 +++++----
 2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..74bdaf6 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -159,7 +159,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+	if (!pinned && get_sysctl_timer_migration())
 		return get_nohz_timer_target();
 #endif
 	return this_cpu;
diff --git a/kernel/timer.c b/kernel/timer.c
index d5de1b2..8409d9d 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -720,7 +720,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0 , cpu;
+	int ret = 0 , cpu = 0;
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -733,12 +733,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
debug_activate(timer, expires);
-	cpu = smp_processor_id();
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+	if (!pinned && get_sysctl_timer_migration())
 		cpu = get_nohz_timer_target();
 #endif
+	if (!cpu)
+		cpu = smp_processor_id();
+
 	new_base = per_cpu(tvec_bases, cpu);
if (base != new_base) {
On 27 September 2012 14:34, Viresh Kumar viresh.kumar@linaro.org wrote:
#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
if (!pinned && get_sysctl_timer_migration())
        cpu = get_nohz_timer_target();
#endif
if (!cpu)
cpu = smp_processor_id();
Ahh... Bad change... cpu can be returned as zero by get_nohz_timer_target().
Please consider below patch here:
------------8<-------------
Subject: [PATCH] timer: hrtimer: Don't check idle_cpu() before calling get_nohz_timer_target()
Check for current cpu's idleness already done in implementation of sched_select_cpu() which is called by get_nohz_timer_target(). So, no need to call idle_cpu() twice, once from sched_select_cpu() and once from timer and hrtimer before calling get_nohz_timer_target().
This patch removes calls to idle_cpu() from timer and hrtimer.
This also reduces an extra call to smp_processor_id() when get_nohz_timer_target() is called.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/hrtimer.c |    2 +-
 kernel/timer.c   |    8 +++++---
 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..74bdaf6 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -159,7 +159,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
 #ifdef CONFIG_NO_HZ
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
+	if (!pinned && get_sysctl_timer_migration())
 		return get_nohz_timer_target();
 #endif
 	return this_cpu;
diff --git a/kernel/timer.c b/kernel/timer.c
index d5de1b2..db57606 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -733,11 +733,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
debug_activate(timer, expires);
-	cpu = smp_processor_id();
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
-	if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
+	if (!pinned && get_sysctl_timer_migration())
 		cpu = get_nohz_timer_target();
+	else
+		cpu = smp_processor_id();
+#else
+	cpu = smp_processor_id();
 #endif
 	new_base = per_cpu(tvec_bases, cpu);
The workqueue framework queues work on the current cpu if the caller hasn't passed a preferred cpu. This may wake up an idle CPU, which is not actually required.

Such work can be processed by any CPU, and so we must select a non-idle CPU here. This patch adds support in the workqueue framework to get the preferred CPU from the scheduler, instead of using the current CPU.

Most of the time when a work is queued, the current cpu isn't idle and so it will be chosen. There are cases when a cpu is idle while it queues some work. For example, consider the following scenario:
- A cpu has programmed a timer and is IDLE now.
- The CPU gets into an interrupt handler due to the timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/workqueue.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 143fd8c..5fa4ba4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		struct global_cwq *last_gcwq;
 		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
+			cpu = sched_select_cpu(0);
 		/*
 		 * It's multi cpu. If @work was previously on a different
On 27 September 2012 14:34, Viresh Kumar viresh.kumar@linaro.org wrote:
Workqueues queues work on current cpu, if the caller haven't passed a preferred cpu. This may wake up an idle CPU, which is actually not required.
This work can be processed by any CPU and so we must select a non-idle CPU here. This patch adds in support in workqueue framework to get preferred CPU details from the scheduler, instead of using current CPU.
Most of the time when a work is queued, the current cpu isn't idle and so we will choose it only. There are cases when a cpu is idle when it queues some work. For example, consider following scenario:
- A cpu has programmed a timer and is IDLE now.
- CPU gets into interrupt handler due to timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
 kernel/workqueue.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 143fd8c..5fa4ba4 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq, struct global_cwq *last_gcwq;
if (cpu == WORK_CPU_UNBOUND)
cpu = raw_smp_processor_id();
cpu = sched_select_cpu(0);
Hi Tejun,
Following change is also required with this patch:
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1383,7 +1383,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 		if (gcwq)
 			lcpu = gcwq->cpu;
 		if (lcpu == WORK_CPU_UNBOUND)
-			lcpu = raw_smp_processor_id();
+			lcpu = sched_select_cpu(0);
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
 	}
Otherwise, we will end up running the re-entrancy check, which we are trying to skip for delayed work.
-- viresh
Hello, Viresh.
On Thu, Sep 27, 2012 at 02:34:05PM +0530, Viresh Kumar wrote:
- A cpu has programmed a timer and is IDLE now.
- CPU gets into interrupt handler due to timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
I'm still a bit confused, if the CPU is already running the IRQ handler, the CPU is not idle by definition. What am I missing here?
Thanks.
On 30 September 2012 14:24, Tejun Heo tj@kernel.org wrote:
On Thu, Sep 27, 2012 at 02:34:05PM +0530, Viresh Kumar wrote:
- A cpu has programmed a timer and is IDLE now.
- CPU gets into interrupt handler due to timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.
I'm still a bit confused, if the CPU is already running the IRQ handler, the CPU is not idle by definition. What am I missing here?
Hi Tejun,
For the scheduler, a CPU is idle if all of the below are true:
- current task is the idle task
- nr_running == 0
- wake_list is empty
And while these conditions hold, there can be a timer running in the background. When we reach its interrupt handler, these conditions still hold true and the local cpu is idle.
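The point can be made concrete with a small userspace model of the idle check. This is a hypothetical sketch (struct and field names are illustrative, not the kernel's), showing that being inside an interrupt handler changes none of the three conditions:

```c
#include <stdbool.h>

/* Model of the per-cpu state the scheduler consults (names hypothetical). */
struct cpu_state {
	bool curr_is_idle_task;		/* current task is the idle task */
	unsigned int nr_running;	/* runnable tasks on this cpu's rq */
	bool wake_list_empty;		/* no pending remote wakeups */
	bool in_irq;			/* currently servicing an interrupt */
};

/*
 * Model of idle_cpu(): a cpu is idle when all three scheduler
 * conditions hold. Note that in_irq plays no part -- servicing a
 * timer interrupt does not make the cpu busy from the scheduler's
 * point of view.
 */
static bool cpu_is_idle(const struct cpu_state *c)
{
	return c->curr_is_idle_task && c->nr_running == 0 &&
	       c->wake_list_empty;
}
```

So a cpu that wakes from idle only to run a timer handler still reports as idle, and work queued from that handler can be pushed elsewhere.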
-- Viresh
Hello,
On Sun, Sep 30, 2012 at 05:46:45PM +0530, Viresh Kumar wrote:
For the scheduler CPU is idle, if all below are true:
- current task is idle task
- nr_running == 0
- wake_list is empty
And during these conditions, there can be a timer running in background. And when we reach its interrupt handler, then also these conditions hold true and local cpu is idle.
It isn't about the CPU being actually idle? Also, if it's only about timers, shouldn't it be enough to implement it for timer and delayed_work?
It would be great if you explain what you're trying to achieve how. I can't tell what you're aiming for and why that would be beneficial from these submissions.
Thanks.
On 1 October 2012 06:02, Tejun Heo tj@kernel.org wrote:
It isn't about the CPU being actually idle?
No. Being idle only from scheduler's perspective. :)
Also, if it's only about timers, shouldn't it be enough to implement it for timer and delayed_work?
What if we need a timer which must re-arm itself and also schedule a work? delayed_work will not be sufficient in that case, and we would need to use a normal work.
If I am not wrong, there can be other future users of this routine too. @Vincent: can you please comment on this?
It would be great if you explain what you're trying to achieve how. I can't tell what you're aiming for and why that would be beneficial from these submissions.
The following slides were prepared by Vincent and presented during LPC. Please have a look at them; they explain the problem statement well:
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sch...
Specifically slides: 12 & 19.
There aren't too many users with this behavior, but even a single such user is enough to keep the cpu from ever going idle, and that will result in less power saving.
-- viresh
On 1 October 2012 05:47, Viresh Kumar viresh.kumar@linaro.org wrote:
On 1 October 2012 06:02, Tejun Heo tj@kernel.org wrote:
It isn't about the CPU being actually idle?
No. Being idle only from scheduler's perspective. :)
Also, if it's only about timers, shouldn't it be enough to implement it for timer and delayed_work?
What if we need a timer, which must re-arm itself + schedule a work? delayed_work will not be sufficient in that case, and we would need to use normal work.
If i am not wrong, there can be other future users of this routine too. @Vincent: Can you please comment on this?
The goal is to consolidate the place where the selection of a target CPU for running something is done. The scheduler should select the CPU in order to have coherent behaviour across all frameworks.
A delayed work can be delayed for a long period, and the target CPU that was selected for the timer may no longer be the best place to queue a work.
The goal is to be able to migrate a work if the scheduler thinks that is necessary.
Vincent
It would be great if you explain what you're trying to achieve how. I can't tell what you're aiming for and why that would be beneficial from these submissions.
Following slides are implemented by Vincent and presented during LPC. Please have a look at them, they explain the problem statement well:
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sch...
Specifically slides: 12 & 19.
There aren't too many users with this behavior, but even a single user will be sufficient not to let the cpu get idle at all. And that will result in less power saving.
-- viresh
On 27 September 2012 14:34, Viresh Kumar viresh.kumar@linaro.org wrote:
This is V2 of my sched_select_cpu() work.
In order to save power, it would be useful to schedule work onto non-IDLE cpus instead of waking up an IDLE one.
To achieve this, we need scheduler to guide kernel frameworks (like: timers & workqueues) on which is the most preferred CPU that must be used for these tasks.
This patchset is about implementing this concept.
- The first patch adds sched_select_cpu() routine which returns the preferred cpu which is non-idle.
- Second patch removes idle_cpu() calls from timer & hrtimer.
- Third patch is about adapting this change in workqueue framework.
Earlier discussions over v1 can be found here: http://www.mail-archive.com/linaro-dev@lists.linaro.org/msg13342.html
Earlier discussions over this concept were done at last LPC: http://summit.linuxplumbersconf.org/lpc-2012/meeting/90/lpc2012-sched-timer-...
Module created for testing this behavior is present here: http://git.linaro.org/gitweb?p=people/vireshk/module.git%3Ba=summary
Following are the steps followed in test module:
- Run single work on each cpu
- This work will start a timer after x (tested with 10) jiffies of delay
- Timer routine queues a work (this may be called from an idle or non-idle cpu) and starts the same timer again. This step is repeated n times (i.e. queuing n works, one after the other).
- All works will call a single routine, which will count following per cpu:
- Total works processed by a CPU
- Total works processed by a CPU, which are queued from it
- Total works processed by a CPU, which aren't queued from it
Setup:
- ARM Vexpress TC2 - big.LITTLE CPU
- Core 0-1: A15, 2-4: A7
- rootfs: linaro-ubuntu-nano
Results:
Without Workqueue Modification, i.e. PATCH 3/3:
[ 2493.022335] Workqueue Analyser: works processsed by CPU0, Total: 1000, Own: 0, migrated: 0
[ 2493.047789] Workqueue Analyser: works processsed by CPU1, Total: 1000, Own: 0, migrated: 0
[ 2493.072918] Workqueue Analyser: works processsed by CPU2, Total: 1000, Own: 0, migrated: 0
[ 2493.098576] Workqueue Analyser: works processsed by CPU3, Total: 1000, Own: 0, migrated: 0
[ 2493.123702] Workqueue Analyser: works processsed by CPU4, Total: 1000, Own: 0, migrated: 0
With Workqueue Modification, i.e. PATCH 3/3:
[ 2493.022335] Workqueue Analyser: works processsed by CPU0, Total: 1002, Own: 999, migrated: 3
[ 2493.047789] Workqueue Analyser: works processsed by CPU1, Total: 998, Own: 997, migrated: 1
[ 2493.072918] Workqueue Analyser: works processsed by CPU2, Total: 1013, Own: 996, migrated: 17
[ 2493.098576] Workqueue Analyser: works processsed by CPU3, Total: 998, Own: 993, migrated: 5
[ 2493.123702] Workqueue Analyser: works processsed by CPU4, Total: 989, Own: 987, migrated: 2
V1->V2
- New SD_* macros removed now and earlier ones used
- sched_select_cpu() rewritten and it includes the check on current cpu's idleness.
- idle_cpu() calls from timer and hrtimer removed now.
- Patch 2/3 from V1, removed as it doesn't apply to latest workqueue branch from tejun.
- CONFIG_MIGRATE_WQ removed and so is wq_select_cpu()
- sched_select_cpu() called only from __queue_work()
- got tejun/for-3.7 branch in my tree, before making workqueue changes.
Viresh Kumar (3):
  sched: Create sched_select_cpu() to give preferred CPU for power saving
  timer: hrtimer: Don't check idle_cpu() before calling get_nohz_timer_target()
  workqueue: Schedule work on non-idle cpu instead of current one
Hi Guys,
I totally understand that for the last few weeks you guys were very busy as the merge window was around, so I didn't try to disturb you then :)
Can you please share your viewpoint on this patchset now? And also the running timer migration patch (which was sent separately)?
Thanks in Advance.
-- viresh
On 15 October 2012 10:08, Viresh Kumar viresh.kumar@linaro.org wrote:
I totally understand that for the last few weeks you guys were very busy as the merge window was around, so I didn't try to disturb you then :)
Can you please share your viewpoint on this patchset now? And also the running timer migration patch (which was sent separately)?
Ping!!