 
            In order to save power, it would be useful to schedule light weight work on cpus that aren't IDLE instead of waking up an IDLE one.
By idle cpu (from scheduler's perspective) we mean: - Current task is idle task - nr_running == 0 - wake_list is empty
This is already implemented for timers as get_nohz_timer_target(). We can figure out few more users of this feature, like workqueues.
This patchset converts get_nohz_timer_target() into a generic API sched_select_cpu() so that other frameworks (like workqueue) can also use it.
This routine returns the cpu which is non-idle. It accepts a bitwise OR of SD_* flags present in linux/sched.h. If the local CPU isn't idle OR all cpus are idle, local cpu is returned back. If local cpu is idle, then we must look for another CPU which have all the flags passed as argument as set and isn't idle.
This patchset in first two patches creates generic API sched_select_cpu(). In the third patch we create a new set of APIs for workqueues to queue work on any cpu. All other patches migrate some of the users of workqueues which showed up significantly on my setup. Others can be migrated later.
Earlier discussions over v1 and v2 can be found here: http://www.mail-archive.com/linaro-dev@lists.linaro.org/msg13342.html http://lists.linaro.org/pipermail/linaro-dev/2012-November/014344.html
Earlier discussions over this concept were done at last LPC: http://summit.linuxplumbersconf.org/lpc-2012/meeting/90/lpc2012-sched-timer-...
Setup: ----- - ARM Vexpress TC2 - big.LITTLE CPU - Core 0-1: A15, 2-4: A7 - rootfs: linaro-ubuntu-devel
This patchset has been tested on a big LITTLE system (heterogeneous) but is useful for all other homogeneous systems as well. During these tests audio was played in background using aplay.
Results: -------
Cluster A15 Energy Cluster A7 Energy Total ------------------ ----------------- -----
Without this patchset (Energy in Joules): ---------------------
0.151162 2.183545 2.334707 0.223730 2.687067 2.910797 0.289687 2.732702 3.022389 0.454198 2.745908 3.200106 0.495552 2.746465 3.242017
Average: 0.322866 2.619137 2.942003
With this patchset (Energy in Joules): ---------------------
0.133361 2.267822 2.401183 0.260626 2.833389 3.094015 0.142365 2.277929 2.420294 0.246793 2.582550 2.829343 0.130462 2.257033 2.387495
Average: 0.182721 2.443745 2.626466
Above tests are repeated multiple times and events are tracked using trace-cmd and analysed using kernelshark. And it was easily noticeable that idle time for many cpus has increased considerably, which eventually saved some power.
These patches are applied here for others to test: http://git.linaro.org/gitweb?p=people/vireshk/linux.git%3Ba=shortlog%3Bh=ref...
Viresh Kumar (7): sched: Create sched_select_cpu() to give preferred CPU for power saving timer: hrtimer: Don't check idle_cpu() before calling get_nohz_timer_target() workqueue: Add helpers to schedule work on any cpu PHYLIB: queue work on any cpu mmc: queue work on any cpu block: queue work on any cpu fbcon: queue work on any cpu
block/blk-core.c | 6 +- block/blk-ioc.c | 2 +- block/genhd.c | 9 ++- drivers/mmc/core/core.c | 2 +- drivers/net/phy/phy.c | 9 +-- drivers/video/console/fbcon.c | 2 +- include/linux/sched.h | 21 +++++- include/linux/workqueue.h | 5 ++ kernel/hrtimer.c | 2 +- kernel/sched/core.c | 63 +++++++++------- kernel/timer.c | 2 +- kernel/workqueue.c | 163 +++++++++++++++++++++++++++++------------- 12 files changed, 192 insertions(+), 94 deletions(-)
 
            In order to save power, it would be useful to schedule light weight work on cpus that aren't IDLE instead of waking up an IDLE one.
By idle cpu (from scheduler's perspective) we mean: - Current task is idle task - nr_running == 0 - wake_list is empty
This is already implemented for timers as get_nohz_timer_target(). We can figure out few more users of this feature, like workqueues.
This patch converts get_nohz_timer_target() into a generic API sched_select_cpu() so that other frameworks (like workqueue) can also use it.
This routine returns the cpu which is non-idle. It accepts a bitwise OR of SD_* flags present in linux/sched.h. If the local CPU isn't idle OR all cpus are idle, local cpu is returned back. If local cpu is idle, then we must look for another CPU which have all the flags passed as argument as set and isn't idle.
This patch reuses the code from get_nohz_timer_target() routine, which had similar implementation.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- include/linux/sched.h | 21 +++++++++++++++-- kernel/sched/core.c | 63 +++++++++++++++++++++++++++++---------------------- 2 files changed, 55 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index e20580d..216fa0d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -230,14 +230,31 @@ extern void init_idle_bootup_task(struct task_struct *idle);
extern int runqueue_is_locked(int cpu);
-#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ) +#ifdef CONFIG_SMP +extern int sched_select_cpu(unsigned int sd_flags); + +#ifdef CONFIG_NO_HZ extern void nohz_balance_enter_idle(int cpu); extern void set_cpu_sd_state_idle(void); -extern int get_nohz_timer_target(void); +/* + * In the semi idle case, use the nearest busy cpu for migrating timers + * from an idle cpu. This is good for power-savings. + * + * We don't do similar optimization for completely idle system, as + * selecting an idle cpu will add more delays to the timers than intended + * (as that cpu's timer base may not be uptodate wrt jiffies etc). + */ +#define get_nohz_timer_target() sched_select_cpu(0) #else static inline void nohz_balance_enter_idle(int cpu) { } static inline void set_cpu_sd_state_idle(void) { } + +static inline int sched_select_cpu(unsigned int sd_flags) +{ + return raw_smp_processor_id(); +} #endif +#endif /* CONFIG_SMP */
/* * Only dump TASK_* tasks. (0 for all tasks) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b36635e..ccd76d7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -551,33 +551,6 @@ void resched_cpu(int cpu)
#ifdef CONFIG_NO_HZ /* - * In the semi idle case, use the nearest busy cpu for migrating timers - * from an idle cpu. This is good for power-savings. - * - * We don't do similar optimization for completely idle system, as - * selecting an idle cpu will add more delays to the timers than intended - * (as that cpu's timer base may not be uptodate wrt jiffies etc). - */ -int get_nohz_timer_target(void) -{ - int cpu = smp_processor_id(); - int i; - struct sched_domain *sd; - - rcu_read_lock(); - for_each_domain(cpu, sd) { - for_each_cpu(i, sched_domain_span(sd)) { - if (!idle_cpu(i)) { - cpu = i; - goto unlock; - } - } - } -unlock: - rcu_read_unlock(); - return cpu; -} -/* * When add_timer_on() enqueues a timer into the timer wheel of an * idle CPU then this timer might expire before the next timer event * which is scheduled to wake up that CPU. In case of a completely @@ -648,6 +621,42 @@ void sched_avg_update(struct rq *rq) } }
+/* + * This routine returns the nearest non-idle cpu. It accepts a bitwise OR of + * SD_* flags present in linux/sched.h. If the local CPU isn't idle, it is + * returned back. If it is idle, then we must look for another CPU which have + * all the flags passed as argument as set. + */ +int sched_select_cpu(unsigned int sd_flags) +{ + struct sched_domain *sd; + int cpu = smp_processor_id(); + int i; + + /* If Current cpu isn't idle, don't migrate anything */ + if (!idle_cpu(cpu)) + return cpu; + + rcu_read_lock(); + for_each_domain(cpu, sd) { + /* If sd doesnt' have sd_flags set skip sd. */ + if ((sd->flags & sd_flags) != sd_flags) + continue; + + for_each_cpu(i, sched_domain_span(sd)) { + if (i == cpu) + continue; + if (!idle_cpu(i)) { + cpu = i; + goto unlock; + } + } + } +unlock: + rcu_read_unlock(); + return cpu; +} + #else /* !CONFIG_SMP */ void resched_task(struct task_struct *p) {
 
            2013/3/18 Viresh Kumar viresh.kumar@linaro.org:
In order to save power, it would be useful to schedule light weight work on cpus that aren't IDLE instead of waking up an IDLE one.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
This is already implemented for timers as get_nohz_timer_target(). We can figure out few more users of this feature, like workqueues.
This patch converts get_nohz_timer_target() into a generic API sched_select_cpu() so that other frameworks (like workqueue) can also use it.
This routine returns the cpu which is non-idle. It accepts a bitwise OR of SD_* flags present in linux/sched.h. If the local CPU isn't idle OR all cpus are idle, local cpu is returned back. If local cpu is idle, then we must look for another CPU which have all the flags passed as argument as set and isn't idle.
This patch reuses the code from get_nohz_timer_target() routine, which had similar implementation.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
include/linux/sched.h | 21 +++++++++++++++-- kernel/sched/core.c | 63 +++++++++++++++++++++++++++++---------------------- 2 files changed, 55 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index e20580d..216fa0d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -230,14 +230,31 @@ extern void init_idle_bootup_task(struct task_struct *idle);
extern int runqueue_is_locked(int cpu);
-#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ) +#ifdef CONFIG_SMP +extern int sched_select_cpu(unsigned int sd_flags);
+#ifdef CONFIG_NO_HZ extern void nohz_balance_enter_idle(int cpu); extern void set_cpu_sd_state_idle(void); -extern int get_nohz_timer_target(void); +/*
- In the semi idle case, use the nearest busy cpu for migrating timers
- from an idle cpu. This is good for power-savings.
- We don't do similar optimization for completely idle system, as
- selecting an idle cpu will add more delays to the timers than intended
- (as that cpu's timer base may not be uptodate wrt jiffies etc).
- */
+#define get_nohz_timer_target() sched_select_cpu(0) #else static inline void nohz_balance_enter_idle(int cpu) { } static inline void set_cpu_sd_state_idle(void) { }
+static inline int sched_select_cpu(unsigned int sd_flags) +{
return raw_smp_processor_id();
I feel this should be symetric with the requirement of having preemption disabled as in the CONFIG_NO_HZ version. This should be smp_processor_id().
[...]
@@ -648,6 +621,42 @@ void sched_avg_update(struct rq *rq) } }
+/*
- This routine returns the nearest non-idle cpu. It accepts a bitwise OR of
- SD_* flags present in linux/sched.h. If the local CPU isn't idle, it is
- returned back. If it is idle, then we must look for another CPU which have
- all the flags passed as argument as set.
- */
+int sched_select_cpu(unsigned int sd_flags)
It would be nice to have some more precise naming. sched_select_busy_cpu() ?
 
            On Mon, Mar 18, 2013 at 9:09 PM, Frederic Weisbecker fweisbec@gmail.com wrote:
2013/3/18 Viresh Kumar viresh.kumar@linaro.org:
+static inline int sched_select_cpu(unsigned int sd_flags) +{
return raw_smp_processor_id();I feel this should be symetric with the requirement of having preemption disabled as in the CONFIG_NO_HZ version. This should be smp_processor_id().
Yes, my fault.
+int sched_select_cpu(unsigned int sd_flags)
It would be nice to have some more precise naming. sched_select_busy_cpu() ?
You are correct that name can be improved but busy may give it a different meaning. Maybe sched_select_non_idle_cpu()?
Thanks for your instantaneous review :)
 
            2013/3/18 Viresh Kumar viresh.kumar@linaro.org:
On Mon, Mar 18, 2013 at 9:09 PM, Frederic Weisbecker fweisbec@gmail.com wrote:
2013/3/18 Viresh Kumar viresh.kumar@linaro.org:
+static inline int sched_select_cpu(unsigned int sd_flags) +{
return raw_smp_processor_id();I feel this should be symetric with the requirement of having preemption disabled as in the CONFIG_NO_HZ version. This should be smp_processor_id().
Yes, my fault.
+int sched_select_cpu(unsigned int sd_flags)
It would be nice to have some more precise naming. sched_select_busy_cpu() ?
You are correct that name can be improved but busy may give it a different meaning. Maybe sched_select_non_idle_cpu()?
That works too yeah.
Thanks.
 
            On 18 March 2013 21:27, Frederic Weisbecker fweisbec@gmail.com wrote:
2013/3/18 Viresh Kumar viresh.kumar@linaro.org:
It would be nice to have some more precise naming. sched_select_busy_cpu() ?
You are correct that name can be improved but busy may give it a different meaning. Maybe sched_select_non_idle_cpu()?
That works too yeah.
Here is the fixup that i applied (it fixes another stupid mistake in sched.h). I have pushed this series here again:
http://git.linaro.org/gitweb?p=people/vireshk/linux.git%3Ba=shortlog%3Bh=ref...
commit e769ff0d78fd2fb504ae056a44b70bba7c259126 Author: Viresh Kumar viresh.kumar@linaro.org Date: Fri Sep 14 15:13:05 2012 +0530
fixup! sched: Create sched_select_cpu() to give preferred CPU for power saving
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- include/linux/sched.h | 19 ++++++++++--------- kernel/sched/core.c | 2 +- 2 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 216fa0d..db6655a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -231,11 +231,18 @@ extern void init_idle_bootup_task(struct task_struct *idle); extern int runqueue_is_locked(int cpu);
#ifdef CONFIG_SMP -extern int sched_select_cpu(unsigned int sd_flags); +extern int sched_select_non_idle_cpu(unsigned int sd_flags); +#else +static inline int sched_select_non_idle_cpu(unsigned int sd_flags) +{ + return smp_processor_id(); +} +#endif
-#ifdef CONFIG_NO_HZ +#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ) extern void nohz_balance_enter_idle(int cpu); extern void set_cpu_sd_state_idle(void); + /* * In the semi idle case, use the nearest busy cpu for migrating timers * from an idle cpu. This is good for power-savings. @@ -244,17 +251,11 @@ extern void set_cpu_sd_state_idle(void); * selecting an idle cpu will add more delays to the timers than intended * (as that cpu's timer base may not be uptodate wrt jiffies etc). */ -#define get_nohz_timer_target() sched_select_cpu(0) +#define get_nohz_timer_target() sched_select_non_idle_cpu(0) #else static inline void nohz_balance_enter_idle(int cpu) { } static inline void set_cpu_sd_state_idle(void) { } - -static inline int sched_select_cpu(unsigned int sd_flags) -{ - return raw_smp_processor_id(); -} #endif -#endif /* CONFIG_SMP */
/* * Only dump TASK_* tasks. (0 for all tasks) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ccd76d7..0265a5e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -627,7 +627,7 @@ void sched_avg_update(struct rq *rq) * returned back. If it is idle, then we must look for another CPU which have * all the flags passed as argument as set. */ -int sched_select_cpu(unsigned int sd_flags) +int sched_select_non_idle_cpu(unsigned int sd_flags) { struct sched_domain *sd; int cpu = smp_processor_id();
 
            On Mon, 2013-03-18 at 20:53 +0530, Viresh Kumar wrote:
+/*
- This routine returns the nearest non-idle cpu. It accepts a
bitwise OR of
- SD_* flags present in linux/sched.h. If the local CPU isn't idle,
it is
- returned back. If it is idle, then we must look for another CPU
which have
- all the flags passed as argument as set.
- */
+int sched_select_cpu(unsigned int sd_flags)
So the only problem I have is that you expose the SD_flags to !sched code. The only proposed user is in the workqueue code and passes 0.
Why not leave out the sd_flags argument and introduce it once you start using it; at which point we can argue on the interface.
 
            On 19 March 2013 18:00, Peter Zijlstra peterz@infradead.org wrote:
On Mon, 2013-03-18 at 20:53 +0530, Viresh Kumar wrote:
+/*
- This routine returns the nearest non-idle cpu. It accepts a
bitwise OR of
- SD_* flags present in linux/sched.h. If the local CPU isn't idle,
it is
- returned back. If it is idle, then we must look for another CPU
which have
- all the flags passed as argument as set.
- */
+int sched_select_cpu(unsigned int sd_flags)
So the only problem I have is that you expose the SD_flags to !sched code. The only proposed user is in the workqueue code and passes 0.
Why not leave out the sd_flags argument and introduce it once you start using it; at which point we can argue on the interface.
Yes, that can be done. Will fix it up.
 
            On 19 March 2013 18:22, Viresh Kumar viresh.kumar@linaro.org wrote:
On 19 March 2013 18:00, Peter Zijlstra peterz@infradead.org wrote:
Why not leave out the sd_flags argument and introduce it once you start using it; at which point we can argue on the interface.
Yes, that can be done. Will fix it up.
Fixup:
commit 77927939224520cbb0ac47270d3458bedffe42c4 Author: Viresh Kumar viresh.kumar@linaro.org Date: Tue Mar 19 18:50:37 2013 +0530
fixup! sched: Create sched_select_non_idle_cpu() to give preferred CPU for power saving --- include/linux/sched.h | 6 +++--- kernel/sched/core.c | 6 +----- 2 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index db6655a..37eb1dd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -231,9 +231,9 @@ extern void init_idle_bootup_task(struct task_struct *idle); extern int runqueue_is_locked(int cpu);
#ifdef CONFIG_SMP -extern int sched_select_non_idle_cpu(unsigned int sd_flags); +extern int sched_select_non_idle_cpu(void); #else -static inline int sched_select_non_idle_cpu(unsigned int sd_flags) +static inline int sched_select_non_idle_cpu(void) { return smp_processor_id(); } @@ -251,7 +251,7 @@ extern void set_cpu_sd_state_idle(void); * selecting an idle cpu will add more delays to the timers than intended * (as that cpu's timer base may not be uptodate wrt jiffies etc). */ -#define get_nohz_timer_target() sched_select_non_idle_cpu(0) +#define get_nohz_timer_target() sched_select_non_idle_cpu() #else static inline void nohz_balance_enter_idle(int cpu) { } static inline void set_cpu_sd_state_idle(void) { } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0265a5e..f597d2b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -627,7 +627,7 @@ void sched_avg_update(struct rq *rq) * returned back. If it is idle, then we must look for another CPU which have * all the flags passed as argument as set. */ -int sched_select_non_idle_cpu(unsigned int sd_flags) +int sched_select_non_idle_cpu(void) { struct sched_domain *sd; int cpu = smp_processor_id(); @@ -639,10 +639,6 @@ int sched_select_non_idle_cpu(unsigned int sd_flags)
rcu_read_lock(); for_each_domain(cpu, sd) { - /* If sd doesnt' have sd_flags set skip sd. */ - if ((sd->flags & sd_flags) != sd_flags) - continue; - for_each_cpu(i, sched_domain_span(sd)) { if (i == cpu) continue;
 
            Check for current cpu's idleness is already done in implementation of sched_select_cpu() which is called by get_nohz_timer_target(). So, no need to call idle_cpu() twice, once from sched_select_cpu() and once from timer and hrtimer before calling get_nohz_timer_target().
This patch removes calls to idle_cpu() from timer and hrtimer.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- kernel/hrtimer.c | 2 +- kernel/timer.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index cc47812..c56668d 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -161,7 +161,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer, static int hrtimer_get_target(int this_cpu, int pinned) { #ifdef CONFIG_NO_HZ - if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu)) + if (!pinned && get_sysctl_timer_migration()) return get_nohz_timer_target(); #endif return this_cpu; diff --git a/kernel/timer.c b/kernel/timer.c index dbf7a78..83a0d3c 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -739,7 +739,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, cpu = smp_processor_id();
#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP) - if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) + if (!pinned && get_sysctl_timer_migration()) cpu = get_nohz_timer_target(); #endif new_base = per_cpu(tvec_bases, cpu);
 
            queue_work() queues work on current cpu. This may wake up an idle CPU, which is actually not required.
Some of these works can be processed by any CPU and so we must select a non-idle CPU here. The initial idea was to modify implementation of queue_work(), but that may end up breaking lots of kernel code that would be a nightmare to debug.
So, we finalized to adding new workqueue interfaces, for works that don't depend on a cpu to execute them.
This patch adds following new routines: - queue_work_on_any_cpu - queue_delayed_work_on_any_cpu
These routines would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- include/linux/workqueue.h | 5 ++ kernel/workqueue.c | 163 ++++++++++++++++++++++++++++++++-------------- 2 files changed, 118 insertions(+), 50 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index df30763..f0f7068 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -114,6 +114,7 @@ struct delayed_work { /* target workqueue and CPU ->timer uses to queue ->work */ struct workqueue_struct *wq; int cpu; + bool on_any_cpu; };
/* @@ -418,10 +419,14 @@ int apply_workqueue_attrs(struct workqueue_struct *wq, extern bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work); extern bool queue_work(struct workqueue_struct *wq, struct work_struct *work); +extern bool queue_work_on_any_cpu(struct workqueue_struct *wq, + struct work_struct *work); extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *work, unsigned long delay); extern bool queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *work, unsigned long delay); +extern bool queue_delayed_work_on_any_cpu(struct workqueue_struct *wq, + struct delayed_work *dwork, unsigned long delay); extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay); extern bool mod_delayed_work(struct workqueue_struct *wq, diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 0e4fa1d..cf9c570 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1215,7 +1215,7 @@ static bool is_chained_work(struct workqueue_struct *wq) }
static void __queue_work(int cpu, struct workqueue_struct *wq, - struct work_struct *work) + struct work_struct *work, bool on_any_cpu) { struct pool_workqueue *pwq; struct worker_pool *last_pool; @@ -1240,8 +1240,13 @@ static void __queue_work(int cpu, struct workqueue_struct *wq, retry: /* pwq which will be used unless @work is executing elsewhere */ if (!(wq->flags & WQ_UNBOUND)) { - if (cpu == WORK_CPU_UNBOUND) - cpu = raw_smp_processor_id(); + if (cpu == WORK_CPU_UNBOUND) { + if (on_any_cpu) + cpu = sched_select_cpu(0); + else + cpu = raw_smp_processor_id(); + } + pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); } else { pwq = first_pwq(wq); @@ -1315,6 +1320,22 @@ retry: spin_unlock(&pwq->pool->lock); }
+static bool __queue_work_on(int cpu, struct workqueue_struct *wq, + struct work_struct *work, bool on_any_cpu) +{ + bool ret = false; + unsigned long flags; + + local_irq_save(flags); + + if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { + __queue_work(cpu, wq, work, on_any_cpu); + ret = true; + } + + local_irq_restore(flags); + return ret; +} /** * queue_work_on - queue work on specific cpu * @cpu: CPU number to execute work on @@ -1329,18 +1350,7 @@ retry: bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work) { - bool ret = false; - unsigned long flags; - - local_irq_save(flags); - - if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { - __queue_work(cpu, wq, work); - ret = true; - } - - local_irq_restore(flags); - return ret; + return __queue_work_on(cpu, wq, work, false); } EXPORT_SYMBOL_GPL(queue_work_on);
@@ -1356,21 +1366,38 @@ EXPORT_SYMBOL_GPL(queue_work_on); */ bool queue_work(struct workqueue_struct *wq, struct work_struct *work) { - return queue_work_on(WORK_CPU_UNBOUND, wq, work); + return __queue_work_on(WORK_CPU_UNBOUND, wq, work, false); } EXPORT_SYMBOL_GPL(queue_work);
+/** + * queue_work_on_any_cpu - queue work on any cpu on a workqueue + * @wq: workqueue to use + * @work: work to queue + * + * Returns %false if @work was already on a queue, %true otherwise. + * + * We queue the work to any non-idle (from schedulers perspective) cpu. + */ +bool queue_work_on_any_cpu(struct workqueue_struct *wq, + struct work_struct *work) +{ + return __queue_work_on(WORK_CPU_UNBOUND, wq, work, true); +} +EXPORT_SYMBOL_GPL(queue_work_on_any_cpu); + void delayed_work_timer_fn(unsigned long __data) { struct delayed_work *dwork = (struct delayed_work *)__data;
/* should have been called from irqsafe timer with irq already off */ - __queue_work(dwork->cpu, dwork->wq, &dwork->work); + __queue_work(dwork->cpu, dwork->wq, &dwork->work, dwork->on_any_cpu); } EXPORT_SYMBOL(delayed_work_timer_fn);
static void __queue_delayed_work(int cpu, struct workqueue_struct *wq, - struct delayed_work *dwork, unsigned long delay) + struct delayed_work *dwork, unsigned long delay, + bool on_any_cpu) { struct timer_list *timer = &dwork->timer; struct work_struct *work = &dwork->work; @@ -1387,7 +1414,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq, * on that there's no such delay when @delay is 0. */ if (!delay) { - __queue_work(cpu, wq, &dwork->work); + __queue_work(cpu, wq, &dwork->work, on_any_cpu); return; }
@@ -1395,6 +1422,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
dwork->wq = wq; dwork->cpu = cpu; + dwork->on_any_cpu = on_any_cpu; timer->expires = jiffies + delay;
if (unlikely(cpu != WORK_CPU_UNBOUND)) @@ -1403,19 +1431,9 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq, add_timer(timer); }
-/** - * queue_delayed_work_on - queue work on specific CPU after delay - * @cpu: CPU number to execute work on - * @wq: workqueue to use - * @dwork: work to queue - * @delay: number of jiffies to wait before queueing - * - * Returns %false if @work was already on a queue, %true otherwise. If - * @delay is zero and @dwork is idle, it will be scheduled for immediate - * execution. - */ -bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, - struct delayed_work *dwork, unsigned long delay) +static bool __queue_delayed_work_on(int cpu, struct workqueue_struct *wq, + struct delayed_work *dwork, unsigned long delay, + bool on_any_cpu) { struct work_struct *work = &dwork->work; bool ret = false; @@ -1425,13 +1443,30 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, local_irq_save(flags);
if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { - __queue_delayed_work(cpu, wq, dwork, delay); + __queue_delayed_work(cpu, wq, dwork, delay, on_any_cpu); ret = true; }
local_irq_restore(flags); return ret; } + +/** + * queue_delayed_work_on - queue work on specific CPU after delay + * @cpu: CPU number to execute work on + * @wq: workqueue to use + * @dwork: work to queue + * @delay: number of jiffies to wait before queueing + * + * Returns %false if @work was already on a queue, %true otherwise. If + * @delay is zero and @dwork is idle, it will be scheduled for immediate + * execution. + */ +bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, + struct delayed_work *dwork, unsigned long delay) +{ + return __queue_delayed_work_on(cpu, wq, dwork, delay, false); +} EXPORT_SYMBOL_GPL(queue_delayed_work_on);
/** @@ -1445,11 +1480,50 @@ EXPORT_SYMBOL_GPL(queue_delayed_work_on); bool queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay) { - return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay); + return __queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay, + false); } EXPORT_SYMBOL_GPL(queue_delayed_work);
/** + * queue_delayed_work_on_any_cpu - queue work on any non-idle cpu on a workqueue + * after delay + * @wq: workqueue to use + * @dwork: delayable work to queue + * @delay: number of jiffies to wait before queueing + * + * Equivalent to queue_delayed_work() but tries to use any non-idle (from + * schedulers perspective) CPU. + */ +bool queue_delayed_work_on_any_cpu(struct workqueue_struct *wq, + struct delayed_work *dwork, unsigned long delay) +{ + return __queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay, + true); +} +EXPORT_SYMBOL_GPL(queue_delayed_work_on_any_cpu); + +static bool __mod_delayed_work_on(int cpu, struct workqueue_struct *wq, + struct delayed_work *dwork, unsigned long delay, + bool on_any_cpu) +{ + unsigned long flags; + int ret; + + do { + ret = try_to_grab_pending(&dwork->work, true, &flags); + } while (unlikely(ret == -EAGAIN)); + + if (likely(ret >= 0)) { + __queue_delayed_work(cpu, wq, dwork, delay, on_any_cpu); + local_irq_restore(flags); + } + + /* -ENOENT from try_to_grab_pending() becomes %true */ + return ret; +} + +/** * mod_delayed_work_on - modify delay of or queue a delayed work on specific CPU * @cpu: CPU number to execute work on * @wq: workqueue to use @@ -1470,20 +1544,7 @@ EXPORT_SYMBOL_GPL(queue_delayed_work); bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay) { - unsigned long flags; - int ret; - - do { - ret = try_to_grab_pending(&dwork->work, true, &flags); - } while (unlikely(ret == -EAGAIN)); - - if (likely(ret >= 0)) { - __queue_delayed_work(cpu, wq, dwork, delay); - local_irq_restore(flags); - } - - /* -ENOENT from try_to_grab_pending() becomes %true */ - return ret; + return __mod_delayed_work_on(cpu, wq, dwork, delay, false); } EXPORT_SYMBOL_GPL(mod_delayed_work_on);
@@ -1498,7 +1559,8 @@ EXPORT_SYMBOL_GPL(mod_delayed_work_on); bool mod_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay) { - return mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay); + return __mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay, + dwork->on_any_cpu); } EXPORT_SYMBOL_GPL(mod_delayed_work);
@@ -2952,7 +3014,8 @@ bool flush_delayed_work(struct delayed_work *dwork) { local_irq_disable(); if (del_timer_sync(&dwork->timer)) - __queue_work(dwork->cpu, dwork->wq, &dwork->work); + __queue_work(dwork->cpu, dwork->wq, &dwork->work, + dwork->on_any_cpu); local_irq_enable(); return flush_work(&dwork->work); }
 
            On 18 March 2013 20:53, Viresh Kumar viresh.kumar@linaro.org wrote:
queue_work() queues work on current cpu. This may wake up an idle CPU, which is actually not required.
Some of these works can be processed by any CPU and so we must select a non-idle CPU here. The initial idea was to modify implementation of queue_work(), but that may end up breaking lots of kernel code that would be a nightmare to debug.
So, we finalized to adding new workqueue interfaces, for works that don't depend on a cpu to execute them.
Fixup:
commit 6e5b6fac3353f5e4e5490931b3eb6d3fd7864ca0 Author: Viresh Kumar viresh.kumar@linaro.org Date: Tue Mar 19 10:34:20 2013 +0530
fixup! workqueue: Add helpers to schedule work on any cpu --- kernel/workqueue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index cf9c570..68daf50 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1242,7 +1242,7 @@ retry: if (!(wq->flags & WQ_UNBOUND)) { if (cpu == WORK_CPU_UNBOUND) { if (on_any_cpu) - cpu = sched_select_cpu(0); + cpu = sched_select_non_idle_cpu(0); else cpu = raw_smp_processor_id(); }
 
            On 19 March 2013 10:45, Viresh Kumar viresh.kumar@linaro.org wrote:
On 18 March 2013 20:53, Viresh Kumar viresh.kumar@linaro.org wrote:
queue_work() queues work on current cpu. This may wake up an idle CPU, which is actually not required.
Some of these works can be processed by any CPU and so we must select a non-idle CPU here. The initial idea was to modify implementation of queue_work(), but that may end up breaking lots of kernel code that would be a nightmare to debug.
So, we finalized to adding new workqueue interfaces, for works that don't depend on a cpu to execute them.
Another fixup:
commit 8753c6d936faa6e3233cbf44a55913d05de05683 Author: Viresh Kumar viresh.kumar@linaro.org Date: Tue Mar 19 18:50:59 2013 +0530
fixup! workqueue: Add helpers to schedule work on any cpu --- kernel/workqueue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 68daf50..4e023ab 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1242,7 +1242,7 @@ retry: if (!(wq->flags & WQ_UNBOUND)) { if (cpu == WORK_CPU_UNBOUND) { if (on_any_cpu) - cpu = sched_select_non_idle_cpu(0); + cpu = sched_select_non_idle_cpu(); else cpu = raw_smp_processor_id(); }
 
            Hello, Viresh.
On Mon, Mar 18, 2013 at 08:53:25PM +0530, Viresh Kumar wrote:
queue_work() queues work on current cpu. This may wake up an idle CPU, which is actually not required.
Some of these works can be processed by any CPU and so we must select a non-idle CPU here. The initial idea was to modify implementation of queue_work(), but that may end up breaking lots of kernel code that would be a nightmare to debug.
So, we finalized to adding new workqueue interfaces, for works that don't depend on a cpu to execute them.
This patch adds following new routines:
- queue_work_on_any_cpu
- queue_delayed_work_on_any_cpu
These routines would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
For some reason, I've been always unsure about this patchset. I've been thinking about it past few days and I think I now know why.
I can see that strategy like this being useful for timer, and impressive power saving numbers BTW - we definitely want that. While workqueue shares a lot of attributes with timers, there's one important difference - scheduler is involved in work item executions.
Work item scheduling for per-cpu work items artificially removes the choice of CPU from the scheduler with the assumption that such restrictions would lead to lower overall overhead. There is another variant of workqueue - WQ_UNBOUND - when such assumption wouldn't hold too well and it would be best to give the scheduler full latitude regarding scheduling work item execution including choice of CPU.
Now, this patchset, kinda sorta adds back CPU selection by scheduler in an ad-hoc way, right? If we're gonna do that, why not just make it unbound workqueues? Then, the scheduler would have full discretion over them and we don't need to implement this separate ad-hoc thing on the side. It's the roundaboutness that bothers me.
I'm not sure about other workqueues but the block one can be very performance sensitive on machines with high IOPS and it would definitely be advisable to keep it CPU-affine unless power consumption is the overriding concern, so how about introducing another workqueue flag, say WQ_UNBOUND_FOR_POWER (it's terrible, please come up with something better), which is WQ_UNBOUND on configurations where this matters and becomes noop if not?
Thanks.
 
            On 21 March 2013 05:42, Tejun Heo tj@kernel.org wrote:
On Mon, Mar 18, 2013 at 08:53:25PM +0530, Viresh Kumar wrote:
For some reason, I've been always unsure about this patchset. I've been thinking about it past few days and I think I now know why.
I knew this since the beginning and thought power numbers i showed might change your view. My hard luck :)
I can see that strategy like this being useful for timer, and impressive power saving numbers BTW - we definitely want that. While workqueue shares a lot of attributes with timers, there's one important difference - scheduler is involved in work item executions.
Yes, scheduler is involved for queuing works queued on a UNBOUND wq.
Work item scheduling for per-cpu work items artificially removes the choice of CPU from the scheduler with the assumption that such restrictions would lead to lower overall overhead.
Yes.. We aren't touching them here and we can't :)
There is another variant of workqueue - WQ_UNBOUND - when such assumption wouldn't hold too well and it would be best to give the scheduler full latitude regarding scheduling work item execution including choice of CPU.
I liked this point of yours and what you said is correct too. But there is a difference here with our approach.
I here see two parallel solutions:
1. What we proposed: - Add another wq api and used sched_select_non_idle_cpu() to choose the right cpu - Fix user drivers we care about
2. What you proposed: - Fix user drivers we care about by using UNBOUNDED workqueues rather than system_wq or other private wq's. And let the scheduler do everything. Right?
Now there are two things i am worried about.
A. About the difference in behavior of scheduler and sched_select_non_idle_cpu(). Scheduler will treat this work as any normal task and can schedule it anywhere based on what it feels is the right place and may still wake up an idle one for load balancing.
In our approach, We aren't switching CPU for a work if that cpu (or system) is busy enough, like in case of high IOPS for block layer. We will find that cpu as busy, most of the time on such situations and so we will stay where we were.
In case system is fairly idle, then also we stay on the same cpu. The only time we migrate is when current cpu is idle but the whole system isn't.
B. It is sort of difficult to teach users about BOUND and UNBOUND workqueues and people have used them without thinking too much about them since some time. We need a straightforward API to make this more transparent. And queue_work_on_any_cpu() was a good candidate here. I am not talking about its implementation but about the interface.
Though in both suggestions we need to fix user drivers one by one to use queue_work_on_any_cpu() or use WQ_UNBOUND..
Now, this patchset, kinda sorta adds back CPU selection by scheduler in an ad-hoc way, right?
Yes.
If we're gonna do that, why not just make it unbound workqueues? Then, the scheduler would have full discretion over them and we don't need to implement this separate ad-hoc thing on the side. It's the roundaboutness that bothers me.
I had my own reasons as explained earlier.
I'm not sure about other workqueues but the block one can be very performance sensitive on machines with high IOPS and it would definitely be advisable to keep it CPU-affine unless power consumption is the overriding concern, so how about introducing another workqueue flag, say WQ_UNBOUND_FOR_POWER (it's terrible, please come up with something better), which is WQ_UNBOUND on configurations where this matters and becomes noop if not?
Maybe people wouldn't suffer much because of the reasons i told you earlier. On a busy system we will never switch cpus.
This static configuration might not work well with multiplatform images.
-- viresh
 
            Hey, Viresh.
On Thu, Mar 21, 2013 at 04:27:23PM +0530, Viresh Kumar wrote:
For some reason, I've been always unsure about this patchset. I've been thinking about it past few days and I think I now know why.
I knew this since the beginning and thought power numbers i showed might change your view. My hard luck :)
Heh, sorry about the trouble but, yeah, your numbers are impressive and no matter how this discussion turns out, we're gonna have it in some form, so please don't think that your numbers didn't change anything. Actually, to me, it seems that identifying which workqueues are good candidates for conversion and determining how much powersave can be gained are 90% of the work.
Now there are two things i am worried about.
A. About the difference in behavior of scheduler and sched_select_non_idle_cpu(). Scheduler will treat this work as any normal task and can schedule it anywhere based on what it feels is the right place and may still wake up an idle one for load balancing.
In our approach, We aren't switching CPU for a work if that cpu (or system) is busy enough, like in case of high IOPS for block layer. We will find that cpu as busy, most of the time on such situations and so we will stay where we were.
In case system is fairly idle, then also we stay on the same cpu. The only time we migrate is when current cpu is idle but the whole system isn't.
Yes, I actually like that part a lot although I do wish the idle check was inlined.
What I'm wondering is whether the kinda out-of-band decision via sched_select_cpu() is justified given that it can and is likely to go through full scheduling decision anyway. For timer, we don't have that, so it makes sense. For work items, it's a bit different.
To rephrase, *if* the scheduler can't already make proper decisions regarding power consumption on an idlish system, maybe it can be improved to do so? It could as well be that this CPU selection is special enough that it's just better to keep it separate as this patchset proposes. This is something up to the scheduler people. Peter, Ingo, what do you guys think?
B. It is sort of difficult to teach users about BOUND and UNBOUND workqueues and people have used them without thinking too much about them since some time. We need a straightforward API to make this more transparent. And queue_work_on_any_cpu() was a good candidate here. I am not talking about its implementation but about the interface.
Though in both suggestions we need to fix user drivers one by one to use queue_work_on_any_cpu() or use WQ_UNBOUND..
I don't know. Workqueues are attribute domains and I kinda dislike having a completely separate set of interface for this. Regardless of how the backend gets implemented, I'd really much prefer it being implemented as a workqueue attribute rather than another set of queueing interface functions.
Maybe people wouldn't suffer much because of the reasons i told you earlier. On a busy system we will never switch cpus.
This static configuration might not work well with multiplatform images.
Well, it doesn't have to be a compile time thing. It can be dynamically switched with, say, static_branch(), but I think that's a minor point at this point.
Thanks.
 
            On Thu, Mar 21, 2013 at 11:29:37AM -0700, Tejun Heo wrote:
Yes, I actually like that part a lot although I do wish the idle check was inlined.
What I'm wondering is whether the kinda out-of-band decision via sched_select_cpu() is justified given that it can and is likely to go through full scheduling decision anyway. For timer, we don't have that, so it makes sense. For work items, it's a bit different.
To rephrase, *if* the scheduler can't already make proper decisions regarding power consumption on an idlish system, maybe it can be improved to do so? It could as well be that this CPU selection is special enough that it's just better to keep it separate as this patchset proposes. This is something up to the scheduler people. Peter, Ingo, what do you guys think?
Ping. Peter, Ingo?
Viresh, would it be difficult to make another measurement of the same workload with the said workqueues converted to unbound? I think that would at least provide a nice reference point.
Thanks.
 
            On 28 March 2013 23:43, Tejun Heo tj@kernel.org wrote:
Ping. Peter, Ingo?
Thanks for your ping, even i was thinking of it since sometime :)
Viresh, would it be difficult to make another measurement of the same workload with the said workqueues converted to unbound? I think that would at least provide a nice reference point.
I will try.
 
            On 28 March 2013 23:43, Tejun Heo tj@kernel.org wrote:
Viresh, would it be difficult to make another measurement of the same workload with the said workqueues converted to unbound? I think that would at least provide a nice reference point.
My heart is shattered after getting the results :)
Cluster A15 Energy Cluster A7 Energy Total ------------------------- ----------------------- ------
Without this patchset (Energy in Joules): ---------------------------------------------------
0.151162 2.183545 2.334707 0.223730 2.687067 2.910797 0.289687 2.732702 3.022389 0.454198 2.745908 3.200106 0.495552 2.746465 3.242017
Average: 0.322866 2.619137 2.942003
With this patchset (Energy in Joules): -----------------------------------------------
0.133361 2.267822 2.401183 0.260626 2.833389 3.094015 0.142365 2.277929 2.420294 0.246793 2.582550 2.829343 0.130462 2.257033 2.387495
Average: 0.182721 2.443745 2.626466
With WQ Unbound (Energy in Joules): -----------------------------------------------
0.226421 2.283658 2.510079 0.151361 2.236656 2.388017 0.197726 2.249849 2.447575 0.221915 2.229446 2.451361 0.347098 2.257707 2.604805
Average: 0.2289042 2.2514632 2.4803674
PS: I have to add another generic workqueue for these tests: system_unbound_freezable_wq as block layer was using system_freezable_wq.
-- viresh
 
            Hello, Viresh.
On Fri, Mar 29, 2013 at 12:57:23PM +0530, Viresh Kumar wrote:
On 28 March 2013 23:43, Tejun Heo tj@kernel.org wrote:
Viresh, would it be difficult to make another measurement of the same workload with the said workqueues converted to unbound? I think that would at least provide a nice reference point.
My heart is shattered after getting the results :)
Cluster A15 Energy Cluster A7 Energy Total
Without this patchset (Energy in Joules):
0.322866 2.619137 2.942003
With this patchset (Energy in Joules):
0.182721 2.443745 2.626466
With WQ Unbound (Energy in Joules):
0.2289042 2.2514632 2.4803674
So the scheduler does know what it's doing after all, which is a nice news. Given the result, the best thing to do would be somehow marking these workqueues as unbound for powersaving?
Thanks!
 
            On Fri, Mar 29, 2013 at 10:40:03AM -0700, Tejun Heo wrote:
So the scheduler does know what it's doing after all, which is a nice news. Given the result, the best thing to do would be somehow marking these workqueues as unbound for powersaving?
BTW, recent changes to workqueue means that it shouldn't be too difficult to create a combined workqueue which distributes work items to both per-cpu and unbound worker pools depending on cpu_idle(), but, at least for now, I think we can make do with much coarser approach. We can refine it afterwards.
Thanks.
 
            On 29 March 2013 23:10, Tejun Heo tj@kernel.org wrote:
So the scheduler does know what it's doing after all, which is a nice news. Given the result, the best thing to do would be somehow marking these workqueues as unbound for powersaving?
Yes. I am going to do it soon.
 
            Phylib uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean: - Current task is idle task - nr_running == 0 - wake_list is empty
This patch replaces the schedule_work() and schedule_delayed_work() routines with their queue_[delayed_]work_on_any_cpu() siblings with system_wq as parameter.
These routines would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: "David S. Miller" davem@davemloft.net Cc: netdev@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/net/phy/phy.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 298b4c2..a517706 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -439,7 +439,7 @@ void phy_start_machine(struct phy_device *phydev, { phydev->adjust_state = handler;
- schedule_delayed_work(&phydev->state_queue, HZ); + queue_delayed_work_on_any_cpu(system_wq, &phydev->state_queue, HZ); }
/** @@ -527,7 +527,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat) disable_irq_nosync(irq); atomic_inc(&phydev->irq_disable);
- schedule_work(&phydev->phy_queue); + queue_work_on_any_cpu(system_wq, &phydev->phy_queue);
return IRQ_HANDLED; } @@ -682,7 +682,7 @@ static void phy_change(struct work_struct *work)
/* reschedule state queue work to run as soon as possible */ cancel_delayed_work_sync(&phydev->state_queue); - schedule_delayed_work(&phydev->state_queue, 0); + queue_delayed_work_on_any_cpu(system_wq, &phydev->state_queue, 0);
return;
@@ -966,7 +966,8 @@ void phy_state_machine(struct work_struct *work) if (err < 0) phy_error(phydev);
- schedule_delayed_work(&phydev->state_queue, PHY_STATE_TIME * HZ); + queue_delayed_work_on_any_cpu(system_wq, &phydev->state_queue, + PHY_STATE_TIME * HZ); }
static inline void mmd_phy_indirect(struct mii_bus *bus, int prtad, int devad,
 
            From: Viresh Kumar viresh.kumar@linaro.org Date: Mon, 18 Mar 2013 20:53:26 +0530
Phylib uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
This patch replaces the schedule_work() and schedule_delayed_work() routines with their queue_[delayed_]work_on_any_cpu() siblings with system_wq as parameter.
These routines would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: "David S. Miller" davem@davemloft.net Cc: netdev@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
This will need to be applied to whatever tree adds these new interfaces, and for that:
Acked-by: David S. Miller davem@davemloft.net
 
            mmc uses workqueues for running mmc_rescan(). There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean: - Current task is idle task - nr_running == 0 - wake_list is empty
This patch replaces the queue_delayed_work() with queue_delayed_work_on_any_cpu() siblings.
This routine would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: Chris Ball cjb@laptop.org Cc: linux-mmc@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/mmc/core/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index 9290bb5..adf331a 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -85,7 +85,7 @@ MODULE_PARM_DESC( static int mmc_schedule_delayed_work(struct delayed_work *work, unsigned long delay) { - return queue_delayed_work(workqueue, work, delay); + return queue_delayed_work_on_any_cpu(workqueue, work, delay); }
/*
 
            On 18 March 2013 16:23, Viresh Kumar viresh.kumar@linaro.org wrote:
mmc uses workqueues for running mmc_rescan(). There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
This patch replaces the queue_delayed_work() with queue_delayed_work_on_any_cpu() siblings.
This routine would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: Chris Ball cjb@laptop.org Cc: linux-mmc@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
drivers/mmc/core/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index 9290bb5..adf331a 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -85,7 +85,7 @@ MODULE_PARM_DESC( static int mmc_schedule_delayed_work(struct delayed_work *work, unsigned long delay) {
return queue_delayed_work(workqueue, work, delay);
return queue_delayed_work_on_any_cpu(workqueue, work, delay);}
/*
1.7.12.rc2.18.g61b472e
-- To unsubscribe from this list: send the line "unsubscribe linux-mmc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Acked-by: Ulf Hansson ulf.hansson@linaro.org
 
            Hi,
On Mon, Mar 18 2013, Viresh Kumar wrote:
mmc uses workqueues for running mmc_rescan(). There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
This patch replaces the queue_delayed_work() with queue_delayed_work_on_any_cpu() siblings.
This routine would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: Chris Ball cjb@laptop.org Cc: linux-mmc@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
drivers/mmc/core/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index 9290bb5..adf331a 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -85,7 +85,7 @@ MODULE_PARM_DESC( static int mmc_schedule_delayed_work(struct delayed_work *work, unsigned long delay) {
- return queue_delayed_work(workqueue, work, delay);
- return queue_delayed_work_on_any_cpu(workqueue, work, delay);
} /*
Thanks, pushed to mmc-next for 3.10.
- Chris.
 
            Hi,
On Mon, Mar 18 2013, Viresh Kumar wrote:
mmc uses workqueues for running mmc_rescan(). There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
This patch replaces the queue_delayed_work() with queue_delayed_work_on_any_cpu() siblings.
This routine would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: Chris Ball cjb@laptop.org Cc: linux-mmc@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
drivers/mmc/core/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index 9290bb5..adf331a 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -85,7 +85,7 @@ MODULE_PARM_DESC( static int mmc_schedule_delayed_work(struct delayed_work *work, unsigned long delay) {
- return queue_delayed_work(workqueue, work, delay);
- return queue_delayed_work_on_any_cpu(workqueue, work, delay);
} /*
/home/cjb/git/mmc/drivers/mmc/core/core.c: In function ‘mmc_schedule_delayed_work’: /home/cjb/git/mmc/drivers/mmc/core/core.c:88:2: error: implicit declaration of function ‘queue_delayed_work_on_any_cpu’ [-Werror=implicit-function-declaration]
I've dropped this patch for now. This function doesn't seem to be defined in linux-next either.
- Chris.
 
            On 22 March 2013 22:56, Chris Ball cjb@laptop.org wrote:
On Mon, Mar 18 2013, Viresh Kumar wrote:
/home/cjb/git/mmc/drivers/mmc/core/core.c: In function ‘mmc_schedule_delayed_work’: /home/cjb/git/mmc/drivers/mmc/core/core.c:88:2: error: implicit declaration of function ‘queue_delayed_work_on_any_cpu’ [-Werror=implicit-function-declaration]
I've dropped this patch for now. This function doesn't seem to be defined in linux-next either.
Hi chris,
This patch was part of a bigger patchset which also adds this API. I don't want you to apply this one but just Ack here. Probably Tejun or some scheduler maintainer will apply it later (if they like all patches).
 
            Hi,
On Fri, Mar 22 2013, Viresh Kumar wrote:
On 22 March 2013 22:56, Chris Ball cjb@laptop.org wrote:
On Mon, Mar 18 2013, Viresh Kumar wrote:
/home/cjb/git/mmc/drivers/mmc/core/core.c: In function ‘mmc_schedule_delayed_work’: /home/cjb/git/mmc/drivers/mmc/core/core.c:88:2: error: implicit declaration of function ‘queue_delayed_work_on_any_cpu’ [-Werror=implicit-function-declaration]
I've dropped this patch for now. This function doesn't seem to be defined in linux-next either.
Hi chris,
This patch was part of a bigger patchset which also adds this API. I don't want you to apply this one but just Ack here. Probably Tejun or some scheduler maintainer will apply it later (if they like all patches).
Thanks, makes sense. For [5/7]:
Acked-by: Chris Ball cjb@laptop.org
- Chris.
 
            block layer uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean: - Current task is idle task - nr_running == 0 - wake_list is empty
This patch replaces schedule_work() and queue_[delayed_]work() with queue_[delayed_]work_on_any_cpu() siblings.
These routines would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: Jens Axboe axboe@kernel.dk Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- block/blk-core.c | 6 +++--- block/blk-ioc.c | 2 +- block/genhd.c | 9 ++++++--- 3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 186603b..14ae74f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -225,7 +225,7 @@ static void blk_delay_work(struct work_struct *work) void blk_delay_queue(struct request_queue *q, unsigned long msecs) { if (likely(!blk_queue_dead(q))) - queue_delayed_work(kblockd_workqueue, &q->delay_work, + queue_delayed_work_on_any_cpu(kblockd_workqueue, &q->delay_work, msecs_to_jiffies(msecs)); } EXPORT_SYMBOL(blk_delay_queue); @@ -2852,14 +2852,14 @@ EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
int kblockd_schedule_work(struct request_queue *q, struct work_struct *work) { - return queue_work(kblockd_workqueue, work); + return queue_work_on_any_cpu(kblockd_workqueue, work); } EXPORT_SYMBOL(kblockd_schedule_work);
int kblockd_schedule_delayed_work(struct request_queue *q, struct delayed_work *dwork, unsigned long delay) { - return queue_delayed_work(kblockd_workqueue, dwork, delay); + return queue_delayed_work_on_any_cpu(kblockd_workqueue, dwork, delay); } EXPORT_SYMBOL(kblockd_schedule_delayed_work);
diff --git a/block/blk-ioc.c b/block/blk-ioc.c index 9c4bb82..2eefeb1 100644 --- a/block/blk-ioc.c +++ b/block/blk-ioc.c @@ -144,7 +144,7 @@ void put_io_context(struct io_context *ioc) if (atomic_long_dec_and_test(&ioc->refcount)) { spin_lock_irqsave(&ioc->lock, flags); if (!hlist_empty(&ioc->icq_list)) - schedule_work(&ioc->release_work); + queue_work_on_any_cpu(system_wq, &ioc->release_work); else free_ioc = true; spin_unlock_irqrestore(&ioc->lock, flags); diff --git a/block/genhd.c b/block/genhd.c index a1ed52a..4bdb735 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1488,9 +1488,11 @@ static void __disk_unblock_events(struct gendisk *disk, bool check_now) intv = disk_events_poll_jiffies(disk); set_timer_slack(&ev->dwork.timer, intv / 4); if (check_now) - queue_delayed_work(system_freezable_wq, &ev->dwork, 0); + queue_delayed_work_on_any_cpu(system_freezable_wq, &ev->dwork, + 0); else if (intv) - queue_delayed_work(system_freezable_wq, &ev->dwork, intv); + queue_delayed_work_on_any_cpu(system_freezable_wq, &ev->dwork, + intv); out_unlock: spin_unlock_irqrestore(&ev->lock, flags); } @@ -1626,7 +1628,8 @@ static void disk_check_events(struct disk_events *ev,
intv = disk_events_poll_jiffies(disk); if (!ev->block && intv) - queue_delayed_work(system_freezable_wq, &ev->dwork, intv); + queue_delayed_work_on_any_cpu(system_freezable_wq, &ev->dwork, + intv);
spin_unlock_irq(&ev->lock);
 
            On Mon, Mar 18 2013, Viresh Kumar wrote:
block layer uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
This patch replaces schedule_work() and queue_[delayed_]work() with queue_[delayed_]work_on_any_cpu() siblings.
These routines would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
What are the performance implications of this? From that perspective, scheduling on a local core is the better choice. Hopefully it's already running on the local socket of the device, keeping it on that node would be preferable.
For the delayed functions, the timer is typically added on the current CPU. It's hard to know what the state of idleness of CPUs would be when the delay has expired. How are you handling this?
I can see the win from a power consumption point of view, but it quickly gets a little more complicated than that.
 
            On 22 March 2013 20:35, Jens Axboe axboe@kernel.dk wrote:
Hi Jens,
First of all, please have a look at:
1. https://lkml.org/lkml/2013/3/18/364
This is cover-letter of my mail where i have shown power savings achieved with this patchset.
2. https://lkml.org/lkml/2013/3/21/172
This is the link for discussion i had with Tejun which answers some of these questions.
What are the performance implications of this? From that perspective, scheduling on a local core is the better choice. Hopefully it's already running on the local socket of the device, keeping it on that node would be preferable.
If the local cpu is busy or all cpus are idle then we don't migrate, and so for performance shouldn't be affected with it.
For the delayed functions, the timer is typically added on the current CPU.
This is the code that executes here:
if (unlikely(cpu != WORK_CPU_UNBOUND)) add_timer_on(timer, cpu); else add_timer(timer);
Clearly for queue_delayed_work() call we don't use local cpu but we use the same sched_select_non_idle_cpu() routine to give us the right cpu.
It's hard to know what the state of idleness of CPUs would be when the delay has expired. How are you handling this?
Correct, and we are not deciding that right now, but when the timer expires.
Thanks for your feedback.
 
            fbcon uses workqueues and it has no real dependency of scheduling these on the cpu which scheduled them.
On a idle system, it is observed that and idle cpu wakes up few times just to service this work. It would be better if we can schedule it on a cpu which isn't idle to save on power.
By idle cpu (from scheduler's perspective) we mean: - Current task is idle task - nr_running == 0 - wake_list is empty
This patch replaces schedule_work() routine with queue_work_on_any_cpu() sibling with system_wq as workqueue.
This routine would look for the closest (via scheduling domains) non-idle cpu (non-idle from schedulers perspective). If the current cpu is not idle or all cpus are idle, work will be scheduled on local cpu.
Cc: Dave Airlie airlied@redhat.com Cc: linux-fbdev@vger.kernel.org Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/video/console/fbcon.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c index 3cd6759..a900997 100644 --- a/drivers/video/console/fbcon.c +++ b/drivers/video/console/fbcon.c @@ -404,7 +404,7 @@ static void cursor_timer_handler(unsigned long dev_addr) struct fb_info *info = (struct fb_info *) dev_addr; struct fbcon_ops *ops = info->fbcon_par;
- schedule_work(&info->queue); + queue_work_on_any_cpu(system_wq, &info->queue); mod_timer(&ops->cursor_timer, jiffies + HZ/5); }
 
            On 18 March 2013 20:53, Viresh Kumar viresh.kumar@linaro.org wrote:
In order to save power, it would be useful to schedule light weight work on cpus that aren't IDLE instead of waking up an IDLE one.
By idle cpu (from scheduler's perspective) we mean:
- Current task is idle task
- nr_running == 0
- wake_list is empty
Oops!! I forgot the change log:
V2->V3: - Dropped changes into core queue_work() API, rather create *_on_any_cpu() APIs - Dropped running timers migration patch as that was broken - Migrated few users of workqueues to use *_on_any_cpu() APIs.
linaro-kernel@lists.linaro.org







