Hi,
This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing the tasks in as few as possible CPU/Cluster/Core. It has got 2 packing modes: -The 1st mode packs the small tasks when the system is not too busy. The main goal is to reduce the power consumption in the low system load use cases by minimizing the number of power domain that are enabled but it also keeps the default behavior which is performance oriented. -The 2nd mode packs all tasks in as few as possible power domains in order to improve the power consumption of the system but at the cost of possible performance decrease because of the increase of the rate of ressources sharing compared to the default mode.
The packing is done in 3 steps (the last step is only applicable for the agressive packing mode):
The 1st step looks for the best place to pack tasks in a system according to its topology and it defines a 1st pack buddy CPU for each CPU if there is one available. The policy for defining a buddy CPU is that we want to pack at levels where a group of CPU can be power gated independently from others. To describe this capability, a new flag SD_SHARE_POWERDOMAIN has been introduced, that is used to indicate whether the groups of CPUs of a scheduling domain share their power state. By default, this flag is set in all sched_domain in order to keep unchanged the current behavior of the scheduler and only ARM platform clears the SD_SHARE_POWERDOMAIN flag for MC and CPU level.
In a 2nd step, the scheduler checks the load average of a task which wakes up as well as the load average of the buddy CPU and it can decide to migrate the light tasks on a not busy buddy. This check is done during the wake up because small tasks tend to wake up between periodic load balance and asynchronously to each other which prevents the default mechanism to catch and migrate them efficiently. A light task is defined by a runnable_avg_sum that is less than 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones: The average CPU load of the task must be less than 20% and the task must have been runnable less than 10ms when it woke up last time in order to be electable for the packing migration. So, a task than runs 1 ms each 5ms will be considered as a small task but a task that runs 50 ms with a period of 500ms, will not. Then, the business of the buddy CPU depends of the load average for the rq and the number of running tasks. A CPU with a load average greater than 50% will be considered as busy CPU whatever the number of running tasks is and this threshold will be reduced by the number of running tasks in order to not increase too much the wake up latency of a task. When the buddy CPU is busy, the scheduler falls back to default CFS policy.
The 3rd step is only used when the agressive packing mode is enable. In this case, the CPUs pack their tasks in their buddy until they becomes full. Unlike the previous step, we can't keep the same buddy so we update it during load balance. During the periodic load balance, the scheduler computes the activity of the system thanks the runnable_avg_sum and the cpu_power of all CPUs and then it defines the CPUs that will be used to handle the current activity. The selected CPUs will be their own buddy and will participate to the default load balancing mecanism in order to share the tasks in a fair way, whereas the not selected CPUs will not, and their buddy will be the last selected CPU. The behavior can be summarized as: The scheduler defines how many CPUs are required to handle the current activity, keeps the tasks on these CPUS and perform normal load balancing (or any evolution of the current load balancer like the use of runnable load avg from Alex https://lkml.org/lkml/2013/4/1/580) on this limited number of CPUs . Like the other steps, the CPUs are selected to minimize the number of power domain that must stay on.
Change since V3:
- Take into account comments on previous version. - Add an agressive packing mode and a knob to select between the various mode
Change since V2:
- Migrate only a task that wakes up - Change the light tasks threshold to 20% - Change the loaded CPU threshold to not pull tasks if the current number of running tasks is null but the load average is already greater than 50% - Fix the algorithm for selecting the buddy CPU.
Change since V1:
Patch 2/6 - Change the flag name which was not clear. The new name is SD_SHARE_POWERDOMAIN. - Create an architecture dependent function to tune the sched_domain flags Patch 3/6 - Fix issues in the algorithm that looks for the best buddy CPU - Use pr_debug instead of pr_info - Fix for uniprocessor Patch 4/6 - Remove the use of usage_avg_sum which has not been merged Patch 5/6 - Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured Patch 6/6 - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM platform
Previous results for v3:
This series has been tested with hackbench on ARM platform and the results don't show any performance regression
Hackbench 3.9-rc2 +patches Mean Time (10 tests): 2.048 2.015 stdev : 0.047 0.068
Previous results for V2:
This series has been tested with MP3 play back on ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 cluster).
The measurements have been done on an Ubuntu image during 60 seconds of playback and the result has been normalized to 100.
| CA15 | CA7 | total | ------------------------------------- default | 81 | 97 | 178 | pack | 13 | 100 | 113 | -------------------------------------
Previous results for V1:
The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have demonstrated that it's worth packing small tasks at all topology levels.
The performance tests have been done on both platforms with sysbench. The results don't show any performance regressions. These results are aligned with the policy which uses the normal behavior with heavy use cases.
test: sysbench --test=cpu --num-threads=N --max-requests=R run
Results below is the average duration of 3 tests on the quad CA-9. default is the current scheduler behavior (pack buddy CPU is -1) pack is the scheduler with the pack mechanism
| default | pack | ----------------------------------- N=8; R=200 | 3.1999 | 3.1921 | N=8; R=2000 | 31.4939 | 31.4844 | N=12; R=200 | 3.2043 | 3.2084 | N=12; R=2000 | 31.4897 | 31.4831 | N=16; R=200 | 3.1774 | 3.1824 | N=16; R=2000 | 31.4899 | 31.4897 | -----------------------------------
The power consumption tests have been done only on TC2 platform which has got accessible power lines and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
The measurements have been done during 16 seconds and the result has been normalized to 100
| CA15 | CA7 | total | ------------------------------------- default | 100 | 40 | 140 | pack | <1 | 45 | <46 | -------------------------------------
The A15 cluster is less power efficient than the A7 cluster but if we assume that the tasks is well spread on both clusters, we can guest estimate that the power consumption on a dual cluster of CA7 would have been for a default kernel:
| CA7 | CA7 | total | ------------------------------------- default | 40 | 40 | 80 | -------------------------------------
Vincent Guittot (14): Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain sched: pack small tasks sched: pack the idle load balance ARM: sched: clear SD_SHARE_POWERDOMAIN sched: add a knob to choose the packing level sched: agressively pack at wake/fork/exec sched: trig ILB on an idle buddy sched: evaluate the activity level of the system sched: update the buddy CPU sched: filter task pull request sched: create a new field with available capacity sched: update the cpu_power sched: force migration on buddy CPU
arch/arm/kernel/topology.c | 9 + arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 11 +- include/linux/sched/sysctl.h | 8 + include/linux/topology.h | 4 + kernel/sched/core.c | 14 +- kernel/sched/fair.c | 393 +++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 15 +- kernel/sysctl.c | 13 ++ 10 files changed, 423 insertions(+), 46 deletions(-)
This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- include/linux/sched.h | 8 +------- kernel/sched/core.c | 7 +------ kernel/sched/fair.c | 13 ++----------- kernel/sched/sched.h | 9 +-------- 4 files changed, 5 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index d35d2b6..16b0d18 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1159,13 +1159,7 @@ struct sched_entity { /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; #endif - -/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) +#ifdef CONFIG_SMP /* Per-entity load-tracking */ struct sched_avg avg; #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 67d0465..c8db984 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1563,12 +1563,7 @@ static void __sched_fork(struct task_struct *p) p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node);
-/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) +#ifdef CONFIG_SMP p->se.avg.runnable_avg_period = 0; p->se.avg.runnable_avg_sum = 0; #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7a33e59..9c2f726 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq) } #endif /* CONFIG_FAIR_GROUP_SCHED */
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */ -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) +#ifdef CONFIG_SMP /* * We choose a half-life close to 1 scheduling period. * Note: The tables below are dependent on this value. @@ -3394,12 +3393,6 @@ unlock: }
/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#ifdef CONFIG_FAIR_GROUP_SCHED -/* * Called immediately before a task is migrated to a new cpu; task_cpu(p) and * cfs_rq_of(p) references at time of call are still valid and identify the * previous cpu. However, the caller only guarantees p->pi_lock is held; no @@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu) atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load); } } -#endif #endif /* CONFIG_SMP */
static unsigned long @@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SMP .select_task_rq = select_task_rq_fair, -#ifdef CONFIG_FAIR_GROUP_SCHED .migrate_task_rq = migrate_task_rq_fair, -#endif + .rq_online = rq_online_fair, .rq_offline = rq_offline_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index cc03cfd..7f36024f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -227,12 +227,6 @@ struct cfs_rq { #endif
#ifdef CONFIG_SMP -/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#ifdef CONFIG_FAIR_GROUP_SCHED /* * CFS Load tracking * Under CFS, load is tracked on a per-entity basis and aggregated up. @@ -242,8 +236,7 @@ struct cfs_rq { u64 runnable_load_avg, blocked_load_avg; atomic64_t decay_counter, removed_load; u64 last_decay; -#endif /* CONFIG_FAIR_GROUP_SCHED */ -/* These always depend on CONFIG_FAIR_GROUP_SCHED */ + #ifdef CONFIG_FAIR_GROUP_SCHED u32 tg_runnable_contrib; u64 tg_load_contrib;
This new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPUs in a sched_domain level can or not reach a different power state. If groups of cores can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gated the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks so the flag is set into all sched_domains.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Reviewed-by: Morten Rasmussen morten.rasmussen@arm.com --- arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 4 ++++ kernel/sched/core.c | 6 ++++++ 5 files changed, 13 insertions(+)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0)\ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc87102 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0) \ | 0*SD_SERIALIZE \ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 16b0d18..a82cdeb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -809,6 +809,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_CPUPOWER|\ + SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing() \ @@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\ | 0*SD_SERIALIZE \ , \ .last_balance = jiffies, \ @@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0) \ | 0*SD_SERIALIZE \ | 1*SD_PREFER_SIBLING \ , \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c8db984..3b9861f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5940,6 +5940,11 @@ int __weak arch_sd_sibling_asym_packing(void) return 0*SD_ASYM_PACKING; }
+int __weak arch_sd_local_flags(int level) +{ + return 1*SD_SHARE_POWERDOMAIN; +} + /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() @@ -6129,6 +6134,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu) | 0*SD_WAKE_AFFINE | 0*SD_SHARE_CPUPOWER | 0*SD_SHARE_PKG_RESOURCES + | 1*SD_SHARE_POWERDOMAIN | 1*SD_SERIALIZE | 0*SD_PREFER_SIBLING | sd_local_flags(level)
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior.
On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | ----------------------------------- buddy | CPU0 | CPU0 | CPU0 | CPU2 |
If the cores in a cluster can't be power gated independently, the buddy configuration becomes:
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | ----------------------------------- buddy | CPU0 | CPU1 | CPU0 | CPU0 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Reviewed-by: Morten Rasmussen morten.rasmussen@arm.com --- kernel/sched/core.c | 1 + kernel/sched/fair.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 5 ++ 3 files changed, 138 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3b9861f..c5ef170 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5664,6 +5664,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
+ update_packing_domain(cpu); update_top_cache_domain(cpu); }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9c2f726..6adc57c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -160,6 +160,76 @@ void sched_init_granularity(void) update_sysctl(); }
+ +#ifdef CONFIG_SMP +/* + * Save the id of the optimal CPU that should be used to pack small tasks + * The value -1 is used when no buddy has been found + */ +DEFINE_PER_CPU(int, sd_pack_buddy); + +/* + * Look for the best buddy CPU that can be used to pack small tasks + * We make the assumption that it doesn't wort to pack on CPU that share the + * same powerline. We look for the 1st sched_domain without the + * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest + * power per core based on the assumption that their power efficiency is + * better + */ +void update_packing_domain(int cpu) +{ + struct sched_domain *sd; + int id = -1; + + sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN); + if (!sd) + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); + else + sd = sd->parent; + + while (sd && (sd->flags & SD_LOAD_BALANCE) + && !(sd->flags & SD_SHARE_POWERDOMAIN)) { + struct sched_group *sg = sd->groups; + struct sched_group *pack = sg; + struct sched_group *tmp; + + /* + * The sched_domain of a CPU points on the local sched_group + * and this CPU of this local group is a good candidate + */ + id = cpu; + + /* loop the sched groups to find the best one */ + for (tmp = sg->next; tmp != sg; tmp = tmp->next) { + if (tmp->sgp->power * pack->group_weight > + pack->sgp->power * tmp->group_weight) + continue; + + if ((tmp->sgp->power * pack->group_weight == + pack->sgp->power * tmp->group_weight) + && (cpumask_first(sched_group_cpus(tmp)) >= id)) + continue; + + /* we have found a better group */ + pack = tmp; + + /* Take the 1st CPU of the new group */ + id = cpumask_first(sched_group_cpus(pack)); + } + + /* Look for another CPU than itself */ + if (id != cpu) + break; + + sd = sd->parent; + } + + pr_debug("CPU%d packing on CPU%d\n", cpu, id); + per_cpu(sd_pack_buddy, cpu) = id; +} + +#endif /* CONFIG_SMP */ + #if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -3291,6 +3361,64 @@ done: return target; }
+static bool is_buddy_busy(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + u32 sum = rq->avg.runnable_avg_sum; + u32 period = rq->avg.runnable_avg_period; + + /* + * If a CPU accesses the runnable_avg_sum and runnable_avg_period + * fields of its buddy CPU while the latter updates it, it can get the + * new version of a field and the old version of the other one. This + * can generate erroneous decisions. We don't want to use a lock + * mechanism for ensuring the coherency because of the overhead in + * this critical path. + * The runnable_avg_period of a runqueue tends to the max value in + * less than 345ms after plugging a CPU, which implies that we could + * use the max value instead of reading runnable_avg_period after + * 345ms. During the starting phase, we must ensure a minimum of + * coherency between the fields. A simple rule is runnable_avg_sum <= + * runnable_avg_period. + */ + sum = min(sum, period); + + /* + * A busy buddy is a CPU with a high load or a small load with a lot of + * running tasks. + */ + return (sum > (period / (rq->nr_running + 2))); +} + +static bool is_light_task(struct task_struct *p) +{ + /* A light task runs less than 20% in average */ + return ((p->se.avg.runnable_avg_sum * 5) < + (p->se.avg.runnable_avg_period)); +} + +static int check_pack_buddy(int cpu, struct task_struct *p) +{ + int buddy = per_cpu(sd_pack_buddy, cpu); + + /* No pack buddy for this CPU */ + if (buddy == -1) + return false; + + /* buddy is not an allowed CPU */ + if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p))) + return false; + + /* + * If the task is a small one and the buddy is not overloaded, + * we use buddy cpu + */ + if (!is_light_task(p) || is_buddy_busy(buddy)) + return false; + + return true; +} + /* * sched_balance_self: balance the current task (running on cpu) in domains * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and @@ -3319,6 +3447,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) want_affine = 1; new_cpu = prev_cpu; + + /* We pack only at wake up and not new task */ + if (check_pack_buddy(new_cpu, p)) + return per_cpu(sd_pack_buddy, new_cpu); }
rcu_read_lock(); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7f36024f..96b164d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -872,6 +872,7 @@ extern const struct sched_class idle_sched_class;
extern void trigger_load_balance(struct rq *rq, int cpu); extern void idle_balance(int this_cpu, struct rq *this_rq); +extern void update_packing_domain(int cpu);
#else /* CONFIG_SMP */
@@ -879,6 +880,10 @@ static inline void idle_balance(int cpu, struct rq *rq) { }
+static inline void update_packing_domain(int cpu) +{ +} + #endif
extern void sysrq_sched_debug_show(void);
On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior.
On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
If the cores in a cluster can't be power gated independently, the buddy configuration becomes:
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU1 | CPU0 | CPU0 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
So I really don't get the point of this buddy stuff, even for light load non performance impact stuff you want to do.
The moment you judge cpu0 busy you'll bail, even though its perfectly doable (and desirable afaict) to continue stacking light tasks on cpu1 instead of waking up cpu2/3.
So what's wrong with keeping a single light-wake target cpu selection and updating it appropriately?
Also where/how does the nohz balance cpu criteria not match the light-wake target criteria?
On 26 April 2013 14:30, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior.
On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
If the cores in a cluster can't be power gated independently, the buddy configuration becomes:
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU1 | CPU0 | CPU0 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
So I really don't get the point of this buddy stuff, even for light load non performance impact stuff you want to do.
The moment you judge cpu0 busy you'll bail, even though its perfectly doable (and desirable afaict) to continue stacking light tasks on cpu1 instead of waking up cpu2/3.
So what's wrong with keeping a single light-wake target cpu selection and updating it appropriately?
I have tried to follow the same kind of tasks migration as during load balance: 1 CPU in a power domain group migrates tasks with other groups.
Also where/how does the nohz balance cpu criteria not match the light-wake target criteria?
The nohz balance cpu is an idle cpu but it doesn't mean that it's the target cpu which will be sometime busy with the light tasks.
On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu) +{
- struct rq *rq = cpu_rq(cpu);
- u32 sum = rq->avg.runnable_avg_sum;
- u32 period = rq->avg.runnable_avg_period;
- /*
* If a CPU accesses the runnable_avg_sum and runnable_avg_period
* fields of its buddy CPU while the latter updates it, it can get the
* new version of a field and the old version of the other one. This
* can generate erroneous decisions. We don't want to use a lock
* mechanism for ensuring the coherency because of the overhead in
* this critical path.
* The runnable_avg_period of a runqueue tends to the max value in
* less than 345ms after plugging a CPU, which implies that we could
* use the max value instead of reading runnable_avg_period after
* 345ms. During the starting phase, we must ensure a minimum of
* coherency between the fields. A simple rule is runnable_avg_sum <=
* runnable_avg_period.
*/
- sum = min(sum, period);
- /*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
- return (sum > (period / (rq->nr_running + 2)));
+}
I'm still not sold on the entire nr_running thing and the comment doesn't say why you're doing it.
I'm fairly sure there's software out there that wakes a gazillion threads at a time only for a gazillion-1 to go back to sleep immediately. Patterns like that completely defeat your heuristic.
Does that matter... I don't know.
What happens if you keep this thing simple and only put a cap on utilization (say 80%) and drop the entire nr_running magic? Have you seen it make an actual difference or did it just seem like a good (TM) thing to do?
+static bool is_light_task(struct task_struct *p) +{
- /* A light task runs less than 20% in average */
- return ((p->se.avg.runnable_avg_sum * 5) <
(p->se.avg.runnable_avg_period));
+}
There superfluous () and ' ' in there. Also why 20%.. seemed like a good number? Do we want a SCHED_DEBUG sysctl for that?
On 26 April 2013 14:38, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
u32 sum = rq->avg.runnable_avg_sum;
u32 period = rq->avg.runnable_avg_period;
/*
* If a CPU accesses the runnable_avg_sum and runnable_avg_period
* fields of its buddy CPU while the latter updates it, it can get the
* new version of a field and the old version of the other one. This
* can generate erroneous decisions. We don't want to use a lock
* mechanism for ensuring the coherency because of the overhead in
* this critical path.
* The runnable_avg_period of a runqueue tends to the max value in
* less than 345ms after plugging a CPU, which implies that we could
* use the max value instead of reading runnable_avg_period after
* 345ms. During the starting phase, we must ensure a minimum of
* coherency between the fields. A simple rule is runnable_avg_sum <=
* runnable_avg_period.
*/
sum = min(sum, period);
/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return (sum > (period / (rq->nr_running + 2)));
+}
I'm still not sold on the entire nr_running thing and the comment doesn't say why you're doing it.
The main goal is really to minimize the scheduling latency of tasks. If the cpu already has dozens of task in its runqueue the scheduling latency can become quite huge. In this case it's worth using another core
I'm fairly sure there's software out there that wakes a gazillion threads at a time only for a gazillion-1 to go back to sleep immediately. Patterns like that completely defeat your heuristic.
Does that matter... I don't know.
What happens if you keep this thing simple and only put a cap on utilization (say 80%) and drop the entire nr_running magic? Have you seen it make an actual difference or did it just seem like a good (TM) thing to do?
It was a efficient way
+static bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 20% in average */
return ((p->se.avg.runnable_avg_sum * 5) <
(p->se.avg.runnable_avg_period));
+}
There superfluous () and ' ' in there. Also why 20%.. seemed like a good number? Do we want a SCHED_DEBUG sysctl for that?
I have chosen 20% because it will ensure that the packing mechanism will only apply on tasks that match the 2 following conditions: - less than 10 ms during the last single run - and for those tasks less than 20% of the time in average
These tasks are nearly never catch by the periodic load balance and they are so small that the overall scheduling latency is nearly not impacted while the cpu is not too busy
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Reviewed-by: Morten Rasmussen morten.rasmussen@arm.com --- kernel/sched/fair.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6adc57c..a985c98 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5469,7 +5469,26 @@ static struct {
static inline int find_new_ilb(int call_cpu) { + struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask); + int buddy = per_cpu(sd_pack_buddy, call_cpu); + + /* + * If we have a pack buddy CPU, we try to run load balance on a CPU + * that is close to the buddy. + */ + if (buddy != -1) { + for_each_domain(buddy, sd) { + if (sd->flags & SD_SHARE_CPUPOWER) + continue; + + ilb = cpumask_first_and(sched_domain_span(sd), + nohz.idle_cpus_mask); + + if (ilb < nr_cpu_ids) + break; + } + }
if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb;
On Thu, Apr 25, 2013 at 07:23:20PM +0200, Vincent Guittot wrote:
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Reviewed-by: Morten Rasmussen morten.rasmussen@arm.com
kernel/sched/fair.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6adc57c..a985c98 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5469,7 +5469,26 @@ static struct { static inline int find_new_ilb(int call_cpu) {
- struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask);
- int buddy = per_cpu(sd_pack_buddy, call_cpu);
- /*
* If we have a pack buddy CPU, we try to run load balance on a CPU
* that is close to the buddy.
*/
- if (buddy != -1) {
for_each_domain(buddy, sd) {
if (sd->flags & SD_SHARE_CPUPOWER)
continue;
ilb = cpumask_first_and(sched_domain_span(sd),
nohz.idle_cpus_mask);
if (ilb < nr_cpu_ids)
break;
}
- }
if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb;
Ha! and here you hope people won't put multiple big-little clusters in a single machine? :-)
On 26 April 2013 14:49, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 25, 2013 at 07:23:20PM +0200, Vincent Guittot wrote:
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Reviewed-by: Morten Rasmussen morten.rasmussen@arm.com
kernel/sched/fair.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6adc57c..a985c98 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5469,7 +5469,26 @@ static struct {
static inline int find_new_ilb(int call_cpu) {
struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask);
int buddy = per_cpu(sd_pack_buddy, call_cpu);
/*
* If we have a pack buddy CPU, we try to run load balance on a CPU
* that is close to the buddy.
*/
if (buddy != -1) {
for_each_domain(buddy, sd) {
if (sd->flags & SD_SHARE_CPUPOWER)
continue;
ilb = cpumask_first_and(sched_domain_span(sd),
nohz.idle_cpus_mask);
if (ilb < nr_cpu_ids)
break;
}
} if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb;
Ha! and here you hope people won't put multiple big-little clusters in a single machine? :-)
yes, we will probably face this situation sooner or later but the other little clusters will probably be not less close than the local big cluster from a power domain point of view. That's why i look for the small sched_domain level to the largest one
The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be power gated independantly. So we clear SD_SHARE_POWERDOMAIN at MC and CPU level.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org Reviewed-by: Morten Rasmussen morten.rasmussen@arm.com --- arch/arm/kernel/topology.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 79282eb..f89a4a2 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];
+int arch_sd_local_flags(int level) +{ + /* Powergate at threading level doesn't make sense */ + if (level & SD_SHARE_CPUPOWER) + return 1*SD_SHARE_POWERDOMAIN; + + return 0*SD_SHARE_POWERDOMAIN; +} + const struct cpumask *cpu_coregroup_mask(int cpu) { return &cpu_topology[cpu].core_sibling;
There are 3 packing levels: - the default one only packs the small tasks when the system is not busy - the none level doesn't pack anything - the full level uses as few as possible number of CPUs based on the current activity of the system
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- include/linux/sched/sysctl.h | 8 ++++++++ kernel/sched/fair.c | 12 ++++++++++++ kernel/sysctl.c | 13 +++++++++++++ 3 files changed, 33 insertions(+)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index bf8086b..b72a8b8 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -44,6 +44,14 @@ enum sched_tunable_scaling { }; extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
+enum sched_tunable_packing { + SCHED_PACKING_NONE, + SCHED_PACKING_DEFAULT, + SCHED_PACKING_FULL, +}; + +extern enum sched_tunable_packing __read_mostly sysctl_sched_packing_mode; + extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a985c98..98166aa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -113,6 +113,18 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL; unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL; #endif
+#ifdef CONFIG_SMP +/* + * The packing policy of the scheduler + * + * Options are: + * SCHED_PACKING_NONE - No buddy is used for packing some tasks + * SCHED_PACKING_DEFAULT - The small tasks are packed on a not busy CPUs + * SCHED_PACKING_FULL - All Tasks are packed in a minimum number of CPUs + */ +enum sched_tunable_packing sysctl_sched_packing_mode = SCHED_PACKING_DEFAULT; + +#endif /* * Increase the granularity value when there are more CPUs, * because with more CPUs the 'effective latency' as visible diff --git a/kernel/sysctl.c b/kernel/sysctl.c index afc1dc6..ca22f59 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -265,6 +265,8 @@ static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */ #ifdef CONFIG_SMP static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE; static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1; +static int min_sched_packing_mode = SCHED_PACKING_NONE; +static int max_sched_packing_mode = SCHED_PACKING_FULL-1; #endif /* CONFIG_SMP */ #endif /* CONFIG_SCHED_DEBUG */
@@ -281,6 +283,17 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_SMP + { + .procname = "sched_packing_mode", + .data = &sysctl_sched_packing_mode, + .maxlen = sizeof(enum sched_tunable_packing), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &min_sched_packing_mode, + .extra2 = &max_sched_packing_mode, + }, +#endif #ifdef CONFIG_SCHED_DEBUG { .procname = "sched_min_granularity_ns",
According to the packing policy, the scheduler can pack tasks at different step: -SCHED_PACKING_NONE level: we don't pack any task. -SCHED_PACKING_DEFAULT: we only pack small tasks at wake up when system is not busy. -SCHED_PACKING_FULL: we pack tasks at wake up until a CPU becomes full. During a fork or a exec, we assume that the new task is a full running one and we look for an idle CPU close to the buddy CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 98166aa..874f330 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3259,13 +3259,16 @@ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu, int load_idx) { - struct sched_group *idlest = NULL, *group = sd->groups; + struct sched_group *idlest = NULL, *group = sd->groups, *buddy = NULL; unsigned long min_load = ULONG_MAX, this_load = 0; int imbalance = 100 + (sd->imbalance_pct-100)/2; + int buddy_cpu = per_cpu(sd_pack_buddy, this_cpu); + int get_buddy = ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) && + !(sd->flags & SD_SHARE_POWERDOMAIN) && (buddy_cpu != -1));
do { unsigned long load, avg_load; - int local_group; + int local_group, buddy_group = 0; int i;
/* Skip over this group if it has no CPUs allowed */ @@ -3276,6 +3279,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, local_group = cpumask_test_cpu(this_cpu, sched_group_cpus(group));
+ if (get_buddy) { + buddy_group = cpumask_test_cpu(buddy_cpu, + sched_group_cpus(group)); + } + /* Tally up the load of all CPUs in the group */ avg_load = 0;
@@ -3287,6 +3295,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, load = target_load(i, load_idx);
avg_load += load; + + if ((buddy_group) && idle_cpu(i)) + buddy = group; }
/* Adjust by relative CPU power of the group */ @@ -3300,6 +3311,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, } } while (group = group->next, group != sd->groups);
+ if (buddy) + return buddy; + if (!idlest || 100*this_load < imbalance*min_load) return NULL; return idlest; @@ -3402,6 +3416,21 @@ static bool is_buddy_busy(int cpu) return (sum > (period / (rq->nr_running + 2))); }
+static bool is_buddy_full(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + u32 sum = rq->avg.runnable_avg_sum; + u32 period = rq->avg.runnable_avg_period; + + sum = min(sum, period); + + /* + * A full buddy is a CPU with a sum greater or equal to period + * We keep a margin of 2.4% + */ + return (sum * 1024 >= period * 1000); +} + static bool is_light_task(struct task_struct *p) { /* A light task runs less than 20% in average */ @@ -3413,6 +3442,9 @@ static int check_pack_buddy(int cpu, struct task_struct *p) { int buddy = per_cpu(sd_pack_buddy, cpu);
+ if (sysctl_sched_packing_mode == SCHED_PACKING_NONE) + return false; + /* No pack buddy for this CPU */ if (buddy == -1) return false; @@ -3421,14 +3453,19 @@ static int check_pack_buddy(int cpu, struct task_struct *p) if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p))) return false;
+ /* We agressively pack at wake up */ + if ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) + && !is_buddy_full(buddy)) + return true; /* * If the task is a small one and the buddy is not overloaded, * we use buddy cpu */ - if (!is_light_task(p) || is_buddy_busy(buddy)) - return false; + if (is_light_task(p) && !is_buddy_busy(buddy)) + return true; + + return false;
- return true; }
/*
On Thu, Apr 25, 2013 at 07:23:23PM +0200, Vincent Guittot wrote:
According to the packing policy, the scheduler can pack tasks at different step: -SCHED_PACKING_NONE level: we don't pack any task. -SCHED_PACKING_DEFAULT: we only pack small tasks at wake up when system is not busy. -SCHED_PACKING_FULL: we pack tasks at wake up until a CPU becomes full. During a fork or a exec, we assume that the new task is a full running one and we look for an idle CPU close to the buddy CPU.
This changelog is very short on explaining how it will go about achieving these goals.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 98166aa..874f330 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3259,13 +3259,16 @@ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p,
So for packing into power domains, wouldn't you typically pick the busiest non- full domain to fill from other non-full?
Picking the idlest non-full seems like it would generate a ping-pong or not actually pack anything.
On 26 April 2013 15:08, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 25, 2013 at 07:23:23PM +0200, Vincent Guittot wrote:
According to the packing policy, the scheduler can pack tasks at different step: -SCHED_PACKING_NONE level: we don't pack any task. -SCHED_PACKING_DEFAULT: we only pack small tasks at wake up when system is not busy. -SCHED_PACKING_FULL: we pack tasks at wake up until a CPU becomes full. During a fork or a exec, we assume that the new task is a full running one and we look for an idle CPU close to the buddy CPU.
This changelog is very short on explaining how it will go about achieving these goals.
I could move some explanation of the cover letter inside the commit : In this case, the CPUs pack their tasks in their buddy until they becomes full. Unlike the previous step, we can't keep the same buddy so we update it during load balance. During the periodic load balance, the scheduler computes the activity of the system thanks the runnable_avg_sum and the cpu_power of all CPUs and then it defines the CPUs that will be used to handle the current activity. The selected CPUs will be their own buddy and will participate to the default load balancing mecanism in order to share the tasks in a fair way, whereas the not selected CPUs will not, and their buddy will be the last selected CPU. The behavior can be summarized as: The scheduler defines how many CPUs are required to handle the current activity, keeps the tasks on these CPUS and perform normal load balancing
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 98166aa..874f330 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3259,13 +3259,16 @@ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p,
A task that wakes up will be caught by the function check_pack_buddy in order to stay in the CPUs that participates to the packing effort. We will use the find_idlest_group only for fork/exec tasks which are considered as full running tasks so we looks for the idlest CPU close to the buddy.
So for packing into power domains, wouldn't you typically pick the busiest non- full domain to fill from other non-full?
Picking the idlest non-full seems like it would generate a ping-pong or not actually pack anything.
If the buddy CPU is not full and currently idle, we trigger an Idle Load Balance to give it the opportunity to pull more tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 874f330..954adfd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5776,6 +5776,26 @@ end: clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)); }
+static int check_nohz_buddy(int cpu) +{ + int buddy = per_cpu(sd_pack_buddy, cpu); + + if (sysctl_sched_packing_mode != SCHED_PACKING_FULL) + return false; + + /* No pack buddy for this CPU */ + if (buddy == -1) + return false; + + if (is_buddy_full(buddy)) + return false; + + if (cpumask_test_cpu(buddy, nohz.idle_cpus_mask)) + return true; + + return false; +} + /* * Current heuristic for kicking the idle load balancer in the presence * of an idle cpu is the system. @@ -5813,6 +5833,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) if (rq->nr_running >= 2) goto need_kick;
+ /* the buddy is idle and not busy so we can pack */ + if (check_nohz_buddy(cpu)) + goto need_kick; + rcu_read_lock(); for_each_domain(cpu, sd) { struct sched_group *sg = sd->groups;
On Thu, Apr 25, 2013 at 07:23:24PM +0200, Vincent Guittot wrote:
If the buddy CPU is not full and currently idle, we trigger an Idle Load Balance to give it the opportunity to pull more tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 874f330..954adfd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5776,6 +5776,26 @@ end: clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)); } +static int check_nohz_buddy(int cpu) +{
- int buddy = per_cpu(sd_pack_buddy, cpu);
- if (sysctl_sched_packing_mode != SCHED_PACKING_FULL)
return false;
- /* No pack buddy for this CPU */
- if (buddy == -1)
return false;
- if (is_buddy_full(buddy))
return false;
- if (cpumask_test_cpu(buddy, nohz.idle_cpus_mask))
return true;
- return false;
return cpumask_test_cpu(..); ?
+}
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
@@ -5813,6 +5833,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) if (rq->nr_running >= 2) goto need_kick;
- /* the buddy is idle and not busy so we can pack */
- if (check_nohz_buddy(cpu))
goto need_kick;
- rcu_read_lock(); for_each_domain(cpu, sd) { struct sched_group *sg = sd->groups;
Bah.. this code is so confusing.. nohz_balance_kick(cpu) won't actually kick @cpu.
Again.. suppose we're a big cpu (2) and our little buddy cpu (0) is busy, we miss an opportunity to kick the other little cpu into gear (1) and maybe take the big core out.
On 26 April 2013 15:15, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Apr 25, 2013 at 07:23:24PM +0200, Vincent Guittot wrote:
If the buddy CPU is not full and currently idle, we trigger an Idle Load Balance to give it the opportunity to pull more tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 874f330..954adfd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5776,6 +5776,26 @@ end: clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)); }
+static int check_nohz_buddy(int cpu) +{
int buddy = per_cpu(sd_pack_buddy, cpu);
if (sysctl_sched_packing_mode != SCHED_PACKING_FULL)
return false;
/* No pack buddy for this CPU */
if (buddy == -1)
return false;
if (is_buddy_full(buddy))
return false;
if (cpumask_test_cpu(buddy, nohz.idle_cpus_mask))
return true;
return false;
return cpumask_test_cpu(..); ?
right
+}
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
@@ -5813,6 +5833,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) if (rq->nr_running >= 2) goto need_kick;
/* the buddy is idle and not busy so we can pack */
if (check_nohz_buddy(cpu))
goto need_kick;
rcu_read_lock(); for_each_domain(cpu, sd) { struct sched_group *sg = sd->groups;
Bah.. this code is so confusing.. nohz_balance_kick(cpu) won't actually kick @cpu.
Again.. suppose we're a big cpu (2) and our little buddy cpu (0) is busy, we miss an opportunity to kick the other little cpu into gear (1) and maybe take the big core out.
You're right that we can miss an opportunity but it means that cpus allocated to packing effort are full and a new buddy cpu is going to be allocated and will pull the task a bit later.
We evaluate the activity level of CPUs, groups and domains in order to define how many CPUs are required to handle the current activity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 954adfd..234ecdd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3465,7 +3465,22 @@ static int check_pack_buddy(int cpu, struct task_struct *p) return true;
return false; +} + +static int get_cpu_activity(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + u32 sum = rq->avg.runnable_avg_sum; + u32 period = rq->avg.runnable_avg_period; + + sum = min(sum, period); + + if (sum == period) { + u32 overload = rq->nr_running > 1 ? 1 : 0; + return power_of(cpu) + overload; + }
+ return (sum * power_of(cpu)) / period; }
/* @@ -4355,6 +4370,7 @@ struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *this; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */ + unsigned long total_activity; /* Total activity of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
@@ -4383,6 +4399,7 @@ struct sd_lb_stats { struct sg_lb_stats { unsigned long avg_load; /*Avg load across the CPUs of the group */ unsigned long group_load; /* Total load over the CPUs of the group */ + unsigned long group_activity; /* Total activity of the group */ unsigned long sum_nr_running; /* Nr tasks running in the group */ unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long group_capacity; @@ -4629,6 +4646,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, }
sgs->group_load += load; + sgs->group_activity += get_cpu_activity(i); sgs->sum_nr_running += nr_running; sgs->sum_weighted_load += weighted_cpuload(i); if (idle_cpu(i)) @@ -4752,6 +4770,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, return;
sds->total_load += sgs.group_load; + sds->total_activity += sgs.group_activity; sds->total_pwr += sg->sgp->power;
/*
On Thu, Apr 25, 2013 at 06:23:25PM +0100, Vincent Guittot wrote:
We evaluate the activity level of CPUs, groups and domains in order to define how many CPUs are required to handle the current activity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 954adfd..234ecdd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3465,7 +3465,22 @@ static int check_pack_buddy(int cpu, struct task_struct *p) return true; return false; +}
+static int get_cpu_activity(int cpu) +{
- struct rq *rq = cpu_rq(cpu);
- u32 sum = rq->avg.runnable_avg_sum;
- u32 period = rq->avg.runnable_avg_period;
- sum = min(sum, period);
- if (sum == period) {
u32 overload = rq->nr_running > 1 ? 1 : 0;
return power_of(cpu) + overload;
- }
Adding one to indicate cpu overload seems to work well in the cases that I can think of. It may take some time to have effect if the overloaded cpu is in a group with nearly overloaded cpus. Adding one won't enable more cpus in another group until they have all reached overload.
Reaching sum == period takes quite a long time. Would it be an idea to use sum*1024 >= period*1000 like you do in is_buddy_full()?
I think that would be more consistent since overloaded is essential the same as full.
Cheers, Morten
- return (sum * power_of(cpu)) / period;
} /* @@ -4355,6 +4370,7 @@ struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *this; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_activity; /* Total activity of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
@@ -4383,6 +4399,7 @@ struct sd_lb_stats { struct sg_lb_stats { unsigned long avg_load; /*Avg load across the CPUs of the group */ unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long group_activity; /* Total activity of the group */ unsigned long sum_nr_running; /* Nr tasks running in the group */ unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long group_capacity;
@@ -4629,6 +4646,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, } sgs->group_load += load;
sgs->sum_nr_running += nr_running; sgs->sum_weighted_load += weighted_cpuload(i); if (idle_cpu(i))sgs->group_activity += get_cpu_activity(i);
@@ -4752,6 +4770,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, return; sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;sds->total_activity += sgs.group_activity;
/* -- 1.7.9.5
On 22 May 2013 18:50, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Apr 25, 2013 at 06:23:25PM +0100, Vincent Guittot wrote:
We evaluate the activity level of CPUs, groups and domains in order to define how many CPUs are required to handle the current activity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 954adfd..234ecdd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3465,7 +3465,22 @@ static int check_pack_buddy(int cpu, struct task_struct *p) return true;
return false;
+}
+static int get_cpu_activity(int cpu) +{
struct rq *rq = cpu_rq(cpu);
u32 sum = rq->avg.runnable_avg_sum;
u32 period = rq->avg.runnable_avg_period;
sum = min(sum, period);
if (sum == period) {
u32 overload = rq->nr_running > 1 ? 1 : 0;
return power_of(cpu) + overload;
}
Adding one to indicate cpu overload seems to work well in the cases that I can think of. It may take some time to have effect if the overloaded cpu is in a group with nearly overloaded cpus. Adding one won't enable more cpus in another group until they have all reached overload.
This output expresses the activity of the CPU and i don't want to hack it unnecessarily. The +1 is only used to make difference between a cpu fully loaded by 1 task and a cpu which need help of other cpus because it is fully loaded by several tasks which share the runtime of the cpu.
When we define how many cpus are needed for the packing effort, we can take some margins like the is_buddy_full function.
Reaching sum == period takes quite a long time. Would it be an idea to use sum*1024 >= period*1000 like you do in is_buddy_full()?
We could probably use this margin when we update the list of cpus that are involved in the packing effort
Thanks for the comments Vincent
I think that would be more consistent since overloaded is essential the same as full.
Cheers, Morten
return (sum * power_of(cpu)) / period;
}
/* @@ -4355,6 +4370,7 @@ struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *this; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_activity; /* Total activity of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
@@ -4383,6 +4399,7 @@ struct sd_lb_stats { struct sg_lb_stats { unsigned long avg_load; /*Avg load across the CPUs of the group */ unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long group_activity; /* Total activity of the group */ unsigned long sum_nr_running; /* Nr tasks running in the group */ unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long group_capacity;
@@ -4629,6 +4646,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, }
sgs->group_load += load;
sgs->group_activity += get_cpu_activity(i); sgs->sum_nr_running += nr_running; sgs->sum_weighted_load += weighted_cpuload(i); if (idle_cpu(i))
@@ -4752,6 +4770,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, return;
sds->total_load += sgs.group_load;
sds->total_activity += sgs.group_activity; sds->total_pwr += sg->sgp->power; /*
-- 1.7.9.5
Periodically updates the buddy of a CPU according to the current activity of the system. A CPU is its own buddy if it participates to the packing effort. Otherwise, it points to a CPU that participates to the packing effort.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 86 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 234ecdd..28f8ea7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -174,11 +174,17 @@ void sched_init_granularity(void)
#ifdef CONFIG_SMP +static unsigned long power_of(int cpu) +{ + return cpu_rq(cpu)->cpu_power; +} + /* * Save the id of the optimal CPU that should be used to pack small tasks * The value -1 is used when no buddy has been found */ DEFINE_PER_CPU(int, sd_pack_buddy); +DEFINE_PER_CPU(struct sched_domain *, sd_pack_domain);
/* * Look for the best buddy CPU that can be used to pack small tasks @@ -237,6 +243,68 @@ void update_packing_domain(int cpu) }
pr_debug("CPU%d packing on CPU%d\n", cpu, id); + per_cpu(sd_pack_domain, cpu) = sd; + per_cpu(sd_pack_buddy, cpu) = id; +} + +void update_packing_buddy(int cpu, int activity) +{ + struct sched_domain *sd = per_cpu(sd_pack_domain, cpu); + struct sched_group *sg, *pack, *tmp; + int id = cpu; + + if (!sd) + return; + + /* + * The sched_domain of a CPU points on the local sched_group + * and this CPU of this local group is a good candidate + */ + pack = sg = sd->groups; + + /* loop the sched groups to find the best one */ + for (tmp = sg->next; tmp != sg; tmp = tmp->next) { + if ((tmp->sgp->power * pack->group_weight) > + (pack->sgp->power_available * tmp->group_weight)) + continue; + + if (((tmp->sgp->power * pack->group_weight) == + (pack->sgp->power * tmp->group_weight)) + && (cpumask_first(sched_group_cpus(tmp)) >= id)) + continue; + + /* we have found a better group */ + pack = tmp; + + /* Take the 1st CPU of the new group */ + id = cpumask_first(sched_group_cpus(pack)); + } + + if ((cpu == id) || (activity <= power_of(id))) { + per_cpu(sd_pack_buddy, cpu) = id; + return; + } + + for (tmp = pack; activity > 0; tmp = tmp->next) { + if (tmp->sgp->power > activity) { + id = cpumask_first(sched_group_cpus(tmp)); + activity -= power_of(id); + if (cpu == id) + activity = 0; + while ((activity > 0) && (id < nr_cpu_ids)) { + id = cpumask_next(id, sched_group_cpus(tmp)); + activity -= power_of(id); + if (cpu == id) + activity = 0; + } + } else if (cpumask_test_cpu(cpu, sched_group_cpus(tmp))) { + id = cpu; + activity = 0; + } else { + activity -= tmp->sgp->power; + } + } + per_cpu(sd_pack_buddy, cpu) = id; }
@@ -3014,11 +3082,6 @@ static unsigned long target_load(int cpu, int type) return max(rq->cpu_load[type-1], total); }
-static unsigned long power_of(int cpu) -{ - return cpu_rq(cpu)->cpu_power; -} - static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -4740,6 +4803,22 @@ static bool update_sd_pick_busiest(struct lb_env *env, return false; }
+static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds, + struct sched_domain *sd) +{ + int buddy; + + if (sysctl_sched_packing_mode != SCHED_PACKING_FULL) + return; + + /* Update my buddy */ + if (sd == per_cpu(sd_pack_domain, cpu)) + update_packing_buddy(cpu, sds->total_activity); + + /* Get my new buddy */ + buddy = per_cpu(sd_pack_buddy, cpu); +} + /** * update_sd_lb_stats - Update sched_domain's statistics for load balancing. * @env: The load balancing environment. @@ -4807,6 +4886,8 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sg = sg->next; } while (sg != env->sd->groups); + + update_plb_buddy(env->dst_cpu, balance, sds, env->sd); }
/**
Hi,
On 04/25/2013 07:23 PM, Vincent Guittot wrote:
Periodically updates the buddy of a CPU according to the current activity of the system. A CPU is its own buddy if it participates to the packing effort. Otherwise, it points to a CPU that participates to the packing effort.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 86 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 234ecdd..28f8ea7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -174,11 +174,17 @@ void sched_init_granularity(void) #ifdef CONFIG_SMP +static unsigned long power_of(int cpu) +{
- return cpu_rq(cpu)->cpu_power;
+}
/*
- Save the id of the optimal CPU that should be used to pack small tasks
- The value -1 is used when no buddy has been found
*/ DEFINE_PER_CPU(int, sd_pack_buddy); +DEFINE_PER_CPU(struct sched_domain *, sd_pack_domain); /*
- Look for the best buddy CPU that can be used to pack small tasks
@@ -237,6 +243,68 @@ void update_packing_domain(int cpu) } pr_debug("CPU%d packing on CPU%d\n", cpu, id);
- per_cpu(sd_pack_domain, cpu) = sd;
- per_cpu(sd_pack_buddy, cpu) = id;
+}
+void update_packing_buddy(int cpu, int activity) +{
- struct sched_domain *sd = per_cpu(sd_pack_domain, cpu);
- struct sched_group *sg, *pack, *tmp;
- int id = cpu;
- if (!sd)
return;
- /*
* The sched_domain of a CPU points on the local sched_group
* and this CPU of this local group is a good candidate
*/
- pack = sg = sd->groups;
- /* loop the sched groups to find the best one */
- for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
if ((tmp->sgp->power * pack->group_weight) >
(pack->sgp->power_available * tmp->group_weight))
The power_available struct member is defined in a subsequent patch (12/14), so this patch series would break git bisect.
-- Francesco
On 28 April 2013 10:20, Francesco Lavra francescolavra.fl@gmail.com wrote:
Hi,
On 04/25/2013 07:23 PM, Vincent Guittot wrote:
Periodically updates the buddy of a CPU according to the current activity of the system. A CPU is its own buddy if it participates to the packing effort. Otherwise, it points to a CPU that participates to the packing effort.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 86 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 234ecdd..28f8ea7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -174,11 +174,17 @@ void sched_init_granularity(void)
#ifdef CONFIG_SMP +static unsigned long power_of(int cpu) +{
return cpu_rq(cpu)->cpu_power;
+}
/*
- Save the id of the optimal CPU that should be used to pack small tasks
- The value -1 is used when no buddy has been found
*/ DEFINE_PER_CPU(int, sd_pack_buddy); +DEFINE_PER_CPU(struct sched_domain *, sd_pack_domain);
/*
- Look for the best buddy CPU that can be used to pack small tasks
@@ -237,6 +243,68 @@ void update_packing_domain(int cpu) }
pr_debug("CPU%d packing on CPU%d\n", cpu, id);
per_cpu(sd_pack_domain, cpu) = sd;
per_cpu(sd_pack_buddy, cpu) = id;
+}
+void update_packing_buddy(int cpu, int activity) +{
struct sched_domain *sd = per_cpu(sd_pack_domain, cpu);
struct sched_group *sg, *pack, *tmp;
int id = cpu;
if (!sd)
return;
/*
* The sched_domain of a CPU points on the local sched_group
* and this CPU of this local group is a good candidate
*/
pack = sg = sd->groups;
/* loop the sched groups to find the best one */
for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
if ((tmp->sgp->power * pack->group_weight) >
(pack->sgp->power_available * tmp->group_weight))
The power_available struct member is defined in a subsequent patch (12/14), so this patch series would break git bisect.
Hi Francesco,
yes, power_available should have been added in patch 13/14. I will correct that
Thanks Vincent
-- Francesco
Only CPUs that participates to the packing effort can pull tasks on a busiest group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 28f8ea7..6f63fb9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu) return (sum * 1024 >= period * 1000); }
+static bool is_my_buddy(int cpu, int buddy) +{ + int my_buddy = per_cpu(sd_pack_buddy, cpu); + return (my_buddy == -1) || (buddy == my_buddy); +} + static bool is_light_task(struct task_struct *p) { /* A light task runs less than 20% in average */ @@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Bias balancing toward cpus of our domain */ if (local_group) { - if (idle_cpu(i) && !first_idle_cpu && - cpumask_test_cpu(i, sched_group_mask(group))) { + if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu + && cpumask_test_cpu(i, sched_group_mask(group))) { first_idle_cpu = 1; balance_cpu = i; } @@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
/* Get my new buddy */ buddy = per_cpu(sd_pack_buddy, cpu); + + /* This CPU doesn't act for agressive packing */ + if (buddy != cpu) + sds->busiest = 0; }
/**
Part of this patch is missing, the fix below is needed
@@ -3497,7 +3497,9 @@ static bool is_buddy_full(int cpu) static bool is_my_buddy(int cpu, int buddy) { int my_buddy = per_cpu(sd_pack_buddy, cpu); - return (my_buddy == -1) || (buddy == my_buddy); + + return ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) && + ((my_buddy == -1) || (buddy == my_buddy))); }
static bool is_light_task(struct task_struct *p)
On 25 April 2013 19:23, Vincent Guittot vincent.guittot@linaro.org wrote:
Only CPUs that participates to the packing effort can pull tasks on a busiest group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 28f8ea7..6f63fb9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu) return (sum * 1024 >= period * 1000); }
+static bool is_my_buddy(int cpu, int buddy) +{
int my_buddy = per_cpu(sd_pack_buddy, cpu);
return (my_buddy == -1) || (buddy == my_buddy);
+}
static bool is_light_task(struct task_struct *p) { /* A light task runs less than 20% in average */ @@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Bias balancing toward cpus of our domain */ if (local_group) {
if (idle_cpu(i) && !first_idle_cpu &&
cpumask_test_cpu(i, sched_group_mask(group))) {
if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu
&& cpumask_test_cpu(i, sched_group_mask(group))) { first_idle_cpu = 1; balance_cpu = i; }
@@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
/* Get my new buddy */ buddy = per_cpu(sd_pack_buddy, cpu);
/* This CPU doesn't act for agressive packing */
if (buddy != cpu)
sds->busiest = 0;
}
/**
1.7.9.5
On Fri, Apr 26, 2013 at 11:00:58AM +0100, Vincent Guittot wrote:
Part of this patch is missing, the fix below is needed
@@ -3497,7 +3497,9 @@ static bool is_buddy_full(int cpu) static bool is_my_buddy(int cpu, int buddy) { int my_buddy = per_cpu(sd_pack_buddy, cpu);
- return (my_buddy == -1) || (buddy == my_buddy);
- return ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
((my_buddy == -1) || (buddy == my_buddy)));
}
static bool is_light_task(struct task_struct *p)
On 25 April 2013 19:23, Vincent Guittot vincent.guittot@linaro.org wrote:
Only CPUs that participates to the packing effort can pull tasks on a busiest group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 28f8ea7..6f63fb9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu) return (sum * 1024 >= period * 1000); }
+static bool is_my_buddy(int cpu, int buddy) +{
int my_buddy = per_cpu(sd_pack_buddy, cpu);
return (my_buddy == -1) || (buddy == my_buddy);
+}
Would it make sense to change the function name to something like is_packing_target() and only have one argument?
is_my_buddy() is only used with the same variable for both arguments like below.
static bool is_light_task(struct task_struct *p) { /* A light task runs less than 20% in average */ @@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Bias balancing toward cpus of our domain */ if (local_group) {
if (idle_cpu(i) && !first_idle_cpu &&
cpumask_test_cpu(i, sched_group_mask(group))) {
if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu
&& cpumask_test_cpu(i, sched_group_mask(group))) { first_idle_cpu = 1; balance_cpu = i; }
@@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
/* Get my new buddy */ buddy = per_cpu(sd_pack_buddy, cpu);
/* This CPU doesn't act for agressive packing */
if (buddy != cpu)
sds->busiest = 0;
sds->busiest is a pointer, so I think it should be assigned to NULL instead of 0.
Morten
}
/**
1.7.9.5
On 22 May 2013 17:56, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, Apr 26, 2013 at 11:00:58AM +0100, Vincent Guittot wrote:
Part of this patch is missing, the fix below is needed
@@ -3497,7 +3497,9 @@ static bool is_buddy_full(int cpu) static bool is_my_buddy(int cpu, int buddy) { int my_buddy = per_cpu(sd_pack_buddy, cpu);
- return (my_buddy == -1) || (buddy == my_buddy);
- return ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
((my_buddy == -1) || (buddy == my_buddy)));
}
static bool is_light_task(struct task_struct *p)
On 25 April 2013 19:23, Vincent Guittot vincent.guittot@linaro.org wrote:
Only CPUs that participates to the packing effort can pull tasks on a busiest group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 28f8ea7..6f63fb9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu) return (sum * 1024 >= period * 1000); }
+static bool is_my_buddy(int cpu, int buddy) +{
int my_buddy = per_cpu(sd_pack_buddy, cpu);
return (my_buddy == -1) || (buddy == my_buddy);
+}
Would it make sense to change the function name to something like is_packing_target() and only have one argument?
I have replaced it with is_packing_cpu(int cpu) in my next version. This function returns true if the cpu is part of the packing effort
is_my_buddy() is only used with the same variable for both arguments like below.
static bool is_light_task(struct task_struct *p) { /* A light task runs less than 20% in average */ @@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Bias balancing toward cpus of our domain */ if (local_group) {
if (idle_cpu(i) && !first_idle_cpu &&
cpumask_test_cpu(i, sched_group_mask(group))) {
if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu
&& cpumask_test_cpu(i, sched_group_mask(group))) { first_idle_cpu = 1; balance_cpu = i; }
@@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
/* Get my new buddy */ buddy = per_cpu(sd_pack_buddy, cpu);
/* This CPU doesn't act for agressive packing */
if (buddy != cpu)
sds->busiest = 0;
sds->busiest is a pointer, so I think it should be assigned to NULL instead of 0.
yes
Thanks Vincent
Morten
}
/**
1.7.9.5
This new field power_available reflects the available capacity of a CPU unlike the cpu_power which reflects the current capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- include/linux/sched.h | 2 +- kernel/sched/fair.c | 18 +++++++++++++++--- kernel/sched/sched.h | 1 + 3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index a82cdeb..03a5143 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -824,7 +824,7 @@ struct sched_group_power { * CPU power of this group, SCHED_LOAD_SCALE being max power for a * single CPU. */ - unsigned int power, power_orig; + unsigned int power, power_orig, power_available; unsigned long next_update; /* * Number of busy cpus in this group. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6f63fb9..756b1e3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -179,6 +179,11 @@ static unsigned long power_of(int cpu) return cpu_rq(cpu)->cpu_power; }
+static unsigned long available_of(int cpu) +{ + return cpu_rq(cpu)->cpu_available; +} + /* * Save the id of the optimal CPU that should be used to pack small tasks * The value -1 is used when no buddy has been found @@ -4588,6 +4593,9 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) if (!power) power = 1;
+ cpu_rq(cpu)->cpu_available = power; + sdg->sgp->power_available = power; + cpu_rq(cpu)->cpu_power = power; sdg->sgp->power = power; } @@ -4596,7 +4604,7 @@ void update_group_power(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long power; + unsigned long power, available; unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval); @@ -4608,7 +4616,7 @@ void update_group_power(struct sched_domain *sd, int cpu) return; }
- power = 0; + power = available = 0;
if (child->flags & SD_OVERLAP) { /* @@ -4618,6 +4626,8 @@ void update_group_power(struct sched_domain *sd, int cpu)
for_each_cpu(cpu, sched_group_cpus(sdg)) power += power_of(cpu); + available += available_of(cpu); + } else { /* * !SD_OVERLAP domains can assume that child groups @@ -4627,11 +4637,13 @@ void update_group_power(struct sched_domain *sd, int cpu) group = child->groups; do { power += group->sgp->power; + available += group->sgp->power_available; group = group->next; } while (group != child->groups); }
- sdg->sgp->power_orig = sdg->sgp->power = power; + sdg->sgp->power_orig = sdg->sgp->power_available = available; + sdg->sgp->power = power; }
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 96b164d..2b6eaf9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -413,6 +413,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_power; + unsigned long cpu_available;
unsigned char idle_balance; /* For active balancing */
The cpu_power is updated for CPUs that don't participate to the packing effort. We consider that their cpu_power is allocated to idleness as it could be allocated by rt. So the cpu_power that remains available for cfs, is set to min value (i.e. 1)
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 35 ++++++++++++++++++++--------------- 1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 756b1e3..54c1541 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -224,12 +224,12 @@ void update_packing_domain(int cpu)
/* loop the sched groups to find the best one */ for (tmp = sg->next; tmp != sg; tmp = tmp->next) { - if (tmp->sgp->power * pack->group_weight > - pack->sgp->power * tmp->group_weight) + if (tmp->sgp->power_available * pack->group_weight > + pack->sgp->power_available * tmp->group_weight) continue;
- if ((tmp->sgp->power * pack->group_weight == - pack->sgp->power * tmp->group_weight) + if ((tmp->sgp->power_available * pack->group_weight == + pack->sgp->power_available * tmp->group_weight) && (cpumask_first(sched_group_cpus(tmp)) >= id)) continue;
@@ -269,12 +269,12 @@ void update_packing_buddy(int cpu, int activity)
/* loop the sched groups to find the best one */ for (tmp = sg->next; tmp != sg; tmp = tmp->next) { - if ((tmp->sgp->power * pack->group_weight) > + if ((tmp->sgp->power_available * pack->group_weight) > (pack->sgp->power_available * tmp->group_weight)) continue;
- if (((tmp->sgp->power * pack->group_weight) == - (pack->sgp->power * tmp->group_weight)) + if (((tmp->sgp->power_available * pack->group_weight) == + (pack->sgp->power_available * tmp->group_weight)) && (cpumask_first(sched_group_cpus(tmp)) >= id)) continue;
@@ -285,20 +285,20 @@ void update_packing_buddy(int cpu, int activity) id = cpumask_first(sched_group_cpus(pack)); }
- if ((cpu == id) || (activity <= power_of(id))) { + if ((cpu == id) || (activity <= available_of(id))) { per_cpu(sd_pack_buddy, cpu) = id; return; }
for (tmp = pack; activity > 0; tmp = tmp->next) { - if (tmp->sgp->power > activity) { + if (tmp->sgp->power_available > activity) { id = cpumask_first(sched_group_cpus(tmp)); - activity -= power_of(id); + activity -= available_of(id); if (cpu == id) activity = 0; while ((activity > 0) && (id < nr_cpu_ids)) { id = cpumask_next(id, sched_group_cpus(tmp)); - activity -= power_of(id); + activity -= available_of(id); if (cpu == id) activity = 0; } @@ -306,7 +306,7 @@ void update_packing_buddy(int cpu, int activity) id = cpu; activity = 0; } else { - activity -= tmp->sgp->power; + activity -= tmp->sgp->power_available; } }
@@ -3369,7 +3369,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, }
/* Adjust by relative CPU power of the group */ - avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power; + avg_load = (avg_load * SCHED_POWER_SCALE) + / group->sgp->power_available;
if (local_group) { this_load = avg_load; @@ -3551,10 +3552,10 @@ static int get_cpu_activity(int cpu)
if (sum == period) { u32 overload = rq->nr_running > 1 ? 1 : 0; - return power_of(cpu) + overload; + return available_of(cpu) + overload; }
- return (sum * power_of(cpu)) / period; + return (sum * available_of(cpu)) / period; }
/* @@ -4596,8 +4597,12 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) cpu_rq(cpu)->cpu_available = power; sdg->sgp->power_available = power;
+ if (!is_my_buddy(cpu, cpu)) + power = 1; + cpu_rq(cpu)->cpu_power = power; sdg->sgp->power = power; + }
void update_group_power(struct sched_domain *sd, int cpu)
On Thu, Apr 25, 2013 at 06:23:29PM +0100, Vincent Guittot wrote:
@@ -4596,8 +4597,12 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) cpu_rq(cpu)->cpu_available = power; sdg->sgp->power_available = power;
- if (!is_my_buddy(cpu, cpu))
power = 1;
Using your fix for is_my_buddy() for patch 11 the above will always be true when sched_packing_mode is to anything else than SCHED_PACKING_FULL. So cpu_power for all cpus in SCHED_PACKING_{DEFAULT,NONE} is 1.
As far is I understand, this is not the intention?
Cheers, Morten
cpu_rq(cpu)->cpu_power = power; sdg->sgp->power = power;
} void update_group_power(struct sched_domain *sd, int cpu) -- 1.7.9.5
On 22 May 2013 17:46, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Apr 25, 2013 at 06:23:29PM +0100, Vincent Guittot wrote:
@@ -4596,8 +4597,12 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) cpu_rq(cpu)->cpu_available = power; sdg->sgp->power_available = power;
if (!is_my_buddy(cpu, cpu))
power = 1;
Using your fix for is_my_buddy() for patch 11 the above will always be true when sched_packing_mode is to anything else than SCHED_PACKING_FULL. So cpu_power for all cpus in SCHED_PACKING_{DEFAULT,NONE} is 1.
As far is I understand, this is not the intention?
yes, the too quick fix for 11/14 has not fixed anything... the final fix is available in git://git.linaro.org/people/vingu/kernel.git sched-pack-small-tasks-v4-fixed.
And it will be part of the next version that i prepare
Vincent
Cheers, Morten
cpu_rq(cpu)->cpu_power = power; sdg->sgp->power = power;
}
void update_group_power(struct sched_domain *sd, int cpu)
1.7.9.5
If a CPU that doesn't participate to the packing effort, has at least one running task, it means that its group is imbalanced and the CPUs can pull this task.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 54c1541..f87aed2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4729,6 +4729,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, max_nr_running = nr_running; if (min_nr_running > nr_running) min_nr_running = nr_running; + + if (!is_my_buddy(i, i) && nr_running > 0) + sgs->group_imb = 1; }
sgs->group_load += load;
Hi,
The patches are available in this git tree: git://git.linaro.org/people/vingu/kernel.git sched-pack-small-tasks-v4-fixed
Vincent
On 25 April 2013 19:23, Vincent Guittot vincent.guittot@linaro.org wrote:
Hi,
This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing the tasks in as few as possible CPU/Cluster/Core. It has got 2 packing modes: -The 1st mode packs the small tasks when the system is not too busy. The main goal is to reduce the power consumption in the low system load use cases by minimizing the number of power domain that are enabled but it also keeps the default behavior which is performance oriented. -The 2nd mode packs all tasks in as few as possible power domains in order to improve the power consumption of the system but at the cost of possible performance decrease because of the increase of the rate of ressources sharing compared to the default mode.
The packing is done in 3 steps (the last step is only applicable for the agressive packing mode):
The 1st step looks for the best place to pack tasks in a system according to its topology and it defines a 1st pack buddy CPU for each CPU if there is one available. The policy for defining a buddy CPU is that we want to pack at levels where a group of CPU can be power gated independently from others. To describe this capability, a new flag SD_SHARE_POWERDOMAIN has been introduced, that is used to indicate whether the groups of CPUs of a scheduling domain share their power state. By default, this flag is set in all sched_domain in order to keep unchanged the current behavior of the scheduler and only ARM platform clears the SD_SHARE_POWERDOMAIN flag for MC and CPU level.
In a 2nd step, the scheduler checks the load average of a task which wakes up as well as the load average of the buddy CPU and it can decide to migrate the light tasks on a not busy buddy. This check is done during the wake up because small tasks tend to wake up between periodic load balance and asynchronously to each other which prevents the default mechanism to catch and migrate them efficiently. A light task is defined by a runnable_avg_sum that is less than 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones: The average CPU load of the task must be less than 20% and the task must have been runnable less than 10ms when it woke up last time in order to be electable for the packing migration. So, a task than runs 1 ms each 5ms will be considered as a small task but a task that runs 50 ms with a period of 500ms, will not. Then, the business of the buddy CPU depends of the load average for the rq and the number of running tasks. A CPU with a load average greater than 50% will be considered as busy CPU whatever the number of running tasks is and this threshold will be reduced by the number of running tasks in order to not increase too much the wake up latency of a task. When the buddy CPU is busy, the scheduler falls back to default CFS policy.
The 3rd step is only used when the agressive packing mode is enable. In this case, the CPUs pack their tasks in their buddy until they becomes full. Unlike the previous step, we can't keep the same buddy so we update it during load balance. During the periodic load balance, the scheduler computes the activity of the system thanks the runnable_avg_sum and the cpu_power of all CPUs and then it defines the CPUs that will be used to handle the current activity. The selected CPUs will be their own buddy and will participate to the default load balancing mecanism in order to share the tasks in a fair way, whereas the not selected CPUs will not, and their buddy will be the last selected CPU. The behavior can be summarized as: The scheduler defines how many CPUs are required to handle the current activity, keeps the tasks on these CPUS and perform normal load balancing (or any evolution of the current load balancer like the use of runnable load avg from Alex https://lkml.org/lkml/2013/4/1/580) on this limited number of CPUs . Like the other steps, the CPUs are selected to minimize the number of power domain that must stay on.
Change since V3:
- Take into account comments on previous version.
- Add an agressive packing mode and a knob to select between the various mode
Change since V2:
- Migrate only a task that wakes up
- Change the light tasks threshold to 20%
- Change the loaded CPU threshold to not pull tasks if the current number of running tasks is null but the load average is already greater than 50%
- Fix the algorithm for selecting the buddy CPU.
Change since V1:
Patch 2/6
- Change the flag name which was not clear. The new name is SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
- Fix issues in the algorithm that looks for the best buddy CPU
- Use pr_debug instead of pr_info
- Fix for uniprocessor
Patch 4/6
- Remove the use of usage_avg_sum which has not been merged
Patch 5/6
- Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured
Patch 6/6
- Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM platform
Previous results for v3:
This series has been tested with hackbench on ARM platform and the results don't show any performance regression
Hackbench 3.9-rc2 +patches Mean Time (10 tests): 2.048 2.015 stdev : 0.047 0.068
Previous results for V2:
This series has been tested with MP3 play back on ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 cluster).
The measurements have been done on an Ubuntu image during 60 seconds of playback and the result has been normalized to 100.
| CA15 | CA7 | total |
default | 81 | 97 | 178 | pack | 13 | 100 | 113 |
Previous results for V1:
The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have demonstrated that it's worth packing small tasks at all topology levels.
The performance tests have been done on both platforms with sysbench. The results don't show any performance regressions. These results are aligned with the policy which uses the normal behavior with heavy use cases.
test: sysbench --test=cpu --num-threads=N --max-requests=R run
Results below is the average duration of 3 tests on the quad CA-9. default is the current scheduler behavior (pack buddy CPU is -1) pack is the scheduler with the pack mechanism
| default | pack |
N=8; R=200 | 3.1999 | 3.1921 | N=8; R=2000 | 31.4939 | 31.4844 | N=12; R=200 | 3.2043 | 3.2084 | N=12; R=2000 | 31.4897 | 31.4831 | N=16; R=200 | 3.1774 | 3.1824 | N=16; R=2000 | 31.4899 | 31.4897 |
The power consumption tests have been done only on TC2 platform which has got accessible power lines and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
The measurements have been done during 16 seconds and the result has been normalized to 100
| CA15 | CA7 | total |
default | 100 | 40 | 140 | pack | <1 | 45 | <46 |
The A15 cluster is less power efficient than the A7 cluster but if we assume that the tasks is well spread on both clusters, we can guest estimate that the power consumption on a dual cluster of CA7 would have been for a default kernel:
| CA7 | CA7 | total |
default | 40 | 40 | 80 |
Vincent Guittot (14): Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain sched: pack small tasks sched: pack the idle load balance ARM: sched: clear SD_SHARE_POWERDOMAIN sched: add a knob to choose the packing level sched: agressively pack at wake/fork/exec sched: trig ILB on an idle buddy sched: evaluate the activity level of the system sched: update the buddy CPU sched: filter task pull request sched: create a new field with available capacity sched: update the cpu_power sched: force migration on buddy CPU
arch/arm/kernel/topology.c | 9 + arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 11 +- include/linux/sched/sysctl.h | 8 + include/linux/topology.h | 4 + kernel/sched/core.c | 14 +- kernel/sched/fair.c | 393 +++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 15 +- kernel/sysctl.c | 13 ++ 10 files changed, 423 insertions(+), 46 deletions(-)
-- 1.7.9.5
On 4/25/2013 10:23 AM, Vincent Guittot wrote:
Hi,
This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing the tasks in as few as possible CPU/Cluster/Core. It has got 2 packing modes: -The 1st mode packs the small tasks when the system is not too busy. The main goal is to reduce the power consumption in the low system load use cases by minimizing the number of power domain that are enabled but it also keeps the default behavior which is performance oriented. -The 2nd mode packs all tasks in as few as possible power domains in order to improve the power consumption of the system but at the cost of possible performance decrease because of the increase of the rate of ressources sharing compared to the default mode.
so I got to ask the hard question; what percentage of system level (not just cpu level) power consumption gain can you measure (pick your favorite workload)...
on x86 (even on the low power stuff) I expect this to be very far into the noise (since we have per core power gates, and power transitions are pretty fast)
you have some numbers in the back of your mail, but it's hard for me to get a conclusion out of that (they either measure only cpu power, or are just vague in general)
On 26 April 2013 17:00, Arjan van de Ven arjan@linux.intel.com wrote:
On 4/25/2013 10:23 AM, Vincent Guittot wrote:
Hi,
This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing the tasks in as few as possible CPU/Cluster/Core. It has got 2 packing modes: -The 1st mode packs the small tasks when the system is not too busy. The main goal is to reduce the power consumption in the low system load use cases by minimizing the number of power domain that are enabled but it also keeps the default behavior which is performance oriented. -The 2nd mode packs all tasks in as few as possible power domains in order to improve the power consumption of the system but at the cost of possible performance decrease because of the increase of the rate of ressources sharing compared to the default mode.
so I got to ask the hard question; what percentage of system level (not just cpu level) power consumption gain can you measure (pick your favorite workload)...
I haven't system level figures for my patches but only for the cpu subsystem. If we use the MP3 results in the back of my mail, they show an improvement of 37 % (113/178) for the CPU subsystem of the platform. If we assume that the CPU subsystem contributes 25% of an embedded system power consumption (this can vary across platform depending of the use of HW accelerator but it should be a almost fair percentage), the patch can impact the power consumption on up to 9%.
on x86 (even on the low power stuff) I expect this to be very far into the noise (since we have per core power gates, and power transitions are pretty fast)
you have some numbers in the back of your mail, but it's hard for me to get a conclusion out of that (they either measure only cpu power, or are just vague in general)
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
so I got to ask the hard question; what percentage of system level (not just cpu level) power consumption gain can you measure (pick your favorite workload)...
I haven't system level figures for my patches but only for the cpu subsystem. If we use the MP3 results in the back of my mail, they show an improvement of 37 % (113/178) for the CPU subsystem of the platform. If we assume that the CPU subsystem contributes 25% of an embedded system power consumption (this can vary across platform depending of the use of HW accelerator but it should be a almost fair percentage), the patch can impact the power consumption on up to 9%.
sadly the math tends to not work quite that easy; memory takes significantly more power when the system is not idle than when it is idle for example. [*] so while reducing cpu power by making it run a bit longer (at lower frequency or slower core or whatever) is a pure win if you only look at the cpu, but it may (or may not) be a loss when looking at a whole system level.
I've learned the hard way that you cannot just look at the cpu numbers; you must look at the whole-system power when playing with such tradeoffs.
That does not mean that your patch is not useful; it very well can be, but without having looked at whole-system power that's a very dangerous conclusion to make. So.. if you get a chance, I'd love to see data on a whole-system level... even for just one workload and one system (playing mp3 sounds like a quite reasonable workload for such things indeed)
[*] I assume that on your ARM systems, memory goes into self refresh during system idle just as it does on x86
On 26 April 2013 17:46, Arjan van de Ven arjan@linux.intel.com wrote:
so I got to ask the hard question; what percentage of system level (not just cpu level) power consumption gain can you measure (pick your favorite workload)...
I haven't system level figures for my patches but only for the cpu subsystem. If we use the MP3 results in the back of my mail, they show an improvement of 37 % (113/178) for the CPU subsystem of the platform. If we assume that the CPU subsystem contributes 25% of an embedded system power consumption (this can vary across platform depending of the use of HW accelerator but it should be a almost fair percentage), the patch can impact the power consumption on up to 9%.
sadly the math tends to not work quite that easy; memory takes significantly more power when the system is not idle than when it is idle for example. [*] so while reducing cpu power by making it run a bit longer (at lower frequency or slower core or whatever) is a pure win if you only look at the cpu, but it may (or may not) be a loss when looking at a whole system level.
I agree. That's why the default packing mode of the patches is trying to not increase (too much) the scheduling latency of tasks but only to force tasks to use the same cpu when ever possible in order to minimize the number CPU/cluster that are used
I've learned the hard way that you cannot just look at the cpu numbers; you must look at the whole-system power when playing with such tradeoffs.
That does not mean that your patch is not useful; it very well can be, but without having looked at whole-system power that's a very dangerous conclusion to make. So.. if you get a chance, I'd love to see data on a whole-system level... even for just one workload and one system (playing mp3 sounds like a quite reasonable workload for such things indeed)
I will try to get such figures
Vincent
[*] I assume that on your ARM systems, memory goes into self refresh during system idle just as it does on x86
On 26 April 2013 17:46, Arjan van de Ven arjan@linux.intel.com wrote:
so I got to ask the hard question; what percentage of system level (not just cpu level) power consumption gain can you measure (pick your favorite workload)...
I haven't system level figures for my patches but only for the cpu subsystem. If we use the MP3 results in the back of my mail, they show an improvement of 37 % (113/178) for the CPU subsystem of the platform. If we assume that the CPU subsystem contributes 25% of an embedded system power consumption (this can vary across platform depending of the use of HW accelerator but it should be a almost fair percentage), the patch can impact the power consumption on up to 9%.
sadly the math tends to not work quite that easy; memory takes significantly more power when the system is not idle than when it is idle for example. [*] so while reducing cpu power by making it run a bit longer (at lower frequency or slower core or whatever) is a pure win if you only look at the cpu, but it may (or may not) be a loss when looking at a whole system level.
I've learned the hard way that you cannot just look at the cpu numbers; you must look at the whole-system power when playing with such tradeoffs.
That does not mean that your patch is not useful; it very well can be, but without having looked at whole-system power that's a very dangerous conclusion to make. So.. if you get a chance, I'd love to see data on a whole-system level... even for just one workload and one system (playing mp3 sounds like a quite reasonable workload for such things indeed)
Hi Arjan,
I don't have full system power consumption but i have gathered some additional figures for mp3 use case
The power consumption percentage stay similar | CA15 | CA7 | total | ------------------------------------- default | 81% | 95% | 178% | pack | 10% | 100% | 110% | agressive | 3% | 100% | 103% | -------------------------------------
I have also gathered the time when all CPUs are in low power state so the entire system can be put in low power mode (DDR in self refresh) | Time | nb / sec | average | -------------------------------------------- default | 85.5% | 296 | 2.9ms | pack | 82.5% | 253 | 3.2ms | agressive | 81.8% | 236 | 3.4ms | --------------------------------------------
We can see a decrease of the time spent in low power mode but on the other side, the number of switch to low power mode also decreases which results in the increase the average duration of low power state.
Regards, Vincent
[*] I assume that on your ARM systems, memory goes into self refresh during system idle just as it does on x86
linaro-kernel@lists.linaro.org