Hi,
This patchset takes advantage of the new per-task load tracking that is available in the kernel to pack small tasks onto as few CPUs/Clusters/Cores as possible. The main goal of packing small tasks is to reduce power consumption by minimizing the number of power domains that are enabled. The packing is done in 2 steps:
The 1st step looks for the best place to pack tasks in a system according to its topology, and it defines a pack buddy CPU for each CPU if one is available. The policy for defining a buddy CPU is that we pack at all levels where a group of CPUs can be power gated independently from the others. To describe this capability, a new flag, SD_SHARE_POWERDOMAIN, has been introduced; it is used to indicate whether the groups of CPUs of a scheduling domain share their power state. By default, this flag is set in all sched_domains in order to keep the current behavior of the scheduler unchanged.
In a 2nd step, the scheduler checks the load average of a task which wakes up as well as the load average of the buddy CPU, and can decide to migrate the task to the buddy. This check is done during the wake up because small tasks tend to wake up between load balances and asynchronously to each other, which prevents the default mechanism from catching and migrating them efficiently.
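In code terms, the wake-up side of this (introduced by patch 3/6) boils down to the check below in select_task_rq_fair(); the snippet is shortened here, the helpers are defined in the patch itself:

	/* divert a small waking task to its CPU's pack buddy */
	if (check_pack_buddy(cpu, p))
		return per_cpu(sd_pack_buddy, cpu);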
Changes since V1:

Patch 2/6
 - Change the flag name, which was not clear. The new name is SD_SHARE_POWERDOMAIN.
 - Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
 - Fix issues in the algorithm that looks for the best buddy CPU
 - Use pr_debug instead of pr_info
 - Fix for uniprocessor
Patch 4/6
 - Remove the use of usage_avg_sum which has not been merged
Patch 5/6
 - Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured
Patch 6/6
 - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for the ARM platform
New results for V2:
This series has been tested with MP3 playback on an ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 clusters).
The measurements have been done on an Ubuntu image during 60 seconds of playback and the result has been normalized to 100.
        | CA15 | CA7 | total |
-------------------------------
default |  81  |  97 |  178  |
pack    |  13  | 100 |  113  |
-------------------------------
Previous results for V1:
The patch-set has been tested on ARM platforms: a quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 clusters). For the ARM platforms, the results have demonstrated that it's worth packing small tasks at all topology levels.
The performance tests have been done on both platforms with sysbench. The results don't show any performance regression. These results are consistent with the policy, which keeps the normal behavior for heavy use cases.
test: sysbench --test=cpu --num-threads=N --max-requests=R run
The results below are the average duration of 3 tests on the quad CA-9.
default is the current scheduler behavior (pack buddy CPU is -1)
pack is the scheduler with the pack mechanism
             | default | pack    |
-----------------------------------
N=8;  R=200  |  3.1999 |  3.1921 |
N=8;  R=2000 | 31.4939 | 31.4844 |
N=12; R=200  |  3.2043 |  3.2084 |
N=12; R=2000 | 31.4897 | 31.4831 |
N=16; R=200  |  3.1774 |  3.1824 |
N=16; R=2000 | 31.4899 | 31.4897 |
-----------------------------------
The power consumption tests have been done only on the TC2 platform, which has accessible power lines, and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
The measurements have been done during 16 seconds and the result has been normalized to 100
        | CA15 | CA7 | total |
-------------------------------
default | 100  |  40 |  140  |
pack    |  <1  |  45 |  <46  |
-------------------------------
The A15 cluster is less power efficient than the A7 cluster, but if we assume that the tasks are evenly spread on both clusters, we can roughly estimate what the power consumption would have been for a default kernel on a dual cluster of CA7:
Vincent Guittot (6):
  Revert "sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  sched: add a new SD SHARE_POWERLINE flag for sched_domain
  sched: pack small tasks
  sched: secure access to other CPU statistics
  sched: pack the idle load balance
  ARM: sched: clear SD_SHARE_POWERLINE
 arch/arm/kernel/topology.c       |    9 +++
 arch/ia64/include/asm/topology.h |    1 +
 arch/tile/include/asm/topology.h |    1 +
 include/linux/sched.h            |    9 +--
 include/linux/topology.h         |    4 ++
 kernel/sched/core.c              |   14 ++--
 kernel/sched/fair.c              |  134 +++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h             |   14 ++--
 8 files changed, 163 insertions(+), 23 deletions(-)
This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |    8 +-------
 kernel/sched/core.c   |    7 +------
 kernel/sched/fair.c   |    3 +--
 kernel/sched/sched.h  |    9 +--------
 4 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fd17ca3..046e39a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8482628..c25c75d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1544,12 +1544,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c7a10..9916d41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2820,8 +2820,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f00eb80..92ba891 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,12 +226,6 @@ struct cfs_rq {
 #endif

 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * CFS Load tracking
  * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -241,8 +235,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
This new flag, SD_SHARE_POWERDOMAIN, is used to reflect whether the groups of CPUs at a sched_domain level can reach a different power state or not. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks, so the flag is set in all sched_domains.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/ia64/include/asm/topology.h |    1 +
 arch/tile/include/asm/topology.h |    1 +
 include/linux/sched.h            |    1 +
 include/linux/topology.h         |    4 ++++
 kernel/sched/core.c              |    6 ++++++
 5 files changed, 13 insertions(+)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..6d0b617 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -65,6 +65,7 @@ void build_cpu_to_node_map(void);
 				| SD_BALANCE_EXEC	\
 				| SD_BALANCE_FORK	\
 				| SD_WAKE_AFFINE,	\
+				| arch_sd_local_flags(0)\
 	.last_balance		= jiffies,	\
 	.balance_interval	= 1,		\
 	.nr_balance_failed	= 0,		\
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d5e86c9..adc8710 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
 				| 0*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(0)		\
 				| 0*SD_SERIALIZE			\
 				,					\
 	.last_balance		= jiffies,				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 046e39a..3287be1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -844,6 +844,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..3eab293 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,6 +99,8 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(SD_SHARE_CPUPOWER|\
+						SD_SHARE_PKG_RESOURCES)	\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
@@ -131,6 +133,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\
 				| 0*SD_SERIALIZE			\
 				,					\
 	.last_balance		= jiffies,				\
@@ -161,6 +164,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(0)		\
 				| 0*SD_SERIALIZE			\
 				| 1*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c25c75d..4f36e9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5969,6 +5969,11 @@ int __weak arch_sd_sibling_asym_packing(void)
 	return 0*SD_ASYM_PACKING;
 }

+int __weak arch_sd_local_flags(int level)
+{
+	return 1*SD_SHARE_POWERDOMAIN;
+}
+
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6209,6 +6214,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
+					| 1*SD_SHARE_POWERDOMAIN
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
 					| 1*SD_NUMA
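For illustration only (the real ARM implementation is in patch 6/6, which is not quoted here), an architecture whose cores and clusters can be power gated independently might override the weak default along these lines; this is a sketch, not the posted code:

	/* Hypothetical arch override -- sketch only, not patch 6/6. */
	int arch_sd_local_flags(int level)
	{
		/*
		 * Keep the flag at levels whose CPUs cannot be power gated
		 * separately (e.g. HW threads sharing a core), clear it at
		 * levels where groups can be power gated independently.
		 */
		if (level & SD_SHARE_CPUPOWER)
			return SD_SHARE_POWERDOMAIN;

		return 0;
	}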
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
This new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPU in a sched_domain level can or not reach a different power state. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gated the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks so the flag is set into all sched_domains
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 4 ++++ kernel/sched/core.c | 6 ++++++ 5 files changed, 13 insertions(+)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \
+				| arch_sd_local_flags(0)\
 	.last_balance		= jiffies,	\
 	.balance_interval	= 1,		\
 	.nr_balance_failed	= 0,		\
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc8710 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
+				| arch_sd_local_flags(0)	\
 				| 0*SD_SERIALIZE		\
 				,				\
 	.last_balance		= jiffies,	\
diff --git a/include/linux/sched.h b/include/linux/sched.h index 046e39a..3287be1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \
| arch_sd_local_flags(SD_SHARE_CPUPOWER|\
SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing() \
@@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \
+				| arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\
 				| 0*SD_SERIALIZE		\
 				,				\
 	.last_balance		= jiffies,	\
@@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
| arch_sd_local_flags(0) \
The general style here seems to prefer using the SD_XXX flags directly.
On 13 December 2012 03:24, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
This new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPU in a sched_domain level can or not reach a different power state. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gated the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks so the flag is set into all sched_domains
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 4 ++++ kernel/sched/core.c | 6 ++++++ 5 files changed, 13 insertions(+)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \
| arch_sd_local_flags(0)\ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc8710 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
| arch_sd_local_flags(0) \ | 0*SD_SERIALIZE \ , \ .last_balance = jiffies, \
diff --git a/include/linux/sched.h b/include/linux/sched.h index 046e39a..3287be1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \
| arch_sd_local_flags(SD_SHARE_CPUPOWER|\
SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing() \
@@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \
| arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\ | 0*SD_SERIALIZE \ , \ .last_balance = jiffies, \
@@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
| arch_sd_local_flags(0) \
The general style here seems to prefer using the SD_XXX flags directly.
In fact, I have followed the same style as arch_sd_sibling_asym_packing or sd_local_flags, which conditionally set flags for a sched_domain.
During the creation of a sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from the others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:
      |  Cluster 0  |  Cluster 1  |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance, so the best place to migrate them is during their wake up. The decision is in O(1) as we only check against one buddy CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |  110 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    5 +++
 3 files changed, 116 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f36e9d..3436aad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);

+	update_packing_domain(cpu);
 	update_domain_cache(cpu);
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9916d41..fc93d96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -163,6 +163,73 @@ void sched_init_granularity(void)
 	update_sysctl();
 }

+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN & SD_LOAD_BALANCE);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags && SD_LOAD_BALANCE)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points on the local sched_group
+		 * and the 1st CPU of this local group is a good candidate
+		 */
+		id = cpumask_first(sched_group_cpus(pack));
+
+		/* loop the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+			 && (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
@@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp
 	return true;
 }

+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return ((rq->avg.runnable_avg_sum << rq->nr_running) >
+			rq->avg.runnable_avg_period);
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 25% in average */
+	return ((p->se.avg.runnable_avg_sum << 1) <
+			p->se.avg.runnable_avg_period);
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+	int buddy = per_cpu(sd_pack_buddy, cpu);
+
+	/* No pack buddy for this CPU */
+	if (buddy == -1)
+		return false;
+
+	/* buddy is not an allowed CPU */
+	if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+		return false;
+
+	/*
+	 * If the task is a small one and the buddy is not overloaded,
+	 * we use buddy cpu
+	 */
+	if (!is_light_task(p) || is_buddy_busy(buddy))
+		return false;
+
+	return true;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
@@ -5120,6 +5227,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return p->ideal_cpu;
 #endif

+	if (check_pack_buddy(cpu, p))
+		return per_cpu(sd_pack_buddy, cpu);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 92ba891..3802fc4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -892,6 +892,7 @@ extern const struct sched_class idle_sched_class;

 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void update_packing_domain(int cpu);

 #else	/* CONFIG_SMP */

@@ -899,6 +900,10 @@ static inline void idle_balance(int cpu, struct rq *rq)
 {
 }

+static inline void update_packing_domain(int cpu)
+{
+}
+
 #endif
extern void sysrq_sched_debug_show(void);
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
      |  Cluster 0  |  Cluster 1  |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
I have a little worry about the scalability on a big machine: on a 4-socket NUMA machine * 8 cores * HT, the buddy CPU for the whole system needs to care for 64 LCPUs, while in your case CPU0 just cares for 4 LCPUs. That makes for a different task distribution decision.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 1 + kernel/sched/fair.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 5 +++ 3 files changed, 116 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4f36e9d..3436aad 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
- update_packing_domain(cpu); update_domain_cache(cpu);
} diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9916d41..fc93d96 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -163,6 +163,73 @@ void sched_init_granularity(void) update_sysctl(); }
+#ifdef CONFIG_SMP +/*
- Save the id of the optimal CPU that should be used to pack small tasks
- The value -1 is used when no buddy has been found
- */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+/* Look for the best buddy CPU that can be used to pack small tasks
- We make the assumption that it doesn't wort to pack on CPU that share the
- same powerline. We looks for the 1st sched_domain without the
- SD_SHARE_POWERDOMAIN flag. Then We look for the sched_group witht the lowest
- power per core based on the assumption that their power efficiency is
- better */
+void update_packing_domain(int cpu) +{
- struct sched_domain *sd;
- int id = -1;
- sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN & SD_LOAD_BALANCE);
- if (!sd)
sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
- else
sd = sd->parent;
- while (sd && (sd->flags && SD_LOAD_BALANCE)) {
struct sched_group *sg = sd->groups;
struct sched_group *pack = sg;
struct sched_group *tmp;
/*
* The sched_domain of a CPU points on the local sched_group
* and the 1st CPU of this local group is a good candidate
*/
id = cpumask_first(sched_group_cpus(pack));
/* loop the sched groups to find the best one */
for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
if (tmp->sgp->power * pack->group_weight >
pack->sgp->power * tmp->group_weight)
continue;
if ((tmp->sgp->power * pack->group_weight ==
pack->sgp->power * tmp->group_weight)
&& (cpumask_first(sched_group_cpus(tmp)) >= id))
continue;
/* we have found a better group */
pack = tmp;
/* Take the 1st CPU of the new group */
id = cpumask_first(sched_group_cpus(pack));
}
/* Look for another CPU than itself */
if (id != cpu)
break;
sd = sd->parent;
- }
- pr_debug("CPU%d packing on CPU%d\n", cpu, id);
- per_cpu(sd_pack_buddy, cpu) = id;
+}
+#endif /* CONFIG_SMP */
#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp return true; } +static bool is_buddy_busy(int cpu) +{
- struct rq *rq = cpu_rq(cpu);
- /*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
- return ((rq->avg.runnable_avg_sum << rq->nr_running) >
If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running overflows, and you will get the wrong decision.
rq->avg.runnable_avg_period);
+}
+static bool is_light_task(struct task_struct *p) +{
- /* A light task runs less than 25% in average */
- return ((p->se.avg.runnable_avg_sum << 1) <
p->se.avg.runnable_avg_period);
25% may not be suitable for a big machine.
+}
+static int check_pack_buddy(int cpu, struct task_struct *p) +{
- int buddy = per_cpu(sd_pack_buddy, cpu);
- /* No pack buddy for this CPU */
- if (buddy == -1)
return false;
- /* buddy is not an allowed CPU */
- if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
return false;
- /*
* If the task is a small one and the buddy is not overloaded,
* we use buddy cpu
*/
- if (!is_light_task(p) || is_buddy_busy(buddy))
return false;
- return true;
+} /*
- sched_balance_self: balance the current task (running on cpu) in domains
@@ -5120,6 +5227,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) return p->ideal_cpu; #endif
- if (check_pack_buddy(cpu, p))
return per_cpu(sd_pack_buddy, cpu);
- if (sd_flag & SD_BALANCE_WAKE) { if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 92ba891..3802fc4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -892,6 +892,7 @@ extern const struct sched_class idle_sched_class; extern void trigger_load_balance(struct rq *rq, int cpu); extern void idle_balance(int this_cpu, struct rq *this_rq); +extern void update_packing_domain(int cpu); #else /* CONFIG_SMP */ @@ -899,6 +900,10 @@ static inline void idle_balance(int cpu, struct rq *rq) { } +static inline void update_packing_domain(int cpu) +{ +}
#endif extern void sysrq_sched_debug_show(void);
On 12/13/2012 10:17 AM, Alex Shi wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
In the big machine example above, only one buddy CPU is not sufficient at each level. At the 4-socket level, for example, the tasks may just fill 2 sockets, so we should use just 2 sockets, which is more performance/power efficient. But a single buddy CPU here would need to spread tasks across all 4 sockets.
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPUs; it depends on where it's worth packing small tasks.
Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 1 + kernel/sched/fair.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 5 +++ 3 files changed, 116 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4f36e9d..3436aad 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
update_packing_domain(cpu); update_domain_cache(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9916d41..fc93d96 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -163,6 +163,73 @@ void sched_init_granularity(void) update_sysctl(); }
+#ifdef CONFIG_SMP +/*
- Save the id of the optimal CPU that should be used to pack small tasks
- The value -1 is used when no buddy has been found
- */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+/* Look for the best buddy CPU that can be used to pack small tasks
- We make the assumption that it doesn't wort to pack on CPU that share the
- same powerline. We looks for the 1st sched_domain without the
- SD_SHARE_POWERDOMAIN flag. Then We look for the sched_group witht the lowest
- power per core based on the assumption that their power efficiency is
- better */
+void update_packing_domain(int cpu) +{
struct sched_domain *sd;
int id = -1;
sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN & SD_LOAD_BALANCE);
if (!sd)
sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
else
sd = sd->parent;
while (sd && (sd->flags && SD_LOAD_BALANCE)) {
struct sched_group *sg = sd->groups;
struct sched_group *pack = sg;
struct sched_group *tmp;
/*
* The sched_domain of a CPU points on the local sched_group
* and the 1st CPU of this local group is a good candidate
*/
id = cpumask_first(sched_group_cpus(pack));
/* loop the sched groups to find the best one */
for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
if (tmp->sgp->power * pack->group_weight >
pack->sgp->power * tmp->group_weight)
continue;
if ((tmp->sgp->power * pack->group_weight ==
pack->sgp->power * tmp->group_weight)
&& (cpumask_first(sched_group_cpus(tmp)) >= id))
continue;
/* we have found a better group */
pack = tmp;
/* Take the 1st CPU of the new group */
id = cpumask_first(sched_group_cpus(pack));
}
/* Look for another CPU than itself */
if (id != cpu)
break;
sd = sd->parent;
}
pr_debug("CPU%d packing on CPU%d\n", cpu, id);
per_cpu(sd_pack_buddy, cpu) = id;
+}
+#endif /* CONFIG_SMP */
#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp return true; }
+static bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return ((rq->avg.runnable_avg_sum << rq->nr_running) >
If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running overflows, and you will get the wrong decision.
Yes, I'm going to do it like below instead:

	return (rq->avg.runnable_avg_sum >
			(rq->avg.runnable_avg_period >> rq->nr_running));
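Spelled out in the helper quoted above, the reworked test would then look something like this (a sketch of the change described in the reply, keeping the same rq->avg fields):

	static bool is_buddy_busy(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/*
		 * Shift the period right instead of shifting the sum left,
		 * so a large nr_running can no longer overflow the compare.
		 */
		return (rq->avg.runnable_avg_sum >
				(rq->avg.runnable_avg_period >> rq->nr_running));
	}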
rq->avg.runnable_avg_period);
+}
+static bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.runnable_avg_sum << 1) <
p->se.avg.runnable_avg_period);
25% may not be suitable for a big machine.
Thresholds are always an issue; which threshold would be suitable for a big machine?
I'm wondering if I should use the imbalance_pct value for computing the threshold.
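For instance, a variant deriving the threshold from a domain's imbalance_pct (a sketch only, not part of the posted series; which sched_domain to take the value from is left open) could look like:

	/* Sketch: "light" threshold derived from imbalance_pct (e.g. 125 -> 25%) */
	static bool is_light_task(struct task_struct *p, struct sched_domain *sd)
	{
		return (p->se.avg.runnable_avg_sum * 100 <
			p->se.avg.runnable_avg_period * (sd->imbalance_pct - 100));
	}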
+}
+static int check_pack_buddy(int cpu, struct task_struct *p) +{
int buddy = per_cpu(sd_pack_buddy, cpu);
/* No pack buddy for this CPU */
if (buddy == -1)
return false;
/* buddy is not an allowed CPU */
if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
return false;
/*
* If the task is a small one and the buddy is not overloaded,
* we use buddy cpu
*/
if (!is_light_task(p) || is_buddy_busy(buddy))
return false;
return true;
+}
/*
- sched_balance_self: balance the current task (running on cpu) in domains
@@ -5120,6 +5227,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) return p->ideal_cpu; #endif
if (check_pack_buddy(cpu, p))
return per_cpu(sd_pack_buddy, cpu);
if (sd_flag & SD_BALANCE_WAKE) { if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 92ba891..3802fc4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -892,6 +892,7 @@ extern const struct sched_class idle_sched_class;
extern void trigger_load_balance(struct rq *rq, int cpu); extern void idle_balance(int this_cpu, struct rq *this_rq); +extern void update_packing_domain(int cpu);
#else /* CONFIG_SMP */
@@ -899,6 +900,10 @@ static inline void idle_balance(int cpu, struct rq *rq) { }
+static inline void update_packing_domain(int cpu) +{ +}
#endif
extern void sysrq_sched_debug_show(void);
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks
Do you have further ideas for the buddy CPU in such an example?
Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
It is the general x86 domain configuration, with 4 levels: sibling/core/cpu/numa.
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks
Do you have further ideas for the buddy CPU in such an example?
Yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.
We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly on one LCPU.
The scheme below tries to summarize the idea:
Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0 |  1-15   | 16 |  17-31  | 32 |  33-47  | 48 |  49-63  |
buddy conf0 |  0 |   0     |  1 |   16    |  2 |   32    |  3 |   48    |
buddy conf1 |  0 |   0     |  0 |   16    | 16 |   32    | 32 |   48    |
buddy conf2 |  0 |   0     | 16 |   16    | 32 |   32    | 48 |   48    |
But I don't know how this can interact with the NUMA load balance, and the best might be to use conf3.
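As an illustration only (not part of the posted series), the conf1 and conf2 buddy choices above could be computed for this 4-socket, 16-LCPU-per-socket example with something like:

	/* Sketches for the 4 sockets x 16 LCPUs example above. */
	static int conf2_buddy(int cpu)
	{
		/* pack inside the socket: first LCPU of the local socket */
		return (cpu / 16) * 16;
	}

	static int conf1_buddy(int cpu)
	{
		int first = (cpu / 16) * 16;

		/*
		 * Chain the sockets: the first LCPU of socket N packs on the
		 * first LCPU of socket N-1, so small tasks drain toward CPU0.
		 */
		if (cpu == first && first > 0)
			return first - 16;
		return first;
	}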
Which kind of sched_domain configuration have you for such system ? and how many sched_domain level have you ?
it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
On 13 December 2012 15:53, Vincent Guittot vincent.guittot@linaro.org wrote:
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks
Do you have further ideas for buddy cpu on such example?
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system
We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU
The scheme below tries to summaries the idea:
Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0 |  1-15   | 16 |  17-31  | 32 |  33-47  | 48 |  49-63  |
buddy conf0 |  0 |   0     |  1 |   16    |  2 |   32    |  3 |   48    |
buddy conf1 |  0 |   0     |  0 |   16    | 16 |   32    | 32 |   48    |
buddy conf2 |  0 |   0     | 16 |   16    | 32 |   32    | 48 |   48    |
But, I don't know how this can interact with NUMA load balance and the better might be to use conf3.
I mean conf2 not conf3
Which kind of sched_domain configuration have you for such system ? and how many sched_domain level have you ?
it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
On 13 December 2012 15:53, Vincent Guittot vincent.guittot@linaro.org wrote:
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be :
| Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks
Do you have further ideas for buddy cpu on such example?
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system
We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU
The scheme below tries to summaries the idea:
Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0 |  1-15   | 16 |  17-31  | 32 |  33-47  | 48 |  49-63  |
buddy conf0 |  0 |   0     |  1 |   16    |  2 |   32    |  3 |   48    |
buddy conf1 |  0 |   0     |  0 |   16    | 16 |   32    | 32 |   48    |
buddy conf2 |  0 |   0     | 16 |   16    | 32 |   32    | 48 |   48    |
But, I don't know how this can interact with NUMA load balance and the better might be to use conf3.
I mean conf2 not conf3
So, it has 4 levels (0/16/32/...) for socket 3 and 0 levels for socket 0; it is unbalanced across the different sockets.
And the ground level has just one buddy for 16 LCPUs (8 cores); that's not a good design. Consider my previous example: if there are 4 or 8 tasks in one socket, you have just 2 choices: spread them across all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.
Obviously, more and more cores is the trend for any kind of CPU; the buddy system seems hard-pressed to catch up with this.
On 14 December 2012 02:46, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
On 13 December 2012 15:53, Vincent Guittot vincent.guittot@linaro.org wrote:
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote: > During the creation of sched_domain, we define a pack buddy CPU for each CPU > when one is available. We want to pack at all levels where a group of CPU can > be power gated independently from others. > On a system that can't power gate a group of CPUs independently, the flag is > set at all sched_domain level and the buddy is set to -1. This is the default > behavior. > On a dual clusters / dual cores system which can power gate each core and > cluster independently, the buddy configuration will be : > > | Cluster 0 | Cluster 1 | > | CPU0 | CPU1 | CPU2 | CPU3 | > ----------------------------------- > buddy | CPU0 | CPU0 | CPU0 | CPU2 | > > Small tasks tend to slip out of the periodic load balance so the best place > to choose to migrate them is during their wake up. The decision is in O(1) as > we only check again one buddy CPU
Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks
Do you have further ideas for buddy cpu on such example?
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system
We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU
The scheme below tries to summaries the idea:
Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0 |  1-15   | 16 |  17-31  | 32 |  33-47  | 48 |  49-63  |
buddy conf0 |  0 |   0     |  1 |   16    |  2 |   32    |  3 |   48    |
buddy conf1 |  0 |   0     |  0 |   16    | 16 |   32    | 32 |   48    |
buddy conf2 |  0 |   0     | 16 |   16    | 32 |   32    | 48 |   48    |
But, I don't know how this can interact with NUMA load balance and the better might be to use conf3.
I mean conf2 not conf3
So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it is unbalanced for different socket.
That's the target, because we have decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domains or sched_groups anymore to find the best LCPU when a small task wakes up.
And the ground level has just one buddy for 16 LCPUs - 8 cores, that's not a good design, consider my previous examples: if there are 4 or 8 tasks in one socket, you just has 2 choices: spread them into all cores, or pack them into one LCPU. Actually, moving them just into 2 or 4 cores maybe a better solution. but the design missed this.
You speak about tasks without any notion of load. This patch only cares about small tasks and lightly loaded LCPUs, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and on the LCPU they come from).
Then, if too many small tasks wake up simultaneously on the same LCPU, the default load balance will spread them across the core/cluster/socket.
Obviously, more and more cores is the trend on any kinds of CPU, the buddy system seems hard to catch up this.
On 12/14/2012 05:33 PM, Vincent Guittot wrote:
On 14 December 2012 02:46, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
On 13 December 2012 15:53, Vincent Guittot vincent.guittot@linaro.org wrote:
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote: > On 12/12/2012 09:31 PM, Vincent Guittot wrote: >> During the creation of sched_domain, we define a pack buddy CPU for each CPU >> when one is available. We want to pack at all levels where a group of CPU can >> be power gated independently from others. >> On a system that can't power gate a group of CPUs independently, the flag is >> set at all sched_domain level and the buddy is set to -1. This is the default >> behavior. >> On a dual clusters / dual cores system which can power gate each core and >> cluster independently, the buddy configuration will be : >> >> | Cluster 0 | Cluster 1 | >> | CPU0 | CPU1 | CPU2 | CPU3 | >> ----------------------------------- >> buddy | CPU0 | CPU0 | CPU0 | CPU2 | >> >> Small tasks tend to slip out of the periodic load balance so the best place >> to choose to migrate them is during their wake up. The decision is in O(1) as >> we only check again one buddy CPU > > Just have a little worry about the scalability on a big machine, like on > a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole > system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That > is different on task distribution decision.
The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks
Do you have further ideas for buddy cpu on such example?
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system
We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU
The scheme below tries to summaries the idea:
Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0 |  1-15   | 16 |  17-31  | 32 |  33-47  | 48 |  49-63  |
buddy conf0 |  0 |   0     |  1 |   16    |  2 |   32    |  3 |   48    |
buddy conf1 |  0 |   0     |  0 |   16    | 16 |   32    | 32 |   48    |
buddy conf2 |  0 |   0     | 16 |   16    | 32 |   32    | 48 |   48    |
But, I don't know how this can interact with NUMA load balance and the better might be to use conf3.
I mean conf2 not conf3
So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it is unbalanced for different socket.
That the target because we have decided to pack the small tasks in socket 0 when we have parsed the topology at boot. We don't have to loop into sched_domain or sched_group anymore to find the best LCPU when a small tasks wake up.
Iterating over domains and groups is an advantage for the power-efficiency requirement, not a shortcoming. If some CPUs are already idle before forking, letting the waking CPU check their load/utilization and then decide which CPU is best can reduce late migrations; that saves both performance and power.
On the contrary, moving a task by walking the buddies at each level is not only bad for performance but also bad for power. Consider the quite large latency of waking a deeply idle CPU; we lose too much.
And the ground level has just one buddy for 16 LCPUs - 8 cores, that's not a good design, consider my previous examples: if there are 4 or 8 tasks in one socket, you just has 2 choices: spread them into all cores, or pack them into one LCPU. Actually, moving them just into 2 or 4 cores maybe a better solution. but the design missed this.
You speak about tasks without any notion of load. This patch only care of small tasks and light LCPU load, but it falls back to default behavior for other situation. So if there are 4 or 8 small tasks, they will migrate to the socket 0 after 1 or up to 3 migration (it depends of the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the utilization of the CPU, not the load weight of tasks, right?
Yes, I just talked about task numbers, but it naturally extends to the task utilization of the CPU. For example, 8 tasks with 25% utilization can just fill 2 CPUs, but that is clearly beyond the capacity of the buddy, so you need to wake up another CPU socket while the local socket still has some LCPUs idle...
Then, if too much small tasks wake up simultaneously on the same LCPU, the default load balance will spread them in the core/cluster/socket
Obviously, more and more cores is the trend on any kinds of CPU, the buddy system seems hard to catch up this.
On 16 December 2012 08:12, Alex Shi alex.shi@intel.com wrote:
On 12/14/2012 05:33 PM, Vincent Guittot wrote:
On 14 December 2012 02:46, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
On 13 December 2012 15:53, Vincent Guittot vincent.guittot@linaro.org wrote:
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote: > On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote: >> On 12/12/2012 09:31 PM, Vincent Guittot wrote: >>> During the creation of sched_domain, we define a pack buddy CPU for each CPU >>> when one is available. We want to pack at all levels where a group of CPU can >>> be power gated independently from others. >>> On a system that can't power gate a group of CPUs independently, the flag is >>> set at all sched_domain level and the buddy is set to -1. This is the default >>> behavior. >>> On a dual clusters / dual cores system which can power gate each core and >>> cluster independently, the buddy configuration will be : >>> >>> | Cluster 0 | Cluster 1 | >>> | CPU0 | CPU1 | CPU2 | CPU3 | >>> ----------------------------------- >>> buddy | CPU0 | CPU0 | CPU0 | CPU2 | >>> >>> Small tasks tend to slip out of the periodic load balance so the best place >>> to choose to migrate them is during their wake up. The decision is in O(1) as >>> we only check again one buddy CPU >> >> Just have a little worry about the scalability on a big machine, like on >> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole >> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That >> is different on task distribution decision. > > The buddy CPU should probably not be the same for all 64 LCPU it > depends on where it's worth packing small tasks
Do you have further ideas for buddy cpu on such example?
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system
We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU
The scheme below tries to summaries the idea:
Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0 |  1-15   | 16 |  17-31  | 32 |  33-47  | 48 |  49-63  |
buddy conf0 |  0 |   0     |  1 |   16    |  2 |   32    |  3 |   48    |
buddy conf1 |  0 |   0     |  0 |   16    | 16 |   32    | 32 |   48    |
buddy conf2 |  0 |   0     | 16 |   16    | 32 |   32    | 48 |   48    |
But, I don't know how this can interact with NUMA load balance and the better might be to use conf3.
I mean conf2 not conf3
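As a rough illustration only (not part of the patches), the three configurations above could be expressed for this 64-LCPU example roughly like the sketch below; pick_buddy() and the CONF* names are hypothetical.

	enum buddy_conf { CONF0, CONF1, CONF2 };

	/* 4 sockets * 16 LCPUs, matching the table above */
	static int pick_buddy(int cpu, enum buddy_conf conf)
	{
		int socket = cpu / 16;
		int first = socket * 16;	/* first LCPU of the socket */

		if (cpu != first)
			return first;		/* pack inside the socket first */

		switch (conf) {
		case CONF0:	/* socket leaders all pack onto LCPUs of socket 0 */
			return socket;
		case CONF1:	/* chain the sockets: 48 -> 32 -> 16 -> 0 */
			return socket ? (socket - 1) * 16 : 0;
		case CONF2:	/* each socket keeps packing onto its own leader */
		default:
			return first;
		}
	}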
So it has 4 levels (0/16/32/...) for socket 3 and 0 levels for socket 0; it is unbalanced between the sockets.
That's the target, because we decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.
Iterating over domains and groups is an advantage for power-efficiency purposes, not a shortcoming. If some CPUs are already idle before forking, letting the waking CPU check their load/util and then decide which one is the best CPU can reduce late migrations; that saves both performance and power.
In fact, we have already done this job once at boot, and we consider that moving small tasks onto the buddy CPU is always a benefit, so we don't need to waste time looping over sched_domain and sched_group to compute the current capacity of each LCPU at every wake up of every small task. We want all small tasks and background activity to wake up on the same buddy CPU, and let the default behavior of the scheduler choose the best CPU for heavy tasks or loaded CPUs.
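Roughly, the O(1) decision described here looks like the sketch below: the buddy is read from a per-cpu variable filled at boot, and only that one CPU is examined. The helpers per_cpu(sd_pack_buddy, ...), is_light_task() and is_buddy_busy() appear in the quoted patches; the function name and body here are the editor's simplification, not the exact code of the series.

	static bool check_pack_buddy(int cpu, struct task_struct *p)
	{
		int buddy = per_cpu(sd_pack_buddy, cpu);

		/* -1 means no buddy: the platform can't power gate CPU groups */
		if (buddy == -1)
			return false;

		/* heavy tasks and busy buddies fall back to the default path */
		if (!is_light_task(p) || is_buddy_busy(buddy))
			return false;

		return true;
	}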
On the contrary, moving a task by walking the buddies at each level is not only bad for performance but also bad for power. Consider the quite large latency of waking a CPU from deep idle; we lose too much.
My results have shown a different conclusion. In fact, there is a much higher chance that the buddy will not be in deep idle, as all the small tasks and background activity are already waking up on this CPU.
And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you have only 2 choices: spread them across all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.
You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in any other situation. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the utilization of the cpu, not the load weight of the tasks, right?
Yes, but not only that: the number of tasks that run simultaneously is another important input.
Yes, I just talked about task numbers, but it naturally extends to the task utilization of a cpu: e.g. 8 tasks with 25% util can fully fill 2 CPUs, but they are clearly beyond the capacity of the buddy, so you need to wake up another CPU socket while the local socket still has some idle LCPUs...
8 tasks with a running period of 25ms per 100ms that wake up simultaneously should probably run on 8 different LCPUs in order to race to idle.
Regards, Vincent
Then, if too many small tasks wake up simultaneously on the same LCPU, the default load balance will spread them across the core/cluster/socket.
Obviously, more and more cores is the trend for every kind of CPU, and the buddy system seems hard-pressed to keep up with that.
-- Thanks Alex
The scheme below tries to summarize the idea:
Socket      | socket 0   | socket 1    | socket 2    | socket 3    |
LCPU        | 0 | 1-15   | 16 | 17-31  | 32 | 33-47  | 48 | 49-63  |
buddy conf0 | 0 | 0      | 1  | 16     | 2  | 32     | 3  | 48     |
buddy conf1 | 0 | 0      | 0  | 16     | 16 | 32     | 32 | 48     |
buddy conf2 | 0 | 0      | 16 | 16     | 32 | 32     | 48 | 48     |
But I don't know how this can interact with the NUMA load balance, and the better choice might be to use conf3.
I mean conf2 not conf3
So it has 4 levels (0/16/32/...) for socket 3 and 0 levels for socket 0; it is unbalanced between the sockets.
That's the target, because we decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.
Iterating over domains and groups is an advantage for power-efficiency purposes, not a shortcoming. If some CPUs are already idle before forking, letting the waking CPU check their load/util and then decide which one is the best CPU can reduce late migrations; that saves both performance and power.
In fact, we have already done this job once at boot, and we consider that moving small tasks onto the buddy CPU is always a benefit, so we don't need to waste time looping over sched_domain and sched_group to compute the current capacity of each LCPU at every wake up of every small task. We want all small tasks and background activity to wake up on the same buddy CPU, and let the default behavior of the scheduler choose the best CPU for heavy tasks or loaded CPUs.
IMHO, the design should be very good for your scenario and your machine, but when the code moves into the general scheduler, we want it to handle more general scenarios. Sometimes the 'small task' is not as small as the tasks in cyclictest, which can hardly run longer than the migration granularity or one tick, so we really don't need to consider the task migration cost. But when the tasks are not that small, migration is heavier than domain/group walking; that is common sense in fork/exec/wake balancing.
On the contrary, moving a task by walking the buddies at each level is not only bad for performance but also bad for power. Consider the quite large latency of waking a CPU from deep idle; we lose too much.
My results have shown a different conclusion.
That should be because your tasks are too small for the migration cost to matter.
In fact, there is a much higher chance that the buddy will not be in deep idle, as all the small tasks and background activity are already waking up on this CPU.
powertop is helpful for tuning your system for more idle time. Another reason is that the current kernel just tries to spread tasks over more cpus for performance reasons. My power scheduling patch should help with this.
And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you have only 2 choices: spread them across all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.
You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in any other situation. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the utilization of the cpu, not the load weight of the tasks, right?
Yes, but not only that: the number of tasks that run simultaneously is another important input.
Yes, I just talked about task numbers, but it naturally extends to the task utilization of a cpu: e.g. 8 tasks with 25% util can fully fill 2 CPUs, but they are clearly beyond the capacity of the buddy, so you need to wake up another CPU socket while the local socket still has some idle LCPUs...
8 tasks with a running period of 25ms per 100ms that wake up simultaneously should probably run on 8 different LCPUs in order to race to idle.
Nope, there is only a rare probability of 8 tasks waking up simultaneously. And even so, they should run in the same socket for power-saving reasons (my power scheduling patch can do this), instead of being spread over all the sockets.
Regards, Vincent
Then, if too many small tasks wake up simultaneously on the same LCPU, the default load balance will spread them across the core/cluster/socket.
Obviously, more and more cores is the trend for every kind of CPU, and the buddy system seems hard-pressed to keep up with that.
-- Thanks Alex
On 17 December 2012 16:24, Alex Shi alex.shi@intel.com wrote:
> The scheme below tries to summaries the idea: > > Socket | socket 0 | socket 1 | socket 2 | socket 3 | > LCPU | 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 | > buddy conf0 | 0 | 0 | 1 | 16 | 2 | 32 | 3 | 48 | > buddy conf1 | 0 | 0 | 0 | 16 | 16 | 32 | 32 | 48 | > buddy conf2 | 0 | 0 | 16 | 16 | 32 | 32 | 48 | 48 | > > But, I don't know how this can interact with NUMA load balance and the > better might be to use conf3.
I mean conf2 not conf3
So it has 4 levels (0/16/32/...) for socket 3 and 0 levels for socket 0; it is unbalanced between the sockets.
That's the target, because we decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.
Iterating over domains and groups is an advantage for power-efficiency purposes, not a shortcoming. If some CPUs are already idle before forking, letting the waking CPU check their load/util and then decide which one is the best CPU can reduce late migrations; that saves both performance and power.
In fact, we have already done this job once at boot, and we consider that moving small tasks onto the buddy CPU is always a benefit, so we don't need to waste time looping over sched_domain and sched_group to compute the current capacity of each LCPU at every wake up of every small task. We want all small tasks and background activity to wake up on the same buddy CPU, and let the default behavior of the scheduler choose the best CPU for heavy tasks or loaded CPUs.
IMHO, the design should be very good for your scenario and your machine, but when the code moves into the general scheduler, we want it to handle more general scenarios. Sometimes the 'small task' is not as small as the tasks in cyclictest, which can hardly run longer than the migration
Cyclictest is the ultimate small-task use case, which points out all the weaknesses of a scheduler for this kind of task. Music playback is a more realistic one, and it also shows an improvement.
granularity or one tick, so we really don't need to consider the task migration cost. But when the tasks are not that small, migration is
For which kind of machine are you stating that hypothesis?
heavier than domain/group walking; that is common sense in fork/exec/wake balancing.
I would have said the opposite: the current scheduler limits its computation of statistics during fork/exec/wake compared to the periodic load balance, because it's too heavy. That is even truer for wake up when wake affine is possible.
On the contrary, moving a task by walking the buddies at each level is not only bad for performance but also bad for power. Consider the quite large latency of waking a CPU from deep idle; we lose too much.
My results have shown a different conclusion.
That should be because your tasks are too small for the migration cost to matter.
In fact, there is a much higher chance that the buddy will not be in deep idle, as all the small tasks and background activity are already waking up on this CPU.
powertop is helpful for tuning your system for more idle time. Another reason is that the current kernel just tries to spread tasks over more cpus for performance reasons. My power scheduling patch should help with this.
And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you have only 2 choices: spread them across all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.
You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in any other situation. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the utilization of the cpu, not the load weight of the tasks, right?
Yes, but not only that: the number of tasks that run simultaneously is another important input.
Yes, I just talked about task numbers, but it naturally extends to the task utilization of a cpu: e.g. 8 tasks with 25% util can fully fill 2 CPUs, but they are clearly beyond the capacity of the buddy, so you need to wake up another CPU socket while the local socket still has some idle LCPUs...
8 tasks with a running period of 25ms per 100ms that wake up simultaneously should probably run on 8 different LCPUs in order to race to idle.
Nope, there is only a rare probability of 8 tasks waking up simultaneously. And
Multimedia is one example of tasks waking up simultaneously
even so, they should run in the same socket for power-saving reasons (my power scheduling patch can do this), instead of being spread over all the sockets.
This may be good for your scenario and your machine :-) Packing small tasks is the best choice for any scenario and machine. It's a trickier point for not-so-small tasks, because different machines will want different behaviors.
Regards, Vincent
Then, if too many small tasks wake up simultaneously on the same LCPU, the default load balance will spread them across the core/cluster/socket.
Obviously, more and more cores is the trend for every kind of CPU, and the buddy system seems hard-pressed to keep up with that.
-- Thanks Alex
-- Thanks Alex
On Tue, Dec 18, 2012 at 5:53 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
On 17 December 2012 16:24, Alex Shi alex.shi@intel.com wrote:
>> The scheme below tries to summaries the idea: >> >> Socket | socket 0 | socket 1 | socket 2 | socket 3 | >> LCPU | 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 | >> buddy conf0 | 0 | 0 | 1 | 16 | 2 | 32 | 3 | 48 | >> buddy conf1 | 0 | 0 | 0 | 16 | 16 | 32 | 32 | 48 | >> buddy conf2 | 0 | 0 | 16 | 16 | 32 | 32 | 48 | 48 | >> >> But, I don't know how this can interact with NUMA load balance and the >> better might be to use conf3. > > I mean conf2 not conf3
Cyclictest is the ultimate small-task use case, which points out all the weaknesses of a scheduler for this kind of task. Music playback is a more realistic one, and it also shows an improvement.
granularity or one tick, so we really don't need to consider the task migration cost. But when the tasks are not that small, migration is
For which kind of machine are you stating that hypothesis?
It seems the biggest argument between us is that you don't want to admit that 'not too small' tasks exist and that they will cause more migrations because of your patch.
even so, they should run in the same socket for power-saving reasons (my power scheduling patch can do this), instead of being spread over all the sockets.
This may be good for your scenario and your machine :-) Packing small tasks is the best choice for any scenario and machine.
That's clearly wrong; I have explained it many times: your single buddy CPU cannot pack all the tasks on a big machine, even with just 16 LCPUs, while it is supposed to.
Anyway, you have the right to insist on your design, and I don't think I can state the scalability issue any more clearly. I won't judge the patch again.
On Thu, 2012-12-13 at 22:25 +0800, Alex Shi wrote:
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from the others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:
      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance, so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check against one buddy CPU.
I just have a little worry about the scalability on a big machine, e.g. a 4-socket NUMA machine * 8 cores * HT: the buddy cpu for the whole system needs to care about 64 LCPUs, whereas in your case cpu0 just cares about 4 LCPUs. That makes a difference to the task-distribution decision.
The buddy CPU should probably not be the same for all 64 LCPUs; it depends on where it's worth packing small tasks.
Do you have further ideas for buddy cpu on such example?
Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
It is the general x86 domain configuration, with 4 levels: sibling/core/cpu/numa.
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
-Mike
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
Do you have further ideas for buddy cpu on such example?
Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
It is the general x86 domain configuration, with 4 levels: sibling/core/cpu/numa.
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
Maybe. The CPU and NUMA levels differ in their domain flags; CPU has SD_PREFER_SIBLING.
On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote:
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
Do you have further ideas for buddy cpu on such example?
Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
It is the general x86 domain configuration, with 4 levels: sibling/core/cpu/numa.
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
Maybe. The CPU and NUMA levels differ in their domain flags; CPU has SD_PREFER_SIBLING.
What I noticed during (an unrelated) bisection on a 40 core box was domains going from so..
3.4.0-bisect (virgin) [ 5.056214] CPU0 attaching sched-domain: [ 5.065009] domain 0: span 0,32 level SIBLING [ 5.075011] groups: 0 (cpu_power = 589) 32 (cpu_power = 589) [ 5.088381] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 5.107669] groups: 0,32 (cpu_power = 1178) 4,36 (cpu_power = 1178) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177) 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176) [ 5.162115] domain 2: span 0-79 level NODE [ 5.171927] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
..to so, which looks a little bent. CPU and MC have identical spans, so CPU should have gone away, as it used to do.
3.6.0-bisect (virgin) [ 3.978338] CPU0 attaching sched-domain: [ 3.987125] domain 0: span 0,32 level SIBLING [ 3.997125] groups: 0 (cpu_power = 588) 32 (cpu_power = 589) [ 4.010477] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 4.029748] groups: 0,32 (cpu_power = 1177) 4,36 (cpu_power = 1177) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178) 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177) [ 4.084143] domain 2: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU [ 4.103796] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) [ 4.124373] domain 3: span 0-79 level NUMA [ 4.134369] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780)
-Mike
On 12/14/2012 03:45 PM, Mike Galbraith wrote:
On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote:
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
Do you have further ideas for buddy cpu on such example?
> > Which kind of sched_domain configuration have you for such system ? > and how many sched_domain level have you ?
it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
Maybe. the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
What I noticed during (an unrelated) bisection on a 40 core box was domains going from so..
3.4.0-bisect (virgin) [ 5.056214] CPU0 attaching sched-domain: [ 5.065009] domain 0: span 0,32 level SIBLING [ 5.075011] groups: 0 (cpu_power = 589) 32 (cpu_power = 589) [ 5.088381] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 5.107669] groups: 0,32 (cpu_power = 1178) 4,36 (cpu_power = 1178) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177) 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176) [ 5.162115] domain 2: span 0-79 level NODE [ 5.171927] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
..to so, which looks a little bent. CPU and MC have identical spans, so CPU should have gone away, as it used to do.
Better to remove one, and I believe you can make it. :)
On 14 December 2012 08:45, Mike Galbraith bitbucket@online.de wrote:
On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote:
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
Do you have further ideas for buddy cpu on such example?
> > Which kind of sched_domain configuration have you for such system ? > and how many sched_domain level have you ?
it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
Maybe. the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
What I noticed during (an unrelated) bisection on a 40 core box was domains going from so..
3.4.0-bisect (virgin) [ 5.056214] CPU0 attaching sched-domain: [ 5.065009] domain 0: span 0,32 level SIBLING [ 5.075011] groups: 0 (cpu_power = 589) 32 (cpu_power = 589) [ 5.088381] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 5.107669] groups: 0,32 (cpu_power = 1178) 4,36 (cpu_power = 1178) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177) 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176) [ 5.162115] domain 2: span 0-79 level NODE [ 5.171927] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
..to so, which looks a little bent. CPU and MC have identical spans, so CPU should have gone away, as it used to do.
3.6.0-bisect (virgin) [ 3.978338] CPU0 attaching sched-domain: [ 3.987125] domain 0: span 0,32 level SIBLING [ 3.997125] groups: 0 (cpu_power = 588) 32 (cpu_power = 589) [ 4.010477] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 4.029748] groups: 0,32 (cpu_power = 1177) 4,36 (cpu_power = 1177) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178) 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177) [ 4.084143] domain 2: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU [ 4.103796] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) [ 4.124373] domain 3: span 0-79 level NUMA [ 4.134369] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780)
Thanks. That's an interesting example of a numa topology.
For your sched_domain difference: on 3.4, SD_PREFER_SIBLING was set for both the MC and CPU levels thanks to sd_balance_for_mc_power and sd_balance_for_package_power. On 3.6, SD_PREFER_SIBLING is only set at the CPU level, and this flag difference with the MC level prevents the destruction of the CPU sched_domain during the degeneration.
We may need to set SD_PREFER_SIBLING for the MC level.
Vincent
-Mike
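To make the suggestion above concrete, the change would amount to flipping the SD_PREFER_SIBLING bit in the MC level flags; the sketch below is illustrative only and does not reproduce the actual include/linux/topology.h layout.

	/*
	 * Illustrative only: with SD_PREFER_SIBLING set at the MC level,
	 * the MC and CPU levels carry the same flag again, so the redundant
	 * CPU domain (identical span) can be degenerated as before.
	 */
	#define SD_MC_FLAGS_BEFORE	(0*SD_PREFER_SIBLING | 1*SD_SHARE_PKG_RESOURCES)
	#define SD_MC_FLAGS_AFTER	(1*SD_PREFER_SIBLING | 1*SD_SHARE_PKG_RESOURCES)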
On Fri, 2012-12-14 at 11:43 +0100, Vincent Guittot wrote:
On 14 December 2012 08:45, Mike Galbraith bitbucket@online.de wrote:
On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote:
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
Do you have further ideas for buddy cpu on such example? > > > > Which kind of sched_domain configuration have you for such system ? > > and how many sched_domain level have you ?
it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
Maybe. the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
What I noticed during (an unrelated) bisection on a 40 core box was domains going from so..
3.4.0-bisect (virgin) [ 5.056214] CPU0 attaching sched-domain: [ 5.065009] domain 0: span 0,32 level SIBLING [ 5.075011] groups: 0 (cpu_power = 589) 32 (cpu_power = 589) [ 5.088381] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 5.107669] groups: 0,32 (cpu_power = 1178) 4,36 (cpu_power = 1178) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177) 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176) [ 5.162115] domain 2: span 0-79 level NODE [ 5.171927] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
..to so, which looks a little bent. CPU and MC have identical spans, so CPU should have gone away, as it used to do.
3.6.0-bisect (virgin) [ 3.978338] CPU0 attaching sched-domain: [ 3.987125] domain 0: span 0,32 level SIBLING [ 3.997125] groups: 0 (cpu_power = 588) 32 (cpu_power = 589) [ 4.010477] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [ 4.029748] groups: 0,32 (cpu_power = 1177) 4,36 (cpu_power = 1177) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178) 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177) [ 4.084143] domain 2: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU [ 4.103796] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) [ 4.124373] domain 3: span 0-79 level NUMA [ 4.134369] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780)
Thanks. That's an interesting example of a numa topology.
For your sched_domain difference: on 3.4, SD_PREFER_SIBLING was set for both the MC and CPU levels thanks to sd_balance_for_mc_power and sd_balance_for_package_power. On 3.6, SD_PREFER_SIBLING is only set at the CPU level, and this flag difference with the MC level prevents the destruction of the CPU sched_domain during the degeneration.
We may need to set SD_PREFER_SIBLING for the MC level.
Ah, that explains oddity. (todo--).
Hm, seems changing flags should trigger a rebuild. (todo++,drat).
-Mike
CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo).
Uh, the SD_PREFER_SIBLING on the cpu domain was brought back by me for a shared-memory benchmark regression. But considering all the situations, I think the flag is better removed.
============
From 96bee9a03b2048f2686fbd7de0e2aee458dbd917 Mon Sep 17 00:00:00 2001
From: Alex Shi alex.shi@intel.com Date: Mon, 17 Dec 2012 09:42:57 +0800 Subject: [PATCH 01/18] sched: remove SD_PREFER_SIBLING flag
The flag was introduced in commit b5d978e0c7e79a. Its purpose seems to be to fill up one node first on a NUMA machine, by pulling tasks from other nodes when the node has capacity.
Its advantage is that when a few tasks share memory among themselves, pulling them together helps locality and therefore gives a performance gain. The shortcoming is that it keeps unnecessary task migrations thrashing between different nodes, which reduces that gain and simply hurts performance if the tasks share no memory.
Considering that the sched numa balancing patches are coming, this small advantage is meaningless to us, so it is better to remove the flag.
Reported-by: Mike Galbraith efault@gmx.de Signed-off-by: Alex Shi alex.shi@intel.com --- include/linux/sched.h | 1 - include/linux/topology.h | 2 -- kernel/sched/core.c | 1 - kernel/sched/fair.c | 19 +------------------ 4 files changed, 1 insertion(+), 22 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 5dafac3..6dca96c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -836,7 +836,6 @@ enum cpu_idle_type { #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ -#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ #define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
extern int __weak arch_sd_sibiling_asym_packing(void); diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..15864d1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -100,7 +100,6 @@ int arch_update_cpu_topology(void); | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ - | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing() \ , \ .last_balance = jiffies, \ @@ -162,7 +161,6 @@ int arch_update_cpu_topology(void); | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ - | 1*SD_PREFER_SIBLING \ , \ .last_balance = jiffies, \ .balance_interval = 1, \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5dae0d2..8ed2784 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6014,7 +6014,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu) | 0*SD_SHARE_CPUPOWER | 0*SD_SHARE_PKG_RESOURCES | 1*SD_SERIALIZE - | 0*SD_PREFER_SIBLING | sd_local_flags(level) , .last_balance = jiffies, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 59e072b..5d175f2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4339,13 +4339,9 @@ static bool update_sd_pick_busiest(struct lb_env *env, static inline void update_sd_lb_stats(struct lb_env *env, int *balance, struct sd_lb_stats *sds) { - struct sched_domain *child = env->sd->child; struct sched_group *sg = env->sd->groups; struct sg_lb_stats sgs; - int load_idx, prefer_sibling = 0; - - if (child && child->flags & SD_PREFER_SIBLING) - prefer_sibling = 1; + int load_idx;
load_idx = get_sd_load_idx(env->sd, env->idle);
@@ -4362,19 +4358,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, sds->total_load += sgs.group_load; sds->total_pwr += sg->sgp->power;
- /* - * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity to one so that we'll try - * and move all the excess tasks away. We lower the capacity - * of a group only if the local group has the capacity to fit - * these excess tasks, i.e. nr_running < group_capacity. The - * extra check prevents the case where you always pull from the - * heaviest group when it is already under-utilized (possible - * with a large weight task outweighs the tasks on the system). - */ - if (prefer_sibling && !local_group && sds->this_has_capacity) - sgs.group_capacity = min(sgs.group_capacity, 1UL); - if (local_group) { sds->this_load = sgs.avg_load; sds->this = sg;
Hi Vincent,
On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return ((rq->avg.runnable_avg_sum << rq->nr_running) >
If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running can overflow to zero, and you will get the wrong decision.
Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running));
Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks running 15% cputime each, right?
I don't know what is correct, but I'd guess that from a cpu's point of view it is busier with a higher runnable_avg_sum than with a higher nr_running, IMHO.
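Working the three cases through the revised check quoted above (busy when runnable_avg_sum > runnable_avg_period >> nr_running, with the period taken as 100%) confirms this reading:

	/*
	 *  1 task  @ 40%: sum = 40%, period >> 1 = 50%   -> not busy
	 *  3 tasks @ 10%: sum = 30%, period >> 3 = 12.5% -> busy
	 *  2 tasks @ 15%: sum = 30%, period >> 2 = 25%   -> busy
	 *
	 * So one fairly active task counts as less busy than several lighter
	 * ones, which is exactly the behaviour being questioned here.
	 */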
rq->avg.runnable_avg_period);
+}
+static bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.runnable_avg_sum << 1) <
p->se.avg.runnable_avg_period);
25% may not be suitable for a big machine.
Thresholds are always an issue; which threshold would be suitable for a big machine?
I'm wondering if I should use the imbalance_pct value for computing the threshold.
Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be << 2 ?
Thanks, Namhyung
On 21 December 2012 06:47, Namhyung Kim namhyung@kernel.org wrote:
Hi Vincent,
On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return ((rq->avg.runnable_avg_sum << rq->nr_running) >
If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running can overflow to zero, and you will get the wrong decision.
Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running));
Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks running 15% cputime each, right?
Yes it's right.
I don't know what is correct, but I'd guess that from a cpu's point of view it is busier with a higher runnable_avg_sum than with a higher nr_running, IMHO.
The nr_running is used to point out how many tasks are running simultaneously and the potential scheduling latency of adding
rq->avg.runnable_avg_period);
+}
+static bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.runnable_avg_sum << 1) <
p->se.avg.runnable_avg_period);
25% may not be suitable for a big machine.
Thresholds are always an issue; which threshold would be suitable for a big machine?
I'm wondering if I should use the imbalance_pct value for computing the threshold.
Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be << 2 ?
The 1st version of the patch was using << 2, but I received a comment saying that it was maybe not aggressive enough, so I updated the formula to << 1 but forgot to update the comment. I will align the comment and the formula in the next version. Thanks for pointing this out.
Vincent
Thanks, Namhyung
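For readers following the threshold discussion, the two variants compare as follows (sketch only, with avg = runnable_avg_sum / runnable_avg_period):

	/* (sum << 2) < period  <=>  avg < 25%  -- the V1 threshold, matching the comment */
	/* (sum << 1) < period  <=>  avg < 50%  -- the V2 threshold, which packs more aggressively */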
On 21 December 2012 09:53, Vincent Guittot vincent.guittot@linaro.org wrote:
On 21 December 2012 06:47, Namhyung Kim namhyung@kernel.org wrote:
Hi Vincent,
On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote:
On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return ((rq->avg.runnable_avg_sum << rq->nr_running) >
If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running can overflow to zero, and you will get the wrong decision.
Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running));
Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks running 15% cputime each, right?
Yes it's right.
I don't know what is correct, but I'd guess that from a cpu's point of view it is busier with a higher runnable_avg_sum than with a higher nr_running, IMHO.
sorry, the mail has been sent before i finish it
The nr_running is used to point out how many tasks are running simultaneously and the potential scheduling latency of adding
The nr_running is used to point out how many tasks are running simultaneously and as a result the potential scheduling latency. I have used the shift instruction because it was quite simple and efficient but it may give too much weight to nr_running. I could use a simple division instead of shifting runnable_avg_sum
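A hypothetical sketch of the division-based variant mentioned here might look like the code below; it is not part of the posted series, and the exact scaling (nr versus nr + 1, rounding) would need tuning.

	static bool is_buddy_busy_div(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		u32 sum = rq->avg.runnable_avg_sum;
		u32 period = rq->avg.runnable_avg_period;
		unsigned int nr = rq->nr_running;

		/* an empty runqueue is never busy */
		if (!nr)
			return false;

		/* nr_running now weighs in linearly instead of exponentially */
		return sum > period / nr;
	}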
rq->avg.runnable_avg_period);
+}
+static bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.runnable_avg_sum << 1) <
p->se.avg.runnable_avg_period);
25% may not be suitable for a big machine.
Thresholds are always an issue; which threshold would be suitable for a big machine?
I'm wondering if I should use the imbalance_pct value for computing the threshold.
Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be << 2 ?
The 1st version of the patch was using << 2, but I received a comment saying that it was maybe not aggressive enough, so I updated the formula to << 1 but forgot to update the comment. I will align the comment and the formula in the next version. Thanks for pointing this out.
Vincent
Thanks, Namhyung
If a CPU accesses the runnable_avg_sum and runnable_avg_period fields of its buddy CPU while the latter updates them, it can get the new version of one field and the old version of the other. This can lead to erroneous decisions. We don't want to use a lock mechanism to ensure the coherency because of the overhead in this critical path. The previous attempt can't ensure the coherency of both fields for 100% of platforms and use cases, as it depends on the toolchain and the platform architecture. The runnable_avg_period of a runqueue tends to the max value in less than 345ms after plugging in a CPU, which implies that we could use the max value instead of reading runnable_avg_period after 345ms. During the starting phase, we must ensure a minimum of coherency between the fields; a simple rule is runnable_avg_sum <= runnable_avg_period.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fc93d96..f1a4c24 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5153,13 +5153,16 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp static bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu); + u32 sum = rq->avg.runnable_avg_sum; + u32 period = rq->avg.runnable_avg_period; + + sum = min(sum, period);
/* * A busy buddy is a CPU with a high load or a small load with a lot of * running tasks. */ - return ((rq->avg.runnable_avg_sum << rq->nr_running) > - rq->avg.runnable_avg_period); + return ((sum << rq->nr_running) > period); }
static bool is_light_task(struct task_struct *p)
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f1a4c24..6475f54 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7274,7 +7274,25 @@ static struct {
static inline int find_new_ilb(int call_cpu) { + struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask); + int buddy = per_cpu(sd_pack_buddy, call_cpu); + + /* + * If we have a pack buddy CPU, we try to run load balance on a CPU + * that is close to the buddy. + */ + if (buddy != -1) + for_each_domain(buddy, sd) { + if (sd->flags & SD_SHARE_CPUPOWER) + continue; + + ilb = cpumask_first_and(sched_domain_span(sd), + nohz.idle_cpus_mask); + + if (ilb < nr_cpu_ids) + break; + }
if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb;
The ARM platforms take advantage of packing small tasks on a few cores. This is true even when the cores of a cluster can't be power gated independently. So we clear SD_SHARE_POWERDOMAIN at the MC and CPU levels.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- arch/arm/kernel/topology.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 79282eb..f89a4a2 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];
+int arch_sd_local_flags(int level) +{ + /* Powergate at threading level doesn't make sense */ + if (level & SD_SHARE_CPUPOWER) + return 1*SD_SHARE_POWERDOMAIN; + + return 0*SD_SHARE_POWERDOMAIN; +} + const struct cpumask *cpu_coregroup_mask(int cpu) { return &cpu_topology[cpu].core_sibling;
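For illustration, this is how the hook above behaves for the two cases it distinguishes; how the generic code actually invokes arch_sd_local_flags() is defined earlier in the series and is only assumed here.

	/* SMT level: threads share the power rail, keep SD_SHARE_POWERDOMAIN set */
	int smt_flags = arch_sd_local_flags(SD_SHARE_CPUPOWER);	/* -> SD_SHARE_POWERDOMAIN */

	/* MC/CPU levels: cores and clusters are worth packing, flag cleared */
	int mc_flags = arch_sd_local_flags(0);			/* -> 0 */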