Hi,
This patchset takes advantage of the new per-task load tracking available in the kernel to pack small tasks on as few CPUs/clusters/cores as possible. The main goal of packing small tasks is to reduce power consumption in low-load use cases by minimizing the number of power domains that are enabled. The packing is done in 2 steps:
The 1st step looks for the best place to pack tasks in a system according to its topology, and it defines a pack buddy CPU for each CPU if one is available. We define the best CPU during the build of the sched_domain instead of evaluating it at runtime because it can be difficult to define a stable buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is that we pack at all levels inside a node where a group of CPUs can be power gated independently from the others. To describe this capability, a new flag, SD_SHARE_POWERDOMAIN, has been introduced; it indicates whether the groups of CPUs of a scheduling domain share their power state. By default, this flag is set in all sched_domains in order to keep the current behavior of the scheduler unchanged, and only the ARM platform clears the SD_SHARE_POWERDOMAIN flag at the MC and CPU levels.
In a 2nd step, the scheduler checks the load average of a task that wakes up as well as the load average of the buddy CPU, and it can decide to migrate light tasks to a buddy that is not busy. This check is done at wake-up because small tasks tend to wake up between periodic load balances and asynchronously to each other, which prevents the default mechanism from catching and migrating them efficiently. A light task is defined by a runnable_avg_sum that is less than 20% of the runnable_avg_period. In fact, the former condition encloses 2 conditions: the average CPU load of the task must be less than 20%, and the task must have been runnable for less than 10ms when it last woke up, in order to be eligible for the packing migration. So a task that runs 1ms every 5ms will be considered a small task, but a task that runs 50ms with a period of 500ms will not. Then, the busyness of the buddy CPU depends on the load average of its rq and the number of running tasks. A CPU with a load average greater than 50% will be considered busy whatever the number of running tasks, and this threshold is reduced by the number of running tasks so as not to increase the wake-up latency of a task too much. When the buddy CPU is busy, the scheduler falls back to the default CFS policy.
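For reference, here is a condensed sketch of the wake-up check that patch 3 of this series implements (simplified from the patch below, using the same helper and field names; see patch 3 for the full version):

static bool is_light_task(struct task_struct *p)
{
	/* the task runs less than 20% of the time on average */
	return ((p->se.avg.runnable_avg_sum * 5) <
		(p->se.avg.runnable_avg_period));
}

static bool is_buddy_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* busy above 50% load; the threshold shrinks as nr_running grows */
	return (rq->avg.runnable_avg_sum >
		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
}

At wake-up, the task is migrated to the buddy CPU only if is_light_task() is true for the task and is_buddy_busy() is false for the buddy; otherwise the scheduler keeps the default CFS selection.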
Changes since V2:
 - Migrate only a task that wakes up
 - Change the light task threshold to 20%
 - Change the loaded CPU threshold to not pull tasks if the current number of running tasks is null but the load average is already greater than 50%
 - Fix the algorithm for selecting the buddy CPU.
Changes since V1:
Patch 2/6
 - Change the flag name which was not clear. The new name is SD_SHARE_POWERDOMAIN.
 - Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
 - Fix issues in the algorithm that looks for the best buddy CPU
 - Use pr_debug instead of pr_info
 - Fix for uniprocessor
Patch 4/6
 - Remove the use of usage_avg_sum which has not been merged
Patch 5/6
 - Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured
Patch 6/6
 - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for the ARM platform
New results for v3:
This series has been tested with hackbench on an ARM platform and the results don't show any performance regression.
Hackbench               3.9-rc2   +patches
Mean Time (10 tests):     2.048      2.015
stdev:                    0.047      0.068
Previous results for V2:
This series has been tested with MP3 playback on an ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 clusters).
The measurements have been done on an Ubuntu image during 60 seconds of playback and the result has been normalized to 100.
        | CA15 | CA7 | total |
-------------------------------------
default |  81  |  97 |  178  |
pack    |  13  | 100 |  113  |
-------------------------------------
Previous results for V1:
The patch-set has been tested on ARM platforms: a quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 clusters). For the ARM platforms, the results have demonstrated that it's worth packing small tasks at all topology levels.
The performance tests have been done on both platforms with sysbench. The results don't show any performance regression. These results are consistent with the policy, which keeps the normal behavior for heavy use cases.
test: sysbench --test=cpu --num-threads=N --max-requests=R run
The results below are the average duration of 3 tests on the quad CA-9.
default is the current scheduler behavior (pack buddy CPU is -1)
pack is the scheduler with the pack mechanism
             | default |  pack   |
-----------------------------------
N=8;  R=200  |  3.1999 |  3.1921 |
N=8;  R=2000 | 31.4939 | 31.4844 |
N=12; R=200  |  3.2043 |  3.2084 |
N=12; R=2000 | 31.4897 | 31.4831 |
N=16; R=200  |  3.1774 |  3.1824 |
N=16; R=2000 | 31.4899 | 31.4897 |
-----------------------------------
The power consumption tests have been done only on the TC2 platform, which has accessible power lines, and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
The measurements have been done during 16 seconds and the result has been normalized to 100.
        | CA15 | CA7 | total |
-------------------------------------
default | 100  |  40 |  140  |
pack    |  <1  |  45 |  <46  |
-------------------------------------
The A15 cluster is less power efficient than the A7 cluster, but if we assume that the tasks are well spread on both clusters, we can estimate that the power consumption on a dual cluster of CA7 would have been, for a default kernel:
        | CA7 | CA7 | total |
-------------------------------------
default |  40 |  40 |   80  |
-------------------------------------
Vincent Guittot (6):
  Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
    load-tracking"
  sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
  sched: pack small tasks
  sched: secure access to other CPU statistics
  sched: pack the idle load balance
  ARM: sched: clear SD_SHARE_POWERDOMAIN
 arch/arm/kernel/topology.c       |    9 +++
 arch/ia64/include/asm/topology.h |    1 +
 arch/tile/include/asm/topology.h |    1 +
 include/linux/sched.h            |    9 +--
 include/linux/topology.h         |    4 +
 kernel/sched/core.c              |   14 ++--
 kernel/sched/fair.c              |  149 +++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h             |   14 ++--
 8 files changed, 169 insertions(+), 32 deletions(-)
This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |    8 +-------
 kernel/sched/core.c   |    7 +------
 kernel/sched/fair.c   |   13 ++-----------
 kernel/sched/sched.h  |    9 +--------
 4 files changed, 5 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..16b0d18 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,13 +1159,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq *my_q;
 #endif
-
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	/* Per-entity load-tracking */
 	struct sched_avg avg;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..54eaaa2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1561,12 +1561,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime = 0;
 	INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
 }

 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */

 static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {

 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..7f36024f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
 #endif

 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * CFS Load tracking
  * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,8 +236,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
This new flag, SD_SHARE_POWERDOMAIN, is used to reflect whether the groups of CPUs in a sched_domain level can reach a different power state or not. If groups of cores can be power gated independently, for example, the flag should be cleared at the CPU level. This information is used to decide if it's worth packing some tasks into a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks, so the flag is set in all sched_domains.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/ia64/include/asm/topology.h |    1 +
 arch/tile/include/asm/topology.h |    1 +
 include/linux/sched.h            |    1 +
 include/linux/topology.h         |    4 ++++
 kernel/sched/core.c              |    6 ++++++
 5 files changed, 13 insertions(+)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..6d0b617 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -65,6 +65,7 @@ void build_cpu_to_node_map(void);
 				| SD_BALANCE_EXEC	\
 				| SD_BALANCE_FORK	\
 				| SD_WAKE_AFFINE,	\
+				| arch_sd_local_flags(0)\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d5e86c9..adc87102 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
 				| 0*SD_WAKE_AFFINE		\
 				| 0*SD_SHARE_CPUPOWER		\
 				| 0*SD_SHARE_PKG_RESOURCES	\
+				| arch_sd_local_flags(0)	\
 				| 0*SD_SERIALIZE		\
 				,				\
 	.last_balance		= jiffies,			\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16b0d18..a82cdeb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -809,6 +809,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..3eab293 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,6 +99,8 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(SD_SHARE_CPUPOWER|\
+						SD_SHARE_PKG_RESOURCES)	\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
@@ -131,6 +133,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\
 				| 0*SD_SERIALIZE			\
 				,					\
 	.last_balance		= jiffies,				\
@@ -161,6 +164,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(0)		\
 				| 0*SD_SERIALIZE			\
 				| 1*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54eaaa2..b827e0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5938,6 +5938,11 @@ int __weak arch_sd_sibling_asym_packing(void)
 	return 0*SD_ASYM_PACKING;
 }

+int __weak arch_sd_local_flags(int level)
+{
+	return 1*SD_SHARE_POWERDOMAIN;
+}
+
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6127,6 +6132,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
+					| 1*SD_SHARE_POWERDOMAIN
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
 					| sd_local_flags(level)
During the creation of a sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from the others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:
      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
Small tasks tend to slip out of the periodic load balance, so the best time to migrate them is when they wake up. The decision is in O(1) as we only check against one buddy CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |  115 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    5 +++
 3 files changed, 121 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b827e0c..21c35ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5662,6 +5662,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);

+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..021c7b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@ void sched_init_granularity(void)
 	update_sysctl();
 }
+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks.
+ * We make the assumption that it's not worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core, based on the assumption that their power efficiency is
+ * better.
+ */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags & SD_LOAD_BALANCE)
+		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points on the local sched_group
+		 * and the 1st CPU of this local group is a good candidate
+		 */
+		id = cpumask_first(sched_group_cpus(pack));
+
+		/* loop the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+			 && (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
@@ -3291,6 +3361,47 @@ done:
 	return target;
 }
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+			(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% in average */
+	return ((p->se.avg.runnable_avg_sum * 5) <
+			(p->se.avg.runnable_avg_period));
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+	int buddy = per_cpu(sd_pack_buddy, cpu);
+
+	/* No pack buddy for this CPU */
+	if (buddy == -1)
+		return false;
+
+	/* buddy is not an allowed CPU */
+	if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+		return false;
+
+	/*
+	 * If the task is a small one and the buddy is not overloaded,
+	 * we use buddy cpu
+	 */
+	if (!is_light_task(p) || is_buddy_busy(buddy))
+		return false;
+
+	return true;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3319,6 +3430,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
 		new_cpu = prev_cpu;
+
+		/* We pack only at wake up and not new task */
+		if (check_pack_buddy(new_cpu, p))
+			return per_cpu(sd_pack_buddy, new_cpu);
 	}

 	rcu_read_lock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..96b164d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -872,6 +872,7 @@ extern const struct sched_class idle_sched_class;

 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void update_packing_domain(int cpu);

 #else	/* CONFIG_SMP */

@@ -879,6 +880,10 @@ static inline void idle_balance(int cpu, struct rq *rq)
 {
 }

+static inline void update_packing_domain(int cpu)
+{
+}
+
 #endif
extern void sysrq_sched_debug_show(void);
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
Hi,
On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
Peter, let's say the run-queue has 50% utilization and is running 2 tasks, and we wish to find out if it is busy. We would compare this metric with the cpu power, which let's say is 100.
rq->util * 100 < cpu_of(rq)->power.
In the above scenario would we declare the cpu _not_busy? Or would we do the following:
(rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it is just enough _busy_ to not take on more processes?
@Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU with *high rq utilization*, as far as the equation goes.
Regards
Preeti U Murthy
On 27 March 2013 11:21, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi,
On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
Peter, let's say the run-queue has 50% utilization and is running 2 tasks, and we wish to find out if it is busy. We would compare this metric with the cpu power, which let's say is 100.
rq->util * 100 < cpu_of(rq)->power.
I don't use cpu_of(rq)->power in the definition of the busyness.
In the above scenario would we declare the cpu _not_busy? Or would we do the following:
In the above scenario, the CPU is busy
By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period. In addition, I take into account the number of tasks already in the runqueue in order to define the busyness of a CPU. A CPU with a load of 50% without any task in its runqueue is not busy at this time, and we can migrate tasks to it; but if the CPU already has 2 tasks in its runqueue, it means that a newly woken-up task will have to share the CPU with the other tasks, so we consider that the CPU is already busy and we fall back to the default behavior. The equation considers that a CPU is not busy if 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
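To illustrate, plugging values into the right-hand side of that inequality gives the following "not busy" thresholds (this is just the arithmetic of the equation above):

 nr_running | CPU considered not busy when load is
------------+--------------------------------------
     0      | below 50%  (100 / 2)
     1      | below 33%  (100 / 3)
     2      | below 25%  (100 / 4)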
(rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it is just enough _busy_ to not take on more processes?
@Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU with *high rq utilization*, as far as the equation goes.
I can update the comment. Is the comment below clearer?
/*
 * A busy buddy is a CPU with a high average running time or a small
 * average running time but a lot of running tasks in its runqueue
 * which are already sharing this CPU time.
 */
Vincent
Regards
Preeti U Murthy
On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
On 27 March 2013 11:21, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi,
On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period. In addition, I take into account the number of tasks already in the runqueue in order to define the busyness of a CPU. A CPU with a load of 50% without any task in its runqueue is not busy at this time, and we can migrate tasks to it; but if the CPU already has 2 tasks in its runqueue, it means that a newly woken-up task will have to share the CPU with the other tasks, so we consider that the CPU is already busy and we fall back to the default behavior. The equation considers that a CPU is not busy if 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
I'm still somewhat confused by all this. So raising nr_running will lower the required utilization to be considered busy. Suppose we have 50 tasks running, all at 1% utilization (bulk wakeup); we'd consider the cpu busy, even though it's picking its nose for half the time.
I'm assuming it's meant to limit process latency or so? Why are you concerned with that? This seems like an arbitrary power vs performance tweak without solid evidence it's needed or even wanted.
On 26 April 2013 12:30, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
On 27 March 2013 11:21, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi,
On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period. In addition, I take into account the number of tasks already in the runqueue in order to define the busyness of a CPU. A CPU with a load of 50% without any task in its runqueue is not busy at this time, and we can migrate tasks to it; but if the CPU already has 2 tasks in its runqueue, it means that a newly woken-up task will have to share the CPU with the other tasks, so we consider that the CPU is already busy and we fall back to the default behavior. The equation considers that a CPU is not busy if 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
I'm still somewhat confused by all this. So raising nr_running will lower the required utilization to be considered busy. Suppose we have 50 tasks running, all at 1% utilization (bulk wakeup); we'd consider the cpu busy, even though it's picking its nose for half the time.
I'm assuming it's meant to limit process latency or so? Why are you concerned with that? This seems like an arbitrary power vs performance tweak without solid evidence it's needed or even wanted.
Yes, the goal was to limit the wake-up latency. This version was only trying to modify the scheduler behavior when the system was not busy, in order to pack the small tasks like background activities without decreasing performance, so we were concerned about wake-up latency.
The new version proposes a more aggressive mode that packs all tasks until the CPUs become full.
Vincent
On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
Hi,
On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
Peter, let's say the run-queue has 50% utilization and is running 2 tasks, and we wish to find out if it is busy. We would compare this metric with the cpu power, which let's say is 100.
rq->util * 100 < cpu_of(rq)->power.
In the above scenario would we declare the cpu _not_busy? Or would we do the following:
(rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it is just enough _busy_ to not take on more processes?
That is just confused... ->power doesn't have anything to do with a per-cpu measure. ->power is an inter-cpu measure of relative compute capacity.
Mixing in nr_running confuses things even more; it doesn't matter how many tasks it takes to push utilization up to 100%; once it's there the cpu simply cannot run more.
So colour me properly confused..
Hi Peter,
On 04/26/2013 03:48 PM, Peter Zijlstra wrote:
On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
Hi,
On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with
+	 * a lot of running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..
Peter, let's say the run-queue has 50% utilization and is running 2 tasks, and we wish to find out if it is busy. We would compare this metric with the cpu power, which let's say is 100.
rq->util * 100 < cpu_of(rq)->power.
In the above scenario would we declare the cpu _not_busy? Or would we do the following:
(rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it is just enough _busy_ to not take on more processes?
That is just confused... ->power doesn't have anything to do with a per-cpu measure. ->power is an inter-cpu measure of relative compute capacity.
Ok.
Mixing in nr_running confuses things even more; it doesn't matter how many tasks it takes to push utilization up to 100%; once it's there the cpu simply cannot run more.
True, this is from the perspective of the CPU. But will not the tasks on this CPU get throttled if you find the utilization of this CPU < 100% and decide to put more tasks on it?
Regards
Preeti U Murthy
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% in average */
+	return ((p->se.avg.runnable_avg_sum * 5) <
+		(p->se.avg.runnable_avg_period));
+}
OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into this as well.
PJT, do you have any sane solution for this, I forgot what the result of the last discussion was -- was there any?
On 26 March 2013 13:37, Peter Zijlstra peterz@infradead.org wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% in average */
+	return ((p->se.avg.runnable_avg_sum * 5) <
+		(p->se.avg.runnable_avg_period));
+}
OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into this as well.
Hi Peter,
The packing of small tasks is only applied at wake-up and not during fork or exec, so the runnable_avg_* fields should already have been initialized. As you mentioned, we assume that a fresh task is fully loaded and let the default scheduler behavior select a target CPU.
Vincent
PJT, do you have any sane solution for this, I forgot what the result of the last discussion was -- was there any?
Hi Peter,
On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% in average */
+	return ((p->se.avg.runnable_avg_sum * 5) <
+		(p->se.avg.runnable_avg_period));
+}
OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into this as well.
PJT, do you have any sane solution for this, I forgot what the result of the last discussion was -- was there any?
The conclusion of the last discussion between PJT and Alex was that the load contribution of a fresh task should be set to "full" during __sched_fork().
Setting task->se.avg.load_avg_contrib = task->se.load.weight during __sched_fork() is reflected in the latest power-aware scheduler patchset by Alex.
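A sketch of that idea, written as a delta against the __sched_fork() lines shown in patch 1 of this series (an illustration of the conclusion above, not Alex's actual patch):

@@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	/* assume a fresh task is fully loaded until it builds real history */
+	p->se.avg.load_avg_contrib = p->se.load.weight;
 #endif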
Thanks
Regards
Preeti U Murthy
On 03/27/2013 12:33 PM, Preeti U Murthy wrote:
Hi Peter,
On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% in average */
+	return ((p->se.avg.runnable_avg_sum * 5) <
+		(p->se.avg.runnable_avg_period));
+}
OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into this as well.
PJT, do you have any sane solution for this, I forgot what the result of the last discussion was -- was there any?
The conclusion of the last discussion between PJT and Alex was that the load contribution of a fresh task should be set to "full" during __sched_fork().
Setting task->se.avg.load_avg_contrib = task->se.load.weight during __sched_fork() is reflected in the latest power-aware scheduler patchset by Alex.
Yes, the newly forked runnable load was set as full utilisation in the V5 power-aware scheduling patchset. PJT, Mike and I all agree on this. PJT just discussed how to give the full load to a newly forked task, and we reached agreement for my coming V6 power-aware scheduling patchset.
Thanks
Regards
Preeti U Murthy
On Wed, 2013-03-27 at 12:48 +0800, Alex Shi wrote:
Yes, the newly forked runnable load was set as full utilisation in the V5 power-aware scheduling patchset. PJT, Mike and I all agree on this. PJT just discussed how to give the full load to a newly forked task, and we reached agreement for my coming V6 power-aware scheduling patchset.
Great! That means I can mark the v5 thread read! :-)
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
During the creation of a sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from the others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:
      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
I suppose this is adequate for the 'small' systems you currently have; but given that Samsung is already bragging about its 'octo'-core Exynos 5 (4+4 big-little thing), does this solution scale?
Isn't this basically related to picking the NO_HZ cpu; if the system isn't fully symmetric with its power gates you want the NO_HZ cpu to be the 'special' cpu. If it is symmetric we really don't care which core is left 'running' and we can even select a new pack cpu from the idle cores once the old one is fully utilized.
Re-using (or integrating) with NO_HZ has the dual advantage that you'll make NO_HZ do the right thing for big-little (you typically want a little core to be the one staying 'awake'), and once someone makes NO_HZ scale, this all gets to scale along with it.
On 26 March 2013 13:46, Peter Zijlstra peterz@infradead.org wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
During the creation of a sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from the others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:
      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
I suppose this is adequate for the 'small' systems you currently have; but given that Samsung is already bragging about its 'octo'-core Exynos 5 (4+4 big-little thing), does this solution scale?
The packing is only done at the MC and CPU levels to minimize the number of transitions.
Isn't this basically related to picking the NO_HZ cpu; if the system isn't fully symmetric with its power gates you want the NO_HZ cpu to be the 'special' cpu. If it is symmetric we really don't care which core is left 'running' and we can even select a new pack cpu from the idle cores once the old one is fully utilized.
I agree that on a symmetric system we don't really care about which core is selected, but we want to use the same one whenever possible to prevent a ping-pong between several cores or groups of cores, which is power consuming. By forcing a NOHZ cpu, your background activity will smoothly pack on this CPU and will not be spread over the system. When a CPU is fully loaded, we are no longer in a low CPU load use case, and the periodic load balance can handle the situation and select a new target CPU close to the buddy CPU.
Re-using (or integrating) with NO_HZ has the dual advantage that you'll make NO_HZ do the right thing for big-little (you typically want a little core to be the one staying 'awake'), and once someone makes NO_HZ scale, this all gets to scale along with it.
I think that you have answered this question in your comment on patch 5, haven't you?
Vincent
Isn't this basically related to picking the NO_HZ cpu; if the system isn't fully symmetric with its power gates you want the NO_HZ cpu to be the 'special' cpu. If it is symmetric we really don't care which core is left 'running' and we can even select a new pack cpu from the idle cores once the old one is fully utilized.
You don't really care much, sure, but there are some advantages to sorting "all the way left", e.g. to linux cpu 0. Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
Isn't this basically related to picking the NO_HZ cpu; if the system isn't fully symmetric with its power gates you want the NO_HZ cpu to be the 'special' cpu. If it is symmetric we really don't care which core is left 'running' and we can even select a new pack cpu from the idle cores once the old one is fully utilized.
You don't really care much, sure, but there are some advantages to sorting "all the way left", e.g. to linux cpu 0. Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
Right, and I suspect all the big-little nonsense will have the little cores on low numbers as well (is this architected or can a creative licensee screw us over?)
So find_new_ilb() already does cpumask_first(), so it has a strong leftmost preference. We just need to make sure it indeed does the right thing and doesn't have some unintended side effect.
On 27 March 2013 09:46, Peter Zijlstra peterz@infradead.org wrote:
On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
Isn't this basically related to picking the NO_HZ cpu; if the system isn't fully symmetric with its power gates you want the NO_HZ cpu to be the 'special' cpu. If it is symmetric we really don't care which core is left 'running' and we can even select a new pack cpu from the idle cores once the old one is fully utilized.
You don't really care much, sure, but there are some advantages to sorting "all the way left", e.g. to linux cpu 0. Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
Right, and I suspect all the big-little nonsense will have the little cores on low numbers as well (is this architected or can a creative licensee screw us over?)
It's not mandatory to have little cores on low numbers even if it's advised
So find_new_ilb() already does cpumask_first(), so it has a strong leftmost preference. We just need to make sure it indeed does the right thing and doesn't have some unintended side effect.
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
It's not mandatory to have little cores on low numbers even if it's advised
ARGH!
I haven't followed this thread closely, so just a random comment from me. An argument from some is that they want to boot Linux on the big CPU to be quicker. The most convenient is to have the big CPU at 0.
On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
It's not mandatory to have little cores on low numbers even if it's advised
ARGH!
I haven't followed this thread closely, so just a random comment from me. An argument from some is that they want to boot Linux on the big CPU to be quicker. The most convenient is to have the big CPU at 0.
I suppose that's almost sensible ;-) I just despair at the amount of variation that's allowed.
I'm guessing that swapping cpus in the bootloader or someplace really early is equally hard in that we (Linux) assume we boot on cpu 0 or something like that?
On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
It's not mandatory to have little cores on low numbers even if it's advised
ARGH!
I haven't followed this thread closely, so just a random comment from me. An argument from some is that they want to boot Linux on the big CPU to be quicker. The most convenient is to have the big CPU at 0.
I suppose that's almost sensible ;-) I just despair at the amount of variation that's allowed.
I'm guessing that swapping cpus in the bootloader or someplace really early is equally hard in that we (Linux) assume we boot on cpu 0 or something like that?
I think it's worth trying (by changing the CPU topology in the DT). At a quick look, I don't see anything hard-coded in the kernel boot sequence. It uses smp_processor_id() which translates to current_thread_info()->cpu on ARM. I'm not sure how early we need this but it's probably after DT parsing, so we could set 'cpu' to a non-zero value for the booting processor. There are a few tweaks in the arch/arm code with cpu_logical_map setup (which maps between smp_processor_id and the actual hardware CPU id and assumes 0 is the booting CPU).
So if the above works, the scheduler guys can mandate that little CPUs are always first and for ARM it would be a matter of getting the right CPU topology in the DT (independent of what hw vendors think of CPU topology) and booting Linux on CPU 4 etc.
On Wed, 27 Mar 2013, Catalin Marinas wrote:
So if the above works, the scheduler guys can mandate that little CPUs are always first and for ARM it would be a matter of getting the right CPU topology in the DT (independent of what hw vendors think of CPU topology) and booting Linux on CPU 4 etc.
Just a note about that: if the scheduler mandates little CPUs first, that should _not_ have any implications on the DT content. DT is not about encoding Linux specific implementation details. It is simple enough to tweak the CPU logical map at run time when enumerating CPUs.
Nicolas
On Wed, Mar 27, 2013 at 05:18:53PM +0000, Nicolas Pitre wrote:
On Wed, 27 Mar 2013, Catalin Marinas wrote:
So if the above works, the scheduler guys can mandate that little CPUs are always first and for ARM it would be a matter of getting the right CPU topology in the DT (independent of what hw vendors think of CPU topology) and booting Linux on CPU 4 etc.
Just a note about that: if the scheduler mandates little CPUs first, that should _not_ have any implications on the DT content. DT is not about encoding Linux specific implementation details. It is simple enough to tweak the CPU logical map at run time when enumerating CPUs.
You are right, though a simpler way (hack) to tweak the cpu_logical_map is to change the DT ;).
But the problem is that the kernel doesn't know which CPU is big and which is little, unless you specify this in some way via the DT. It can be the cpu nodes order or some other means.
On 27 March 2013 17:36, Catalin Marinas catalin.marinas@arm.com wrote:
On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
It's not mandatory to have little cores on low numbers even if it's advised
ARGH!
I haven't followed this thread closely, so just a random comment from me. An argument from some is that they want to boot Linux on the big CPU to be quicker. The most convenient is to have the big CPU at 0.
I suppose that's almost sensible ;-) I just despair at the amount of variation that's allowed.
I'm guessing that swapping cpus in the bootloader or someplace really early is equally hard in that we (Linux) assume we boot on cpu 0 or something like that?
I think it's worth trying (by changing the CPU topology in the DT). At a quick look, I don't see anything hard-coded in the kernel boot sequence. It uses smp_processor_id() which translates to current_thread_info()->cpu on ARM. I'm not sure how early we need this but it's probably after DT parsing, so we could set 'cpu' to a non-zero value for the booting processor. There are a few tweaks in the arch/arm code with cpu_logical_map setup (which maps between smp_processor_id and the actual hardware CPU id and assumes 0 is the booting CPU).
There are a few other places in the code that make the assumption that the 1st online CPU is the booting CPU, like disable_nonboot_cpus.
So if the above works, the scheduler guys can mandate that little CPUs are always first and for ARM it would be a matter of getting the right CPU topology in the DT (independent of what hw vendors think of CPU topology) and booting Linux on CPU 4 etc.
-- Catalin
On Wed, Mar 27, 2013 at 05:20:28PM +0000, Vincent Guittot wrote:
On 27 March 2013 17:36, Catalin Marinas catalin.marinas@arm.com wrote:
On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
It's not mandatory to have little cores on low numbers even if it's advised
ARGH!
I haven't followed this thread closely, so just a random comment from me. An argument from some is that they want to boot Linux on the big CPU to be quicker. The most convenient is to have the big CPU at 0.
I suppose that's almost sensible ;-) I just despair at the amount of variation that's allowed.
I'm guessing that swapping cpus in the bootloader or someplace really early is equally hard in that we (Linux) assume we boot on cpu 0 or something like that?
I think it's worth trying (by changing the CPU topology in the DT). At a quick look, I don't see anything hard-coded in the kernel boot sequence. It uses smp_processor_id() which translates to current_thread_info()->cpu on ARM. I'm not sure how early we need this but it's probably after DT parsing, so we could set 'cpu' to a non-zero value for the booting processor. There are a few tweaks in the arch/arm code with cpu_logical_map setup (which maps between smp_processor_id and the actual hardware CPU id and assumes 0 is the booting CPU).
There are a few other places in the code that make the assumption that the 1st online CPU is the booting CPU, like disable_nonboot_cpus.
Another thing - boot_cpu_init() is using smp_processor_id() before we get the setup_arch() call where we parse the DT.
On Wed, 27 Mar 2013, Vincent Guittot wrote:
On 27 March 2013 09:46, Peter Zijlstra peterz@infradead.org wrote:
On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
Isn't this basically related to picking the NO_HZ cpu; if the system isn't fully symmetric with its power gates you want the NO_HZ cpu to be the 'special' cpu. If it is symmetric we really don't care which core is left 'running' and we can even select a new pack cpu from the idle cores once the old one is fully utilized.
You don't really care much, sure, but there are some advantages to sorting "all the way left", e.g. to linux cpu 0. Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
Right, and I suspect all the big-little nonsense will have the little cores on low numbers as well (is this architected or can a creative licensee screw us over?)
It's not mandatory to have little cores on low numbers even if it's advised
We can trivially move things around in the logical CPU mapping if that simplifies things. However the boot CPU might not be CPU0 in that case which might be less trivial.
Nicolas
If a CPU accesses the runnable_avg_sum and runnable_avg_period fields of its buddy CPU while the latter updates them, it can get the new version of one field and the old version of the other. This can generate erroneous decisions. We don't want to use a lock mechanism to ensure coherency because of the overhead in this critical path. The previous attempt couldn't ensure coherency of both fields on 100% of platforms and use cases, as that depends on the toolchain and the platform architecture. The runnable_avg_period of a runqueue tends to the max value in less than 345ms after plugging a CPU, which implies that we could use the max value instead of reading runnable_avg_period after 345ms. During the starting phase, we must ensure a minimum of coherency between the fields. A simple rule is runnable_avg_sum <= runnable_avg_period.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 021c7b7..b636199 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,13 +3364,16 @@ done:
 static bool is_buddy_busy(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);

 	/*
 	 * A busy buddy is a CPU with a high load or a small load with a lot of
 	 * running tasks.
 	 */
-	return (rq->avg.runnable_avg_sum >
-			(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+	return (sum > (period / (rq->nr_running + 2)));
 }

 static bool is_light_task(struct task_struct *p)
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
@@ -3364,13 +3364,16 @@ done:
 static bool is_buddy_busy(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);
OK this makes sense; use a simple sanity constraint instead of going overboard on serialization -- however, why is this a separate patch?
That is, this could easily be part of the patch that introduces is_buddy_busy(); also you likely want part of this patch's changelog to become a comment that goes right above this min() :-)
On 26 March 2013 13:50, Peter Zijlstra peterz@infradead.org wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
@@ -3364,13 +3364,16 @@ done:
 static bool is_buddy_busy(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);
OK this makes sense; use a simple sanity constraint instead of going overboard on serialization -- however, why is this a separate patch?
There is no real reason other than to explain why I have added this additional check.
That is, this could easily be part of the patch that introduces is_buddy_busy(); also you likely want part of this patch's changelog to become a comment that goes right above this min() :-)
Yes, I'm going to do that.
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b636199..52a7736 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,7 +5455,25 @@ static struct {

 static inline int find_new_ilb(int call_cpu)
 {
+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1)
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}

 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b636199..52a7736 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,7 +5455,25 @@ static struct {

 static inline int find_new_ilb(int call_cpu)
 {
+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1)
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}
/me hands you a fresh bucket of curlies, no reason to skimp on them.
But ha! here's your NO_HZ link.. but does the above DTRT and ensure that the ILB is a little core when possible?
On 26 March 2013 13:52, Peter Zijlstra peterz@infradead.org wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b636199..52a7736 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,7 +5455,25 @@ static struct {

 static inline int find_new_ilb(int call_cpu)
 {
+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1)
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}
/me hands you a fresh bucket of curlies, no reason to skimp on them.
ok
But ha! here's your NO_HZ link.. but does the above DTRT and ensure that the ILB is a little core when possible?
The loop looks for an idle CPU as close as possible to the buddy CPU, and the buddy CPU is the 1st CPU that has been chosen. So if your buddy is a little core and there is an idle little, the ILB will be this idle little.
On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
But ha! here's your NO_HZ link.. but does the above DTRT and ensure that the ILB is a little core when possible?
The loop looks for an idle CPU as close as possible to the buddy CPU, and the buddy CPU is the 1st CPU that has been chosen. So if your buddy is a little and there is an idle little, the ILB will be this idle little.
Earlier you wrote:
      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
buddy | CPU0 | CPU0 | CPU0 | CPU2 |
So extrapolating that to a 4+4 big-little you'd get something like:
      |   little A9   ||    big A15    |
      | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
------+---+---+---+---++---+---+---+---+
buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
Right?
So supposing the current ILB is 6, we'll only check 4, not 0-3, even though there might be a perfectly idle cpu in there.
Also, your scheme fails to pack when cpus 0,4 are filled, even when there's idle cores around.
If we'd use the ILB as packing cpu, we would simply select a next pack target once the old one fills up.
On 26 March 2013 15:42, Peter Zijlstra peterz@infradead.org wrote:
So extrapolating that to a 4+4 big-little you'd get something like:
      |   little A9   ||    big A15    |
      | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
------+---+---+---+---++---+---+---+---+
buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
Right?
yes
So supposing the current ILB is 6, we'll only check 4, not 0-3, even though there might be a perfectly idle cpu in there.
We will check 4,5,7 at MC level in order to pack in the group of A15 (because they are not sharing the same power domain). If none of them are idle, we will look at CPU level and will check CPUs 0-3.
Also, your scheme fails to pack when cpus 0,4 are filled, even when there's idle cores around.
The primary target is to pack the tasks only when the system is not busy, so you get a power improvement without a performance decrease. The is_light_task function returns false and the is_buddy_busy function returns true before the buddy is fully loaded, and the scheduler then falls back to the default behavior, which spreads tasks and races to idle.

We can extend the buddy CPU and the packing mechanism to fill one CPU before filling another buddy, but it's not always the best choice for performance and/or power, and it would thus imply adding a knob to select this full packing mode.
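To make the two checks Vincent refers to concrete, here is a minimal sketch reconstructed from the thresholds given in the cover letter (a light task is runnable less than 20% of the time; a buddy is busy above a 50% load threshold that shrinks with nr_running). The field names follow the per-entity load tracking code, but the exact bodies are assumptions, not the posted patch:

static bool is_light_task(struct task_struct *p)
{
	/* Runnable for less than 20% of its tracked period -> light. */
	return p->se.avg.runnable_avg_sum * 5 <
	       p->se.avg.runnable_avg_period;
}

static bool is_buddy_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/*
	 * Busy when the runnable average exceeds 50% of the period;
	 * the threshold shrinks as nr_running grows so that packing
	 * does not inflate wake up latency.
	 */
	return rq->avg.runnable_avg_sum >
	       rq->avg.runnable_avg_period / (rq->nr_running + 2);
}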
If we'd use the ILB as packing cpu, we would simply select a next pack target once the old one fills up.
Yes, we will be able to pack the long running tasks, and the wake up will take care of the short tasks.
On 03/26/2013 11:55 PM, Vincent Guittot wrote:
We will check 4,5,7 at MC level in order to pack in the group of A15 (because they are not sharing the same power domain). If none of them are idle, we will look at CPU level and will check CPUs 0-3.
So you increase a fixed step here.
We can extend the buddy CPU and the packing mechanism to fill one CPU before filling another buddy, but it's not always the best choice for performance and/or power, and it would thus imply adding a knob to select this full packing mode.
Just one buddy for packing the tasks of a whole level of cpus definitely has a scalability problem. That is not good for powersaving in most scenarios.
On 27 March 2013 05:56, Alex Shi alex.shi@intel.com wrote:
So you increase a fixed step here.
I have modified the find_new_ilb function to look for the best idle CPU instead of just picking the first CPU of idle_cpus_mask.
Just one buddy for packing the tasks of a whole level of cpus definitely has a scalability problem. That is not good for powersaving in most scenarios.
This patch doesn't want to pack all kinds of tasks in all scenarios, but only the small tasks that run for less than 10ms, and only when the CPU is not already too busy with other tasks, so you don't have to cope with long wake up latencies or a performance regression, and only one CPU will be powered up for these background activities. Nevertheless, I can extend the packing of small tasks to pack all tasks in any scenario onto as few CPUs as possible. This would imply choosing a new buddy CPU when the previous one is full, during the ILB selection for example, and adding a knob to select this mode, which will modify the performance of the system. But the primary target is not to have a knob and not to reduce performance in most scenarios.
Regards, Vincent
I have modified the find_new_ilb function to look for the best idle CPU instead of just picking the first CPU of idle_cpus_mask.
That's better. But using a fixed buddy is still not flexible, and it involves more checking in this time-critical balancing. Consider that on most SMP systems the cpus are equal, so any other cpu can play the role of buddy in your design. That suggests that having no fixed buddy cpu is better, as in my version of packing.
This patch doesn't want to pack all kinds of tasks in all scenarios, but only the small tasks that run for less than 10ms, and only when the CPU is not already too busy with other tasks, so you don't have to cope with long wake up latencies or a performance regression, and only one CPU will be powered up for these background activities. Nevertheless, I can extend the packing of small tasks to pack all tasks in any scenario onto as few CPUs as possible. This would imply choosing a new buddy CPU when the previous one is full, during the ILB selection for example, and adding a knob to select this mode, which will modify the performance of the system. But the primary target is not to have a knob and not to reduce performance in most scenarios.
Arguing about the performance/power balance doesn't make much sense without a detailed scenario. We just want to seek a flexible compromise. But a fixed buddy cpu is not flexible, and it may miss many scenarios where powersaving would fit on x86 systems. For example, if 2 SMT cpus can handle all the tasks, we don't need to wake another core; and if 2 cores in one socket can handle the tasks, we also don't need to wake up another socket.
-- Thanks Alex
On 27 March 2013 09:47, Alex Shi alex.shi@intel.com wrote:
Arguing about the performance/power balance doesn't make much sense without a detailed scenario. We just want to seek a flexible compromise. But a fixed buddy cpu is not flexible, and it may miss many scenarios where powersaving would fit on x86 systems. For example, if 2 SMT cpus can handle all the tasks, we don't need to wake another core; and if 2 cores in one socket can handle the tasks, we also don't need to wake up another socket.
Using 2 SMT cpus for all tasks implies accepting latency and sharing resources like cache and memory bandwidth, so it means that you also accept some potential performance decrease, which implies that someone must select this mode with a knob. The primary goal of the patchset is not to select between powersaving and performance but to stay in performance mode: we pack the small tasks on one CPU so that performance does not decrease while the low load scenarios consume less power. Then, I can add another step which will be more aggressive about power saving, with a potential cost in performance, and in this case the buddy CPU will be updated dynamically according to the system load.
Regards, Vincent
On 03/27/2013 06:30 PM, Vincent Guittot wrote:
Using 2 SMT cpus for all tasks implies accepting latency and sharing resources like cache and memory bandwidth, so it means that you also accept some potential performance decrease, which implies that someone must select this mode with a knob. The primary goal of the patchset is not to select between powersaving and performance but to stay in performance mode: we pack the small tasks on one CPU so that performance does not decrease while the low load scenarios consume less power. Then, I can add another step which will be more aggressive about power saving, with a potential cost in performance, and in this case the buddy CPU will be updated dynamically according to the system load.
Prediction of small task behavior is often wrong, so for performance purposes, packing tasks is a bad idea.
On Wed, 2013-03-27 at 12:56 +0800, Alex Shi wrote:
Just one buddy for packing the tasks of a whole level of cpus definitely has a scalability problem.
Right, but note we already have this scalability problem in the form of the ILB. Some people were working on sorting that but then someone changed jobs and I think it fell in some deep and dark crack.
Venki, Suresh?
Peter,
After some thoughts about your comments, I can update the buddy cpu during the ILB or periodic LB to a new idle core and extend the packing mechanism. Does this additional mechanism sound better to you?
Vincent
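A purely hypothetical sketch of what Vincent proposes here, i.e. letting the load balancer move the pack buddy to a new idle core once the current one fills up. update_pack_buddy() does not exist in the posted series, and is_buddy_busy() is the check described earlier in the thread:

static void update_pack_buddy(int cpu)
{
	int buddy = per_cpu(sd_pack_buddy, cpu);
	int new;

	/* Keep the current buddy as long as it can absorb small tasks. */
	if (buddy == -1 || !is_buddy_busy(buddy))
		return;

	/* Pick an idle CPU as the next pack target. */
	new = cpumask_first(nohz.idle_cpus_mask);
	if (new < nr_cpu_ids)
		per_cpu(sd_pack_buddy, cpu) = new;
}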
Hi Vincent,
On 04/05/2013 04:38 PM, Vincent Guittot wrote:
Peter,
After some thoughts about your comments, I can update the buddy cpu during the ILB or periodic LB to a new idle core and extend the packing mechanism. Does this additional mechanism sound better to you?
If the primary goal of this patchset is to pack small tasks in fewer power domains, then why not see if the power aware scheduler patchset by Alex does the same for you? The reasons being:

a. The power aware scheduler also checks if a task is small enough to be packed on a cpu which has just enough capacity to take on that task (the leader cpu). This cpu belongs to a scheduler group which is nearly full (the group_leader), so we end up packing tasks.

b. The overhead of assigning a buddy cpu gets eliminated because the best cpu for packing is decided during wake up.

c. This is a scalable solution because if the leader cpu is busy, then any other idle cpu from that group_leader is chosen. Eventually you end up packing anyway.

The reason that I am suggesting this is that we could unify the power awareness of the scheduler under one umbrella. And I believe that the current power aware scheduler patchset is flexible enough to do this and that we must cash in on it.
Thanks
Regards Preeti U Murthy
On Monday, 22 April 2013, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Preeti,
I have had a look at Alex's patches but I have some concerns with them:
- There is no notion of power domain, which is quite important when we speak about power saving IMHO. Packing tasks is interesting if the idle CPUs can reach a useful low power state independently from busy CPUs. Architectures have different low power state capabilities, which must be taken into account. In addition, you can have systems whose CPUs have better power efficiency, and this kind of system is not taken into account.
- There is some computation of statistics on a potentially large number of cpus and groups at each task wake up. This overhead concerns me, and such an amount of computation should only be done when we have more time, like during the periodic load balance.
- There are some heuristics that will be hard to tune:
  * the powersaving balance period set to 8*max_interval
  * power saving can do some performance load balance if there was no performance load balance in the last 32 balances, with no more than 4 perf balances in the last 64 balances
  * sched_burst_threshold
I'm going to send a proposal for a more aggressive and scalable mode of my patches which will take care of my concerns. Let's see how this new patchset can fit with Alex's ones.
Regards, Vincent
On 22 April 2013 21:57, "Vincent Guittot" vincent.guittot@linaro.org wrote:
Thank you, Preeti and Vincent, for discussing the power aware scheduler in detail! I believe this open discussion is helpful for reaching a more comprehensive solution. :)
Hi Preeti,
I have had a look at Alex's patches but I have some concerns with them:
- There is no notion of power domain, which is quite important when we speak about power saving IMHO. Packing tasks is interesting if the idle CPUs can reach a useful low power state independently from busy CPUs. Architectures have different low power state capabilities, which must be taken into account. In addition, you can have systems whose CPUs have better power efficiency, and this kind of system is not taken into account.
I agree with you on this point, and I like what you've done in adding a new flag in the sched domain. It also makes it easy for the scheduler to pick up new ideas in balancing. BTW, currently my balancing tries to pack tasks per SMT; maybe packing tasks per cpu horsepower is more compatible with other archs?
- There is some computation of statistics on a potentially large number of cpus and groups at each task wake up. This overhead concerns me, and such an amount of computation should only be done when we have more time, like during the periodic load balance.
Usually, some computation is far lighter than a task migration. If the computation helps to reduce future migrations, it saves a lot. On the current code, I observed that fork balancing can distribute tasks well under the powersaving policy. That means the computation is worth it.
- There are some heuristics that will be hard to tune:
  * the powersaving balance period set to 8*max_interval
  * power saving can do some performance load balance if there was no performance load balance in the last 32 balances, with no more than 4 perf balances in the last 64 balances
Do you have other tuning opinions on the numbers? I am glad to hear any input.
  * sched_burst_threshold
I found it useful on the 3.8 kernel, when aim7 caused very imbalanced wakeups, but aim7 has calmed down now that the lock-stealing rwsem is used in the kernel, so this may need to be re-evaluated on a future version.
I'm going to send a proposal for a more aggressive and scalable mode of my patches which will take care of my concerns. Let's see how this new patchset can fit with Alex's ones.
Hi Alex,
I have one point below.
On 04/23/2013 07:53 AM, Alex Shi wrote:
I agree with you on this point, and I like what you've done in adding a new flag in the sched domain. It also makes it easy for the scheduler to pick up new ideas in balancing. BTW, currently my balancing tries to pack tasks per SMT; maybe packing tasks per cpu horsepower is more compatible with other archs?
Correct me if I am wrong, but the scheduler today does not compare the task load to the destination cpu power before moving the task to the destination cpu. This could be during:

1. Load balancing: In move_tasks(), only the imbalance is verified against the task load before moving tasks; it does not necessarily check if the destination cpu has enough cpu power to handle these tasks.

2. select_task_rq_fair(): For a forked task, the idlest cpu in the group leader is found during power save balance (I am focusing only on the power save policy) and is returned as the destination cpu for the forked task. But I feel we need to check if the idle cpu has the cpu power to handle the task load.

Why I am bringing up this point is due to a use case which we might need to handle in the power aware scheduler going ahead: the big.LITTLE cpus. We would ideally want the short running tasks on the LITTLE cpus and the long running tasks on the big cpus.

While the power aware scheduler strives to pack tasks, it should not end up packing long running tasks on LITTLE cpus. Not having big cpus handle short running tasks is the next step of course, but at least we should not throttle the long running tasks by scheduling them on LITTLE cpus.
Thanks
Regards Preeti U Murthy
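A hypothetical helper for the point Preeti raises above: before packing a task onto a destination, compare the task's tracked load with the destination's cpu power so that long running tasks are not squeezed onto a LITTLE cpu. The helper, its name and the 80% headroom are illustrative assumptions, not code from either patchset:

static bool task_fits_cpu(struct task_struct *p, int cpu)
{
	unsigned long task_load = p->se.avg.load_avg_contrib;

	/* Leave some headroom: the task should use at most ~80%. */
	return task_load * 5 <= power_of(cpu) * 4;
}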
On 4/22/2013 7:23 PM, Alex Shi wrote:
I agree with you on this point, and I like what you've done in adding a new flag in the sched domain.
For x86 we should not be setting such flag then; we don't have a way for some cpu packages to go to an extra deep power state if they're completely idle. (this afaik is true for both Intel and AMD)
On Tue, Apr 23, 2013 at 08:30:58AM -0700, Arjan van de Ven wrote:
For x86 we should not be setting such flag then; we don't have a way for some cpu packages to go to an extra deep power state if they're completely idle. (this afaik is true for both Intel and AMD)
You say 'some'; this implies we do for others, right?
We can dynamically set the required flags in the arch setup when it makes sense.
Hi Vincent,
Thank you very much for bringing out the differences between your goals and the working of the power aware scheduler patchset. This was essential for us to understand the various requirements from a power aware scheduler. After you post the patchset we could try and evaluate the following points again.
Thanks
Regards Preeti U Murthy
On 04/23/2013 01:27 AM, Vincent Guittot wrote:
On Monday, 22 April 2013, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 04/05/2013 04:38 PM, Vincent Guittot wrote:
Peter,
After some toughts about your comments,I can update the buddy cpu during ILB or periofdic LB to a new idle core and extend the packing mechanism Does this additional mechanism sound better for you ?
Hi Preeti,
I have had a look at Alex's patches but I have some concerns with them:
- There is no notion of power domain, which is quite important when we speak about power saving IMHO. Packing tasks is interesting if the idle CPUs can reach a useful low power state independently from busy CPUs. Architectures have different low power state capabilities, which must be taken into account. In addition, you can have systems whose CPUs have better power efficiency, and this kind of system is not taken into account.
- There is some computation of statistics on a potentially large number of cpus and groups at each task wake up. This overhead concerns me, and such an amount of computation should only be done when we have more time, like during the periodic load balance.
- There are some heuristics that will be hard to tune:
  * the powersaving balance period set to 8*max_interval
  * power saving can do some performance load balance if there was no performance load balance in the last 32 balances, with no more than 4 perf balances in the last 64 balances
  * sched_burst_threshold
I'm going to send a proposal for a more aggressive and scalable mode of my patches which will take care of my concerns. Let's see how this new patchset can fit with Alex's ones.
Regards, Vincent
The ARM platforms take advantage of packing small tasks on a few cores. This is true even when the cores of a cluster can't be power gated independently. So we clear SD_SHARE_POWERDOMAIN at the MC and CPU levels.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 79282eb..f89a4a2 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {}
  */
 struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_local_flags(int level)
+{
+	/* Powergate at threading level doesn't make sense */
+	if (level & SD_SHARE_CPUPOWER)
+		return 1*SD_SHARE_POWERDOMAIN;
+
+	return 0*SD_SHARE_POWERDOMAIN;
+}
+
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	return &cpu_topology[cpu].core_sibling;
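A sketch of how such a hook might be consumed when a topology level's flags are built; the real generic-side wiring is in patch 2/6 of the series, which is not quoted in this thread, so the function below is only an illustration:

static int sd_level_flags(int level_flags)
{
	/* Let the arch decide whether this level shares a power domain. */
	return level_flags | arch_sd_local_flags(level_flags);
}

With this, an ARM MC or CPU level (no SD_SHARE_CPUPOWER) ends up without SD_SHARE_POWERDOMAIN, so its groups can be power gated independently, while an SMT level would keep the flag set.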
Hi Vincent,
The power aware scheduler patchset that has been released by Alex recently [https://lkml.org/lkml/2013/2/18/4] also has the idea of packing small tasks. In that patchset, the waking of small tasks is done on a leader cpu, which of course varies dynamically. https://lkml.org/lkml/2013/2/18/6
I will refer henceforth to the above patchset as Patch A and your current patchset in this mail as Patch B.
The difference in Patch A is that it packs small tasks onto a leader cpu, i.e. a cpu which has just enough capacity to take on a light weight task; essentially, a cpu belonging to a group which has a favourable grp_power, with the leader cpu having a suitable rq->util.

In Patch B, you are choosing a buddy cpu belonging to a group with the least cpu power, and deciding to pack tasks onto the buddy cpu again boils down to verifying that the grp_power is favourable and that the buddy cpu has a suitable rq->util.
In my opinion, this goes to show that both patches use similar metrics to decide if a cpu is capable of handling a light weight task. The difference lies in how far the search for the appropriate cpu can proceed. While Patch A continues to search at all levels of sched domains till the last to find the appropriate cpu, Patch B queries only the buddy CPU.

The point I am trying to make is that, considering the above similarities and differences, it is worth seeing whether the ideas that both these patches bring in can be merged into one, since both of them have the common goal of packing small tasks.
Thanks
Regards Preeti U Murthy
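As a rough illustration of the buddy choice Preeti describes for Patch B, a buddy could be picked at sched_domain build time from the group with the least cpu power. This sketch only mirrors the described policy; the helper and its exact semantics are assumptions, not the code of patch 3/6:

static int find_pack_buddy(int cpu, struct sched_domain *sd)
{
	struct sched_group *sg = sd->groups;
	struct sched_group *pack = sg;
	struct sched_group *tmp;

	/* Keep the least powerful group of this domain. */
	for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
		if (tmp->sgp->power < pack->sgp->power)
			pack = tmp;
	}

	/* If @cpu already sits in the chosen group, no remote buddy (-1). */
	if (cpumask_test_cpu(cpu, sched_group_cpus(pack)))
		return -1;

	return cpumask_first(sched_group_cpus(pack));
}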
On 03/22/2013 05:55 PM, Vincent Guittot wrote:
Previous results for V2:
This series has been tested with MP3 play back on ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 cluster).
The measurements have been done on an Ubuntu image during 60 seconds of playback and the result has been normalized to 100.
        | CA15 | CA7 | total |
default |   81 |  97 |   178 |
pack    |   13 | 100 |   113 |
Previous results for V1:
The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have demonstrated that it's worth packing small tasks at all topology levels.
The performance tests have been done on both platforms with sysbench. The results don't show any performance regressions. These results are aligned with the policy which uses the normal behavior with heavy use cases.
test: sysbench --test=cpu --num-threads=N --max-requests=R run
The results below are the average duration of 3 tests on the quad CA-9.
default is the current scheduler behavior (pack buddy CPU is -1)
pack is the scheduler with the pack mechanism
             | default |  pack   |
N=8;  R=200  |  3.1999 |  3.1921 |
N=8;  R=2000 | 31.4939 | 31.4844 |
N=12; R=200  |  3.2043 |  3.2084 |
N=12; R=2000 | 31.4897 | 31.4831 |
N=16; R=200  |  3.1774 |  3.1824 |
N=16; R=2000 | 31.4899 | 31.4897 |
The power consumption tests have been done only on TC2 platform which has got accessible power lines and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
The measurements have been done during 16 seconds and the result has been normalized to 100
        | CA15 | CA7 | total |
default |  100 |  40 |   140 |
pack    |   <1 |  45 |   <46 |
The A15 cluster is less power efficient than the A7 cluster, but if we assume that the tasks are well spread on both clusters, we can roughly estimate that the power consumption on a dual cluster of CA7 would have been, for a default kernel:
        | CA7 | CA7 | total |
default |  40 |  40 |    80 |
Vincent Guittot (6):
  Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
  sched: pack small tasks
  sched: secure access to other CPU statistics
  sched: pack the idle load balance
  ARM: sched: clear SD_SHARE_POWERDOMAIN

 arch/arm/kernel/topology.c       |   9 +++
 arch/ia64/include/asm/topology.h |   1 +
 arch/tile/include/asm/topology.h |   1 +
 include/linux/sched.h            |   9 +--
 include/linux/topology.h         |   4 +
 kernel/sched/core.c              |  14 ++--
 kernel/sched/fair.c              | 149 +++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h             |  14 ++--
 8 files changed, 169 insertions(+), 32 deletions(-)
On 23 March 2013 12:55, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Preeti,
I agree that the final goal is similar, but we also have important differences to keep in mind, among which: the sharing (or not) of a power domain, which is used to decide where it's worth packing; the selection of the leader CPU, which is the most power efficient core; and the use of a knob to select a policy. So, you're right that we should find a way to have both working together while still keeping these specificities.
Regards, Vincent