This patchset uses the per-entity load tracking patchset, which will soon be available in the kernel. It is based on the tip/master tree (HEAD at b654f92c06e562c) before the integration of the per-entity load tracking patchset. Only the first 8 of the latest sched: per-entity load tracking patches have been imported to the tree from Peter's quilt series (when they were present), to avoid the complexities of task groups and to hold back the optimizations of that patchset for now. This patchset is based at that level. Refer to https://lkml.org/lkml/2012/10/12/9; this series is a continuation of the patchset in that link.
This patchset is an attempt to begin the integration of PJT's metric with the load balancer in a stepwise fashion and to progress based on the consequences. It has been tested with a config that excludes CONFIG_FAIR_GROUP_SCHED.
The following issues have been considered towards this [NOTE: an x% task referred to in the logs and below is one calculated over a duty cycle of 10ms]:
1. Consider a scenario where there are two 10% tasks running on a CPU. The present code considers the load on this runqueue to be 2048, while with PJT's metric the load is calculated to be <1000, rarely exceeding this limit. Although the tasks are not contributing much to the CPU load, the scheduler decides to move them.
But one could argue that not moving one of these tasks could throttle them; if there were an idle CPU, perhaps we could have moved them. While the power-save mode would have been fine with not moving the task, the performance mode would prefer not to throttle the tasks. We could strive to strike a balance by making this decision tunable with certain parameters. This patchset includes such tunables. This issue is addressed in Patches 1 and 2.
*The advantage of this behavior of PJT's metric has been demonstrated via an experiment.* Please see the reply to this cover letter, to be posted right away.
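To make these numbers concrete, below is a toy user-space model of the decayed runnable average that PJT's metric is built on (an illustration only, not the kernel's fixed-point implementation): each ~1ms period the accumulated contribution decays by a factor y with y^32 = 1/2. A task that runs 1ms out of every 10ms settles at roughly 10% of the nice-0 weight of 1024, i.e. around 100, which is why two such tasks together stay well below 1024:

/* pjt_model.c - toy model of the decayed runnable average for a 10% task */
#include <stdio.h>
#include <math.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* per-period decay, y^32 = 0.5 */
	double runnable_sum = 0.0, period_sum = 0.0;
	int ms;

	for (ms = 0; ms < 1000; ms++) {		/* simulate one second */
		int running = (ms % 10) == 0;	/* ~1ms busy in every 10ms cycle */

		runnable_sum = runnable_sum * y + (running ? 1.0 : 0.0);
		period_sum   = period_sum * y + 1.0;
	}
	/* scale the runnable fraction by the nice-0 task weight of 1024 */
	printf("approximate tracked load: %.0f\n",
	       1024.0 * runnable_sum / period_sum);
	return 0;
}

Compiled with 'gcc -o pjt_model pjt_model.c -lm', this prints a value close to 100, against an instantaneous load.weight contribution of 1024 for the same task.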
2. We need to do this cautiously, as the scheduler code is complex. This patchset is an attempt to begin the integration of PJT's metric with the load balancer in a stepwise fashion and to progress based on the consequences. *What this patchset essentially does is replace the existing metric with PJT's metric for load-balancing decisions in two primary places of the scheduler*:
   1. load_balance()
   2. select_task_rq_fair()
The description of the patches is below:
Patch[1/13]: This patch aims at detecting short running tasks and preventing their movement. In update_sg_lb_stats, dismiss a sched group as a candidate for load balancing if the load calculated by PJT's metric says that the average load on the sched_group is <= 1024 + (0.15 * 1024). This is a tunable, which can be varied after sufficient experiments.
Patch[2/13]: In the current scheduler a greater load is taken to mean a larger number of tasks. Therefore, when the busiest group is picked from the sched domain in update_sd_lb_stats, only the loads of the groups are compared. If we use PJT's metric, a higher load does not necessarily mean a larger number of tasks. This patch addresses this issue.
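For example, going by the figures quoted later in this thread, a single 100% task contributes roughly 1024 of tracked load while three 10% tasks together contribute only around 512, even though the latter group has three times as many tasks queued; comparing group loads alone would therefore hide the group with more tasks.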
Patch[3/13] to Patch[13/13]: Replace the existing metrics that decide load balancing and that select a runqueue for load placement with PJT's metric, and subsequently use PJT's metric for scheduling.
3. The primary advantage that I see in integrating PJT's metric with the core scheduler is listed below:
1. Excluding short running tasks from being candidates for load balancing. This would avoid unnecessary migrations when the CPU is not sufficiently loaded.This advantage has been portrayed in the results of the experiment.
Run the workload attached. There are 8 threads spawned, each being a 10% task. The number of migrations was measured from /proc/schedstat.
Machine: 1-socket, 4-core pre-Nehalem.
Experimental setup:

  cat /proc/schedstat > stat_initial
  gcc -Wall -Wshadow -lpthread -o test test.c
  ./test
  cat /proc/schedstat > stat_final

The difference in the number of pull requests between these two files has been calculated and is shown below:
Observations:
                                 With_Patchset    Without_Patchset
---------------------------------------------------------------------
Average_number_of_migrations         0                   46
Average_number_of_records/s       971,114             945,158
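That is, the patched kernel avoided all of the roughly 46 migrations while processing about 2.7% more records per second ((971,114 - 945,158) / 945,158).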
With more memory intensive workloads, a higher difference in the number of migrations is seen without any performance compromise.
---
Preeti U Murthy (13):
      sched:Prevent movement of short running tasks during load balancing
      sched:Pick the apt busy sched group during load balancing
      sched:Decide whether there be transfer of loads based on the PJT's metric
      sched:Decide group_imb using PJT's metric
      sched:Calculate imbalance using PJT's metric
      sched:Changing find_busiest_queue to use PJT's metric
      sched:Change move_tasks to use PJT's metric
      sched:Some miscellaneous changes in load_balance
      sched:Modify check_asym_packing to use PJT's metric
      sched:Modify fix_small_imbalance to use PJT's metric
      sched:Modify find_idlest_group to use PJT's metric
      sched:Modify find_idlest_cpu to use PJT's metric
      sched:Modifying wake_affine to use PJT's metric
 kernel/sched/fair.c | 262 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 186 insertions(+), 76 deletions(-)
Prevent sched groups with low load, as tracked by PJT's metric, from being candidates for the load balance routine. The threshold is chosen to be 1024 + 15% * 1024. Using PJT's metric it has been observed that even when three 10% tasks are running, the load sometimes does not exceed this threshold. The call should be taken on whether the tasks can afford to be throttled.
This is why an additional check has been included, which can determine how long we can tolerate tasks not being moved even if the load is low.
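(The value 1178 used in the check below is simply this threshold worked out: 1024 + 0.15 * 1024 = 1177.6, rounded up to 1178, i.e. 15% above the tracked load of a single always-running nice-0 task.)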
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dbddcf6..e02dad4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4188,6 +4188,7 @@ struct sd_lb_stats {
  */
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	u64 avg_cfs_runnable_load; /* Equivalent of avg_load but calculated using PJT's metric */
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
@@ -4504,6 +4505,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	unsigned long load, max_cpu_load, min_cpu_load;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
+	u64 group_load = 0; /* computed using PJT's metric */
 	int i;
 
 	if (local_group)
@@ -4548,6 +4550,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 
+		group_load += cpu_rq(i)->cfs.runnable_load_avg;
 		update_sg_numa_stats(sgs, rq);
 	}
 
@@ -4572,6 +4575,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
 
 	/*
+	 * Check if the sched group has not crossed the threshold.
+	 *
+	 * Also check if the sched_group although being within the threshold,is not
+	 * queueing too many tasks.If yes to both,then make it an
+	 * invalid candidate for load balancing
+	 *
+	 * The below condition is included as a tunable to meet performance and power needs
+	 */
+	sgs->avg_cfs_runnable_load = (group_load * SCHED_POWER_SCALE) / group->sgp->power;
+	if (sgs->avg_cfs_runnable_load <= 1178 && sgs->sum_nr_running <= 2)
+		sgs->avg_cfs_runnable_load = 0;
+
+	/*
 	 * Consider the group unbalanced when the imbalance is larger
 	 * than the average weight of a task.
 	 *
If a sched group has passed the test for sufficient load in update_sg_lb_stats to qualify for load balancing, then PJT's metric has to be used to qualify the right sched group as the busiest group.
The scenario which led to this patch is shown below. Consider Task1 and Task2 to be long running tasks, and Tasks 3, 4, 5 and 6 to be short running tasks:
            Task3
            Task4
Task1       Task5
Task2       Task6
------      ------
SCHED_GRP1  SCHED_GRP2
The normal load calculator would qualify SCHED_GRP2 as the candidate for sd->busiest due to the following loads that it calculates:
SCHED_GRP1: 2048
SCHED_GRP2: 4096
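(These are the summed instantaneous weights: two always-running nice-0 tasks at 1024 each in SCHED_GRP1, and four queued nice-0 tasks at 1024 each in SCHED_GRP2, irrespective of how little the short tasks actually run.)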
The load calculator based on PJT's metric, however, would qualify SCHED_GRP1 as the candidate for sd->busiest due to the following loads that it calculates:
SCHED_GRP1: 3200
SCHED_GRP2: 1156
This patch aims to strike a balance between the loads of the group and the number of tasks running on the group to decide the busiest group in the sched_domain.
This means we will need to use PJT's metric, but with an additional constraint.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e02dad4..aafa3c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,7 +165,8 @@ void sched_init_granularity(void)
 #else
 # define WMULT_CONST	(1UL << 32)
 #endif
-
+#define NR_THRESHOLD 2
+#define LOAD_THRESHOLD 1
 #define WMULT_SHIFT	32
 
 /*
@@ -4169,6 +4170,7 @@ struct sd_lb_stats {
 	/* Statistics of the busiest group */
 	unsigned int  busiest_idle_cpus;
 	unsigned long max_load;
+	u64 max_sg_load; /* Equivalent of max_load but calculated using pjt's metric*/
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
@@ -4628,8 +4630,24 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 				   struct sched_group *sg,
 				   struct sg_lb_stats *sgs)
 {
-	if (sgs->avg_load <= sds->max_load)
-		return false;
+	/* Use PJT's metrics to qualify a sched_group as busy
+	 *
+	 * But a low load sched group may be queueing up many tasks
+	 * So before dismissing a sched group with lesser load,ensure
+	 * that the number of processes on it is checked if it is
+	 * not too less loaded than the max load so far
+	 *
+	 * But as of now as LOAD_THRESHOLD is 1,this check is a nop.
+	 * But we could vary LOAD_THRESHOLD suitably to bring in this check
+	 */
+	if (sgs->avg_cfs_runnable_load <= sds->max_sg_load) {
+		if (sgs->avg_cfs_runnable_load > LOAD_THRESHOLD * sds->max_sg_load) {
+			if (sgs->sum_nr_running <= (NR_THRESHOLD + sds->busiest_nr_running))
+				return false;
+		} else {
+			return false;
+		}
+	}
 
 	if (sgs->sum_nr_running > sgs->group_capacity)
 		return true;
@@ -4708,6 +4726,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		sds->this_idle_cpus = sgs.idle_cpus;
 	} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
 		sds->max_load = sgs.avg_load;
+		sds->max_sg_load = sgs.avg_cfs_runnable_load;
 		sds->busiest = sg;
 		sds->busiest_nr_running = sgs.sum_nr_running;
 		sds->busiest_idle_cpus = sgs.idle_cpus;
Additional parameters relevant to the sched group and sched domain load balancing statistics, calculated using per-entity load tracking, are introduced and used.
The intention is to begin to use PJT's metric in the first level of load balancing to see if the current sched group is capable of pulling tasks upon itself.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aafa3c1..67a916d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4157,12 +4157,16 @@ struct sd_lb_stats {
 	struct sched_group *busiest; /* Busiest group in this sd */
 	struct sched_group *this;  /* Local group in this sd */
 	unsigned long total_load;  /* Total load of all groups in sd */
+	u64 total_sgs_load;	/* Equivalent to total_load except using PJT's metrics */
 	unsigned long total_pwr;   /* Total power of all groups in sd */
 	unsigned long avg_load;	   /* Average load across all groups in sd */
+	u64 avg_sgs_load;	/* Equivalent to avg_load but calculated with PJT's metrics */
 
 	/** Statistics of this group */
 	unsigned long this_load;
+	u64 this_sg_load; /* Equivalent to this_load but calculated using PJT's metric*/
 	unsigned long this_load_per_task;
+	u64 this_sg_load_per_task; /* Equivalent to this_load_per_task but using PJT's metric*/
 	unsigned long this_nr_running;
 	unsigned long this_has_capacity;
 	unsigned int  this_idle_cpus;
@@ -4170,8 +4174,9 @@ struct sd_lb_stats {
 	/* Statistics of the busiest group */
 	unsigned int  busiest_idle_cpus;
 	unsigned long max_load;
-	u64 max_sg_load; /* Equivalent of max_load but calculated using pjt's metric*/
+	u64 max_sg_load; /* Equivalent of max_load but calculated using PJT's metric*/
 	unsigned long busiest_load_per_task;
+	u64 busiest_sg_load_per_task; /*Equivalent of busiest_load_per_task but using PJT's metric*/
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
 	unsigned long busiest_has_capacity;
@@ -4192,6 +4197,7 @@ struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	u64 avg_cfs_runnable_load; /* Equivalent of avg_load but calculated using PJT's metric */
 	unsigned long group_load; /* Total load over the CPUs of the group */
+	u64 group_cfs_runnable_load; /* Equivalent of group_load but calculated using PJT's metric */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long group_capacity;
@@ -4507,7 +4513,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	unsigned long load, max_cpu_load, min_cpu_load;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
-	u64 group_load = 0; /* computed using PJT's metric */
 	int i;
 
 	if (local_group)
@@ -4552,7 +4557,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 
-		group_load += cpu_rq(i)->cfs.runnable_load_avg;
+		/* Tracking load using PJT's metric */
+		sgs->group_cfs_runnable_load += cpu_rq(i)->cfs.runnable_load_avg;
 		update_sg_numa_stats(sgs, rq);
 	}
 
@@ -4585,8 +4591,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	 *
 	 * The below condition is included as a tunable to meet performance and power needs
 	 */
-	sgs->avg_cfs_runnable_load = (group_load * SCHED_POWER_SCALE) / group->sgp->power;
-	if (sgs->avg_cfs_runnable_load <= 1178 && sgs->sum_nr_running <= 2)
+	sgs->avg_cfs_runnable_load = (sgs->group_cfs_runnable_load * SCHED_POWER_SCALE) / group->sgp->power;
+	if (sgs->avg_cfs_runnable_load <= 1178 && sgs->sum_nr_running <= 2 && !local_group)
 		sgs->avg_cfs_runnable_load = 0;
 
 	/*
@@ -4702,6 +4708,8 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			return;
 
 		sds->total_load += sgs.group_load;
+		/* Tracking load using PJT's metrics */
+		sds->total_sgs_load += sgs.group_cfs_runnable_load;
 		sds->total_pwr += sg->sgp->power;
 
 		/*
@@ -4719,9 +4727,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
+			sds->this_sg_load = sgs.avg_cfs_runnable_load;
 			sds->this = sg;
 			sds->this_nr_running = sgs.sum_nr_running;
 			sds->this_load_per_task = sgs.sum_weighted_load;
+			sds->this_sg_load_per_task = sgs.group_cfs_runnable_load;
 			sds->this_has_capacity = sgs.group_has_capacity;
 			sds->this_idle_cpus = sgs.idle_cpus;
 		} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
@@ -4732,6 +4742,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->busiest_idle_cpus = sgs.idle_cpus;
 			sds->busiest_group_capacity = sgs.group_capacity;
 			sds->busiest_load_per_task = sgs.sum_weighted_load;
+			sds->busiest_sg_load_per_task = sgs.group_cfs_runnable_load;
 			sds->busiest_has_capacity = sgs.group_has_capacity;
 			sds->busiest_group_weight = sgs.group_weight;
 			sds->group_imb = sgs.group_imb;
@@ -4972,6 +4983,7 @@ find_busiest_group(struct lb_env *env, int *balance)
 		goto out_balanced;
 
 	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
+	sds.avg_sgs_load = (SCHED_POWER_SCALE * sds.total_sgs_load) / sds.total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
@@ -4990,14 +5002,16 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 * If the local group is more busy than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (sds.this_load >= sds.max_load)
+	/* The following metrics has been changed to test PJT's metric */
+	if (sds.this_sg_load >= sds.max_sg_load)
 		goto out_balanced;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	if (sds.this_load >= sds.avg_load)
+	/* The following metrics has been changed to test PJT's metric */
+	if (sds.this_sg_load >= sds.avg_sgs_load)
 		goto out_balanced;
 
 	if (env->idle == CPU_IDLE) {
@@ -5015,7 +5029,10 @@ find_busiest_group(struct lb_env *env, int *balance)
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 		 * imbalance_pct to be conservative.
 		 */
-		if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
+		/* The following metrics has been changed to test PJT's
+		 * metric
+		 */
+		if (100 * sds.max_sg_load <= env->sd->imbalance_pct * sds.this_sg_load)
 			goto out_balanced;
 	}
 
Additional parameters for deciding a sched group's imbalance status, calculated using per-entity load tracking, are used.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 67a916d..77363c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3748,6 +3748,7 @@ struct lb_env {
 	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
+	long long		load_imbalance; /* PJT metric equivalent of imbalance */
 	/* The set of CPUs under consideration for load-balancing */
 	struct cpumask		*cpus;
 
@@ -4513,6 +4514,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	unsigned long load, max_cpu_load, min_cpu_load;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
+
+	/* Decide imb based on PJT's metric */
+	u64 cpu_runnable_load, max_cpu_runnable_load, min_cpu_runnable_load;
+	u64 avg_sg_load_per_task = 0;
+
 	int i;
 
 	if (local_group)
@@ -4521,6 +4527,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
+	max_cpu_runnable_load = 0;
+	min_cpu_runnable_load = ~0ULL;
 	max_nr_running = 0;
 	min_nr_running = ~0UL;
 
@@ -4545,6 +4553,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			if (min_cpu_load > load)
 				min_cpu_load = load;
 
+			cpu_runnable_load = cpu_rq(i)->cfs.runnable_load_avg;
+			if (cpu_runnable_load > max_cpu_runnable_load)
+				max_cpu_runnable_load = cpu_runnable_load;
+			if (min_cpu_runnable_load > cpu_runnable_load)
+				min_cpu_runnable_load = cpu_runnable_load;
+
 			if (nr_running > max_nr_running)
 				max_nr_running = nr_running;
 			if (min_nr_running > nr_running)
@@ -4604,10 +4618,13 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	 * normalized nr_running number somewhere that negates
 	 * the hierarchy?
 	 */
-	if (sgs->sum_nr_running)
+	if (sgs->sum_nr_running) {
 		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+		avg_sg_load_per_task = sgs->group_cfs_runnable_load / sgs->sum_nr_running;
+	}
 
-	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
+	/* The following decision is made on PJT's metric */
+	if ((max_cpu_runnable_load - min_cpu_runnable_load) >= avg_sg_load_per_task &&
 	    (max_nr_running - min_nr_running) > 1)
 		sgs->group_imb = 1;
 
@@ -5047,6 +5064,7 @@ out_balanced:
 
 ret:
 	env->imbalance = 0;
+	env->load_imbalance = 0;
 	return NULL;
 }
Additional parameters which decide the amount of imbalance in the sched domain, calculated using PJT's metric, are used.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77363c6..fca6606 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4892,12 +4892,14 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
+	/* Additional parameters introduced to use PJT's metric */
+	u64 max_load_pull, load_above_busiest_capacity = ~0ULL;
 
-	sds->busiest_load_per_task /= sds->busiest_nr_running;
+	/* Calculation using PJT's metric */
+	sds->busiest_sg_load_per_task /= sds->busiest_nr_running;
 	if (sds->group_imb) {
-		sds->busiest_load_per_task =
-			min(sds->busiest_load_per_task, sds->avg_load);
+		sds->busiest_sg_load_per_task =
+			min(sds->busiest_sg_load_per_task, sds->avg_sgs_load);
 	}
 
 	/*
@@ -4905,21 +4907,24 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * max load less than avg load(as we skip the groups at or below
 	 * its cpu_power, while calculating max_load..)
 	 */
-	if (sds->max_load < sds->avg_load) {
+	if (sds->max_sg_load < sds->avg_sgs_load) {
 		env->imbalance = 0;
+		env->load_imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
 
	if (!sds->group_imb) {
		/*
		 * Don't want to pull so many tasks that a group would go idle.
+		 * Also the below change due to the integration with PJT's
+		 * metric
		 */
-		load_above_capacity = (sds->busiest_nr_running -
+		load_above_busiest_capacity = (sds->busiest_nr_running -
						sds->busiest_group_capacity);
 
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
+		load_above_busiest_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 
-		load_above_capacity /= sds->busiest->sgp->power;
+		load_above_busiest_capacity /= sds->busiest->sgp->power;
	}
 
	/*
@@ -4932,11 +4937,16 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
	 * Be careful of negative numbers as they'll appear as very large values
	 * with unsigned longs.
	 */
-	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+	/*
+	 * The below maximum load to be pulled is based on the PJT's metric
+	 */
+	max_load_pull = min(sds->max_sg_load - sds->avg_sgs_load, load_above_busiest_capacity);
 
-	/* How much load to actually move to equalise the imbalance */
-	env->imbalance = min(max_pull * sds->busiest->sgp->power,
-		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
+	/* How much load to actually move to equalise the imbalance
+	 * Calculated using PJT's metric
+	 */
+	env->load_imbalance = min(max_load_pull * sds->busiest->sgp->power,
+		(sds->avg_sgs_load - sds->this_sg_load) * sds->this->sgp->power)
			/ SCHED_POWER_SCALE;
 
	/*
@@ -4945,7 +4955,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
	 * a think about bumping its value to force at least one task to be
	 * moved
	 */
-	if (env->imbalance < sds->busiest_load_per_task)
+	if (env->load_imbalance < sds->busiest_sg_load_per_task)
		return fix_small_imbalance(env, sds);
 
 }
Additional parameters which decide the busiest cpu in the chosen sched group, calculated using PJT's metric, are used.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fca6606..bb1c71b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5085,7 +5085,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long max_load = 0;
+	u64 max_cpu_load = 0;
 	int i;
 
 	for_each_cpu(i, sched_group_cpus(group)) {
@@ -5093,6 +5093,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		unsigned long capacity = DIV_ROUND_CLOSEST(power,
							   SCHED_POWER_SCALE);
 		unsigned long wl;
+		u64 runnable_load;/* Equivalent of wl,calculated using PJT's metric */
 
 		if (!capacity)
 			capacity = fix_small_capacity(env->sd, group);
@@ -5102,12 +5103,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 		rq = cpu_rq(i);
 		wl = weighted_cpuload(i);
+		runnable_load = cpu_rq(i)->cfs.runnable_load_avg;
 
 		/*
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
+		 * The below decision is based on PJT's metric
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (capacity && rq->nr_running == 1 && runnable_load > env->load_imbalance)
 			continue;
 
 		/*
@@ -5117,9 +5120,11 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * running at a lower capacity.
 		 */
 		wl = (wl * SCHED_POWER_SCALE) / power;
+		runnable_load = (runnable_load * SCHED_POWER_SCALE) / power;
 
-		if (wl > max_load) {
-			max_load = wl;
+		/* Below decision has been changed to use PJT's metric */
+		if (runnable_load > max_cpu_load) {
+			max_cpu_load = runnable_load;
 			busiest = rq;
 		}
 	}
Make decisions about which tasks to move to reduce the imbalance based on PJT's metric and the metrics dependent on it.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb1c71b..bd7b69d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,7 +3957,7 @@ static int move_tasks(struct lb_env *env)
 	unsigned long load;
 	int pulled = 0;
 
-	if (env->imbalance <= 0)
+	if (env->load_imbalance <= 0)
 		return 0;
 
 again:
@@ -3984,7 +3984,8 @@ again:
 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
 
-		if ((load / 2) > env->imbalance)
+		/* The below being changed to use the PJT's metric */
+		if ((load / 2) > env->load_imbalance)
 			goto next;
 
 		if (!can_migrate_task(p, env))
@@ -3992,7 +3993,8 @@ again:
 
 		move_task(p, env);
 		pulled++;
-		env->imbalance -= load;
+		/* Using PJT's metric */
+		env->load_imbalance -= load;
 
 #ifdef CONFIG_PREEMPT
 		/*
@@ -4007,8 +4009,9 @@ again:
 		/*
 		 * We only want to steal up to the prescribed amount of
 		 * weighted load.
+		 * But the below modification is to use PJT's metric
 		 */
-		if (env->imbalance <= 0)
+		if (env->load_imbalance <= 0)
 			goto out;
 
 		continue;
@@ -4145,7 +4148,8 @@ static inline void update_h_load(long cpu)
 
 static unsigned long task_h_load(struct task_struct *p)
 {
-	return p->se.load.weight;
+	/* The below is changed to use PJT's metric*/
+	return p->se.avg.load_avg_contrib;
 }
 #endif
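With this change the quantity subtracted from the imbalance for each migrated task is the task's tracked contribution (se.avg.load_avg_contrib) rather than its instantaneous weight, which keeps the units consistent with load_imbalance, itself derived from the runnable_load_avg sums introduced in the earlier patches.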
Modify certain decisions in load_balance to use the imbalance amount as calculated by PJT's metric.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd7b69d..68a6b1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5284,7 +5284,10 @@ more_balance:
 		 * moreover subsequent load balance cycles should correct the
 		 * excess load moved.
 		 */
-		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0 &&
+		/*
+		 * The following decision based on PJT's metric
+		 */
+		if ((env.flags & LBF_SOME_PINNED) && env.load_imbalance > 0 &&
 				lb_iterations++ < max_lb_iterations) {
 
 			env.dst_rq = cpu_rq(env.new_dst_cpu);
Make appropriate modifications in check_asym_packing to reflect PJT's metric.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68a6b1d..3b18f5f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4814,6 +4814,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
 
 	env->imbalance = DIV_ROUND_CLOSEST(
 		sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
+	env->load_imbalance = DIV_ROUND_CLOSEST(
+		sds->max_sg_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
 
 	return 1;
 }
Additional parameters which aid in taking the decisions in fix_small_imbalance, calculated using PJT's metric, are used.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b18f5f..a5affbc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2936,8 +2936,9 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
-	if (nr_running)
+	if (nr_running) {
 		return rq->load.weight / nr_running;
+	}
 
 	return 0;
 }
@@ -4830,27 +4831,38 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
 static inline
 void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long tmp, pwr_now = 0, pwr_move = 0;
+	/* Parameters introduced to use PJT's metrics */
+	u64 tmp, pwr_now = 0, pwr_move = 0;
 	unsigned int imbn = 2;
 	unsigned long scaled_busy_load_per_task;
+	u64 scaled_busy_sg_load_per_task; /* Parameter to use PJT's metric */
+	unsigned long nr_running = ACCESS_ONCE(cpu_rq(env->dst_cpu)->nr_running);
 
 	if (sds->this_nr_running) {
-		sds->this_load_per_task /= sds->this_nr_running;
-		if (sds->busiest_load_per_task >
-				sds->this_load_per_task)
+		sds->this_sg_load_per_task /= sds->this_nr_running;
+		if (sds->busiest_sg_load_per_task >
+				sds->this_sg_load_per_task)
 			imbn = 1;
 	} else {
-		sds->this_load_per_task =
-			cpu_avg_load_per_task(env->dst_cpu);
+		if (nr_running) {
+			sds->this_sg_load_per_task =
+				/* The below decision based on PJT's metric */
+				cpu_rq(env->dst_cpu)->cfs.runnable_load_avg / nr_running;
+		} else {
+			sds->this_sg_load_per_task = 0;
+		}
 	}
 
 	scaled_busy_load_per_task = sds->busiest_load_per_task
					 * SCHED_POWER_SCALE;
+	scaled_busy_sg_load_per_task = sds->busiest_sg_load_per_task
+					 * SCHED_POWER_SCALE;
 	scaled_busy_load_per_task /= sds->busiest->sgp->power;
+	scaled_busy_sg_load_per_task /= sds->busiest->sgp->power;
 
-	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
-			(scaled_busy_load_per_task * imbn)) {
-		env->imbalance = sds->busiest_load_per_task;
+	if (sds->max_sg_load - sds->this_sg_load + scaled_busy_sg_load_per_task >=
+			(scaled_busy_sg_load_per_task * imbn)) {
+		env->load_imbalance = sds->busiest_sg_load_per_task;
 		return;
 	}
 
@@ -4861,33 +4873,33 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
	 */
 
	pwr_now += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load);
+			min(sds->busiest_sg_load_per_task, sds->max_sg_load);
	pwr_now += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load);
+			min(sds->this_sg_load_per_task, sds->this_sg_load);
	pwr_now /= SCHED_POWER_SCALE;
 
	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+	tmp = (sds->busiest_sg_load_per_task * SCHED_POWER_SCALE) /
		sds->busiest->sgp->power;
-	if (sds->max_load > tmp)
+	if (sds->max_sg_load > tmp)
		pwr_move += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load - tmp);
+			min(sds->busiest_sg_load_per_task, sds->max_sg_load - tmp);
 
	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->sgp->power <
-		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->sgp->power) /
+	if (sds->max_sg_load * sds->busiest->sgp->power <
+		sds->busiest_sg_load_per_task * SCHED_POWER_SCALE)
+		tmp = (sds->max_sg_load * sds->busiest->sgp->power) /
			sds->this->sgp->power;
	else
-		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+		tmp = (sds->busiest_sg_load_per_task * SCHED_POWER_SCALE) /
			sds->this->sgp->power;
	pwr_move += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load + tmp);
+			min(sds->this_sg_load_per_task, sds->this_sg_load + tmp);
	pwr_move /= SCHED_POWER_SCALE;
 
	/* Move if we gain throughput */
	if (pwr_move > pwr_now)
-		env->imbalance = sds->busiest_load_per_task;
+		env->load_imbalance = sds->busiest_sg_load_per_task;
 }
 
 /**
Additional parameters, calculated using PJT's metric and its helpers, are introduced to perform this function.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5affbc..c64be1c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3173,11 +3173,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long this_load = 0;
+	u64 min_sg_load = ~0ULL, this_sg_load = 0;/* Helpers for PJT's metrics */
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
 	do {
 		unsigned long load, avg_load;
+		u64 avg_sg_load;/* Helpers for PJT's metrics */
 		int local_group;
 		int i;
 
@@ -3191,6 +3193,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		avg_sg_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -3200,20 +3203,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+			avg_sg_load += cpu_rq(i)->cfs.runnable_load_avg;
 		}
 
 		/* Adjust by relative CPU power of the group */
 		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
+		avg_sg_load = (avg_sg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_sg_load = avg_sg_load;
+		} else if (avg_sg_load < min_sg_load) {/* Decision changed to suit PJT's metric */
+			min_sg_load = avg_sg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || 100*this_sg_load < imbalance*min_sg_load)
 		return NULL;
 	return idlest;
 }
Additional parameters, calculated using PJT's metric and its helpers, are introduced to perform this function.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c64be1c1..15ec528 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3230,16 +3230,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 static int
 find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
-	unsigned long load, min_load = ULONG_MAX;
+	unsigned long load;
+	u64 cpu_load, min_cpu_load = ~0ULL;
 	int idlest = -1;
 	int i;
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		load = weighted_cpuload(i);
+		cpu_load = cpu_rq(i)->cfs.runnable_load_avg;
 
-		if (load < min_load || (load == min_load && i == this_cpu)) {
-			min_load = load;
+		if (cpu_load < min_cpu_load || (cpu_load == min_cpu_load && i == this_cpu)) {
+			min_cpu_load = cpu_load;
 			idlest = i;
 		}
 	}
Additional parameters, calculated using PJT's metric and its helpers, are introduced to perform this function.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 34 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15ec528..b4b572c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2931,19 +2931,6 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
-static unsigned long cpu_avg_load_per_task(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
-
-	if (nr_running) {
-		return rq->load.weight / nr_running;
-	}
-
-	return 0;
-}
-
-
 static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -3085,16 +3072,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
+	u64 tl_per_task; /* Modified to reflect PJT's metric */
 	struct task_group *tg;
-	unsigned long weight;
+	unsigned long weight, nr_running;
 	int balanced;
 
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
-	load = source_load(prev_cpu, idx);
-	this_load = target_load(this_cpu, idx);
+	/* Both of the below have been modified to use PJT's metric */
+	load = cpu_rq(prev_cpu)->cfs.runnable_load_avg;
+	this_load = cpu_rq(this_cpu)->cfs.runnable_load_avg;
+	nr_running = cpu_rq(this_cpu)->nr_running;
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -3104,6 +3093,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	if (sync) {
 		tg = task_group(current);
 		weight = current->se.load.weight;
+		weight = current->se.avg.load_avg_contrib;
 
 		this_load += effective_load(tg, this_cpu, -weight, -weight);
 		load += effective_load(tg, prev_cpu, 0, -weight);
@@ -3111,6 +3101,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	tg = task_group(p);
 	weight = p->se.load.weight;
+	/* The below change to reflect PJT's metric */
+	weight = p->se.avg.load_avg_contrib;
 
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -3146,11 +3138,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		return 1;
 
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
+	/* Below modification to use PJT's metric */
+	if (nr_running)
+		tl_per_task = cpu_rq(this_cpu)->cfs.runnable_load_avg / nr_running;
+	else
+		tl_per_task = 0;
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+	     this_load + cpu_rq(prev_cpu)->cfs.runnable_load_avg <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
The benchmark:

/*
 * test.c - Simulate workloads that load the CPU differently
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 */

/*
 * This workload spawns threads which request allocation of a memory
 * chunk, write to it and free it. The duty cycle of these threads can
 * be varied. The idea is to simulate tasks which load the cpu to
 * different extents.
 */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "malloc.h"

/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;

typedef size_t mem_slot_t;		/* 8 bytes */
static unsigned int slot_size = sizeof(mem_slot_t);

/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned int write_to_mem(void)
{
	int i, j;
	mem_slot_t *scratch_pad, *temp;
	mem_slot_t *end;

	mem_chunk_size = slot_size * 256;

	/* The below two parameters ensure that it is a 10% workload
	 * with a duty cycle of 10ms. The number of records read in
	 * 1s without sleep was observed and appropriately calculated
	 * for 1ms. This number turned out to be 1228.
	 */
	sleep_at = 1228;		/* sleep for every 1228 records */
	sleep_interval = 9000;		/* sleep for 9 ms */

	for (i = 0; start == 1; i++) {
		scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
		if (scratch_pad == NULL) {
			fprintf(stderr, "Could not allocate memory\n");
			exit(1);
		}
		end = scratch_pad + (mem_chunk_size / slot_size);
		for (temp = scratch_pad, j = 0; temp < end; temp++, j++)
			*temp = (mem_slot_t)j;

		free(scratch_pad);

		if (sleep_at && !(i % sleep_at))
			usleep(sleep_interval);
	}
	return (i);
}

static void *thread_run(void *arg)
{
	unsigned int records_local;

	/* Wait for the start signal */
	while (start == 0);

	records_local = write_to_mem();

	pthread_mutex_lock(&records_count_lock);
	records_read += records_local;
	pthread_mutex_unlock(&records_count_lock);

	return NULL;
}

static void start_threads()
{
	double diff_time;
	unsigned int i;
	int err;

	threads = 8;
	seconds = 10;

	pthread_t thread_array[threads];

	for (i = 0; i < threads; i++) {
		err = pthread_create(&thread_array[i], NULL, thread_run, NULL);
		if (err) {
			fprintf(stderr, "Error creating thread %d\n", i);
			exit(1);
		}
	}
	start_time = time(NULL);
	start = 1;
	sleep(seconds);
	start = 0;
	diff_time = difftime(time(NULL), start_time);

	for (i = 0; i < threads; i++) {
		err = pthread_join(thread_array[i], NULL);
		if (err) {
			fprintf(stderr, "Error joining thread %d\n", i);
			exit(1);
		}
	}
	printf("%u records/s\n",
	       (unsigned int)(((double)records_read) / diff_time));
}

int main()
{
	start_threads();
	return 0;
}
Regards Preeti U Murthy
OK, so I tried reading a few patches and I'm completely failing.. maybe its late and my brain stopped working, but it simply doesn't make any sense.
Most changelogs and comments aren't really helping either. At best they mention what you're doing, not why and how. This means I get to basically duplicate your entire thought pattern and I might as well do the patches myself.
I also don't see the 'big' picture of what you're doing, you start out by some weird avoid short running task movement.. why is that a good start?
I would have expected a series like this to replace rq->cpu_load / rq->load with a saner metric and go from there.. instead it looks like its going about at things completely backwards. Fixing small details instead of the big things.
Do explain..
Hi Peter,

Thank you very much for your feedback.
On 10/25/2012 09:26 PM, Peter Zijlstra wrote:
OK, so I tried reading a few patches and I'm completely failing.. maybe its late and my brain stopped working, but it simply doesn't make any sense.
Most changelogs and comments aren't really helping either. At best they mention what you're doing, not why and how. This means I get to basically duplicate your entire thought pattern and I might as well do the patches myself.
I also don't see the 'big' picture of what you're doing, you start out by some weird avoid short running task movement.. why is that a good start?
I would have expected a series like this to replace rq->cpu_load / rq->load with a saner metric and go from there.. instead it looks like its going about at things completely backwards. Fixing small details instead of the big things.
Do explain..
Let me see if I get what you are saying here right. You want to replace, for example, cfs_rq->load.weight with a saner metric because it does not consider the run time of the sched entities queued on it, merely their priorities. If this is right, then in this patchset I believe cfs_rq->runnable_load_avg would be that right metric, because it considers the run time of the sched entities queued on it.
So why didn't I replace it? I added cfs_rq->runnable_load_avg as an additional metric instead of replacing the older metric. I let the old metric be as a dead metric and used the newer metric as an alternative, so if this alternate metric does not do us good we have the old metric to fall back on.
At best they mention what you're doing, not why and how.
So the above explains *what* I am doing.
*How* am I doing it: everywhere the scheduler needs to make a decision on:
a. find_busiest_group/find_idlest_group/update_sg_lb_stats: use the sum of cfs_rq->runnable_load_avg to decide this instead of the sum of cfs_rq->load.weight.
b. find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg to decide this instead of cfs_rq->load.weight.
c. move_tasks: use se->avg.load_avg_contrib to decide the weight of each task instead of se->load.weight, as the former reflects the run time of the sched entity and hence its actual load.
This is what my patches 3-13 do. Merely *replace*.
*Why am I doing it*: to feed the load balancer with a more realistic metric for load balancing, and to observe the consequences.
you start out by some weird avoid short running task movement. why is that a good start?
The short running tasks are not really burdening you, they have very little run time, so why move them? Throughout, the concept of load balancing focuses on *balancing the existing load* between the sched groups, but not really on evaluating the *absolute load* of any given sched group.
Why is this *the start*? This is like a round of elimination before the actual interview round: try to have only those sched groups as candidates for load balancing that are sufficiently loaded [Patch1]. This *sufficiently loaded* is decided by the new metric explained in the *How* above.
I also don't see the 'big' picture of what you're doing
Patch1 is explained above. *End result: good candidates for lb.*

Patch2:

10% 10% 10%    100%
------         ------
SCHED_GP1      SCHED_GP2
Before Patch                 After Patch
-----------------------------------------
SCHED_GP1 load: 3072         SCHED_GP1: 512
SCHED_GP2 load: 1024         SCHED_GP2: 1024

SCHED_GP1 is busiest         SCHED_GP2 is busiest
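(The before-patch numbers are the summed instantaneous weights: 3 * 1024 = 3072 for the three 10% tasks and 1024 for the single 100% task. The after-patch numbers are the tracked averages, under which the three mostly-sleeping tasks add up to only about 512 while the always-running task stays at about 1024.)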
But Patch2 re-decides between GP1 and GP2 to check whether the latency of tasks is getting affected even though there is less load on GP1. If yes, it is a better *busy* group.
*End Result: Better candidates for lb*
Rest of the patches: now that we have our busy sched group, let us load balance with the aid of the new metric. *End result: hopefully a more sensible movement of loads.* This is how I build the picture.
Regards Preeti U Murthy
On Thu, 2012-10-25 at 23:42 +0530, Preeti U Murthy wrote:
Do explain..
Let me see if I get what you are saying here right.You want to replace for example cfs_rq->load.weight with a saner metric because it does not consider the run time of the sched entities queued on it,merely their priorities.If this is right,in this patchset I believe cfs_rq->runnable_load_avg would be that right metric because it considers the run time of the sched entities queued on it.
firstly, cfs_rq is the wrong place for a per-cpu load measure, secondly why add another load field instead of fixing the one we have?
So why didnt I replace? I added cfs_rq->runnable_load_avg as an additional metric instead of replacing the older metric.I let the old metric be as a dead metric and used the newer metric as an alternative.so if this alternate metric does not do us good we have the old metric to fall back on.
That's wrong.. either it works and we can apply the patches or it doesn't and we won't. What you did makes it very hard to see you preserve the current balancer -- which afaict you don't, you change the balancer with the very first patch.
Why not update weighted_cpuload(), update_idle_cpu_load() and update_cpu_load_active() to use another metric and go from there. If you do that the whole balancer will auto-magically use the new weight measure.
Once you have that running, you can look at modifying it.
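As a rough illustration of this suggestion (a sketch only, not a posted patch; weighted_cpuload() is assumed here to currently return the runqueue's instantaneous load.weight):

/* sketch: re-point the existing per-cpu load read at the tracked average,
 * so balancer paths built on weighted_cpuload()/source_load()/target_load()
 * pick up the new measure without touching their internals */
static unsigned long weighted_cpuload(const int cpu)
{
	/* was: return cpu_rq(cpu)->load.weight; */
	return cpu_rq(cpu)->cfs.runnable_load_avg;
}

Together with similar treatment of the cpu_load[] updaters named above, this would switch the balancer's input in one place instead of at each decision site.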
At best they mention what you're doing, not why and how.
So the above explains *what* I am doing.
*How* am i doing it: Everywhere the scheduler needs to make a decision on:
a.find_busiest_group/find_idlest_group/update_sg_lb_stats:use sum of cfs_rq->runnable_load_avg to decide this instead of sum of cfs_rq->load.weight.
But the first patches are about adding new balancing conditions, that's a complete fail..
b.find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg to decide this instead of cfs_rq->load.weight
See, if you would have changed the input function (weighted_cpuload), you wouldn't have needed to touch any of this.
c.move_tasks: use se->avg.load_avg_contrib to decide the weight of of each task instead of se->load.weight as the former reflects the runtime of the sched entity and hence its actual load.
The changelog in that patch (7) is completely devoid of any useful information.
This is what my patches3-13 do.Merely *Replace*.
*Why am I doing it*: Feed the load balancer with a more realistic metric to do load balancing and observe the consequences.
I know why you're doing the entire thing, but you fail to motivate each individual change. A changelog should read something like:
current code does blah, this has a problem when blah, we can improve this doing blah because blah.
you start out by some weird avoid short running task movement. why is that a good start?
The short running tasks are not really burdening you,they have very little run time,so why move them?
The most important part is why this would be a good start to the series, it is not.
The patch is also dubious at best; short running might be all you have, your definition of 'short' is also iffy.
Throughout the concept of load balancing the focus is on *balancing the *existing* load* between the sched groups.But not really evaluating the *absolute load* of any given sched group.
Which is why you're going to change the metrics.. the balancer really only cares about making load equal, flipping the metric of the load doesn't change anything fundamental.
Why is this *the start*? This is like a round of elimination before the actual interview round Try to have only those sched groups as candidates for load balancing that are sufficiently loaded.[Patch1] This *sufficiently loaded* is decided by the new metric explained in the *How* above.
No, this is absolutely wrong.
So a sane series would introduce maybe two functions: cpu_load() and task_load() and use those where we now use rq->load.weight and p->se.load.weight for load balancing purposes. Implement these functions using those two expression. So effectively this patch is a NOP.
Secondly, switch these two functions over to the per-task based averages.
Tada! all done. The load balancer will then try and equalize effective load instead of instant load.
It will do the 3x10% vs 100% thing correctly with just those two patches. Simply because it will report a lower cpu-load for the 3x10% case than it will for the 100% case, no need to go fudge about in the load-balance internals.
Once you've got this correctly done, you can go change balancing to better utilize the new metric, like use the effective load instead of nr_running against the capacity and things like that. But for every such change you want to be very careful and run all the benchmarks you can find -- in fact you want to do that after the 2nd patch too.
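For concreteness, a sketch of what those two helpers might look like (a hypothetical illustration only; the helper names come from the suggestion above, and the fields are the ones used throughout this series):

/* Step 1 (a NOP): route all load-balance reads through two helpers. */
static inline unsigned long cpu_load(int cpu)
{
	return cpu_rq(cpu)->load.weight;	/* what the balancer reads today */
}

static inline unsigned long task_load(struct task_struct *p)
{
	return p->se.load.weight;
}

/*
 * Step 2: re-implement the same two helpers using the tracked averages,
 * i.e. cpu_rq(cpu)->cfs.runnable_load_avg and p->se.avg.load_avg_contrib,
 * so the whole balancer switches to effective load in one place.
 */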
* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
[...]
So a sane series would introduce maybe two functions: cpu_load() and task_load() and use those where we now use rq->load.weight and p->se.load.weight for load balancing purposes. Implement these functions using those two expression. So effectively this patch is a NOP.
Secondly, switch these two functions over to the per-task based averages.
Tada! all done. The load balancer will then try and equalize effective load instead of instant load.
It will do the 3x10% vs 100% thing correctly with just those two patches. Simply because it will report a lower cpu-load for the 3x10% case than it will for the 100% case, no need to go fudge about in the load-balance internals.
Once you've got this correctly done, you can go change balancing to better utilize the new metric, like use the effective load instead of nr_running against the capacity and things like that. But for every such change you want to be very careful and run all the benchmarks you can find -- in fact you want to do that after the 2nd patch too.
If anyone posted that simple two-patch series that switches over to the new load metrics I'd be happy to test the performance of those.
Having two parallel load metrics is really not something that we should tolerate for too long.
Thanks,
Ingo
On 10/26/2012 06:37 PM, Ingo Molnar wrote:
* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
[...]
So a sane series would introduce maybe two functions: cpu_load() and task_load() and use those where we now use rq->load.weight and p->se.load.weight for load balancing purposes. Implement these functions using those two expression. So effectively this patch is a NOP.
Secondly, switch these two functions over to the per-task based averages.
Tada! all done. The load balancer will then try and equalize effective load instead of instant load.
It will do the 3x10% vs 100% thing correctly with just those two patches. Simply because it will report a lower cpu-load for the 3x10% case than it will for the 100% case, no need to go fudge about in the load-balance internals.
Once you've got this correctly done, you can go change balancing to better utilize the new metric, like use the effective load instead of nr_running against the capacity and things like that. But for every such change you want to be very careful and run all the benchmarks you can find -- in fact you want to do that after the 2nd patch too.
If anyone posted that simple two-patch series that switches over to the new load metrics I'd be happy to test the performance of those.
Having two parallel load metrics is really not something that we should tolerate for too long.
Thanks,
Ingo
Right, Ingo. I will incorporate this approach and post it out very soon.
Thank you
Regards Preeti
On 10/26/2012 05:59 PM, Peter Zijlstra wrote:
On Thu, 2012-10-25 at 23:42 +0530, Preeti U Murthy wrote:
firstly, cfs_rq is the wrong place for a per-cpu load measure, secondly why add another load field instead of fixing the one we have?
Hmm, rq->load.weight is the place.
So why didnt I replace? I added cfs_rq->runnable_load_avg as an additional metric instead of replacing the older metric.I let the old metric be as a dead metric and used the newer metric as an alternative.so if this alternate metric does not do us good we have the old metric to fall back on.
That's wrong.. either it works and we can apply the patches or it doesn't and we won't. What you did makes it very hard to see you preserve the current balancer -- which afaict you don't, you change the balancer with the very first patch.
You are right on this Peter.
Why not update weighted_cpuload(), update_idle_cpu_load() and update_cpu_load_active() to use another metric and go from there. If you do that the whole balancer will auto-magically use the new weight measure.
Once you have that running, you can look at modifying it.
Hmm...Correct.
a.find_busiest_group/find_idlest_group/update_sg_lb_stats:use sum of cfs_rq->runnable_load_avg to decide this instead of sum of cfs_rq->load.weight.
But the first patches are about adding new balancing conditions, that's a complete fail..
b.find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg to decide this instead of cfs_rq->load.weight
See, if you would have changed the input function (weighted_cpuload), you wouldn't have needed to touch any of this.
I see your point.
c.move_tasks: use se->avg.load_avg_contrib to decide the weight of of each task instead of se->load.weight as the former reflects the runtime of the sched entity and hence its actual load.
The changelog in that patch (7) is completely devoid of any useful information.
This is what my patches3-13 do.Merely *Replace*.
*Why am I doing it*: Feed the load balancer with a more realistic metric to do load balancing and observe the consequences.
I know why you're doing the entire thing, but you fail to motivate each individual change. A changelog should read something like:
current code does blah, this has a problem when blah, we can improve this doing blah because blah.
Ah! I get it.
you start out by some weird avoid short running task movement. why is that a good start?
The short running tasks are not really burdening you,they have very little run time,so why move them?
The most important part is why this would be a good start to the series, it is not.
The patch is also dubious at best; short running might be all you have, your definition of 'short' is also iffy.
My definition of 'short' was bursty loads. What I observed from using the new metric to study the effective load calculation was that when there are around 2-3 such bursty loads, the effective load stays below the threshold that I have stated, and I thought this would be a good enough excuse to let the loads stay on the cpu.
Bursty being a load that sleeps for 9ms every 10ms, for a duration of 10s (a 10% load), in my experiments.
Throughout the concept of load balancing the focus is on *balancing the *existing* load* between the sched groups.But not really evaluating the *absolute load* of any given sched group.
Which is why you're going to change the metrics.. the balancer really only cares about making load equal, flipping the metric of the load doesn't change anything fundamental.
Ok.
Why is this *the start*? This is like a round of elimination before the actual interview round Try to have only those sched groups as candidates for load balancing that are sufficiently loaded.[Patch1] This *sufficiently loaded* is decided by the new metric explained in the *How* above.
No, this is absolutely wrong.
So a sane series would introduce maybe two functions: cpu_load() and task_load() and use those where we now use rq->load.weight and p->se.load.weight for load balancing purposes. Implement these functions using those two expression. So effectively this patch is a NOP.
Secondly, switch these two functions over to the per-task based averages.
Tada! all done. The load balancer will then try and equalize effective load instead of instant load.
It will do the 3x10% vs 100% thing correctly with just those two patches. Simply because it will report a lower cpu-load for the 3x10% case than it will for the 100% case, no need to go fudge about in the load-balance internals.
Once you've got this correctly done, you can go change balancing to better utilize the new metric, like use the effective load instead of nr_running against the capacity and things like that. But for every such change you want to be very careful and run all the benchmarks you can find -- in fact you want to do that after the 2nd patch too.
I agree with this entire suggestion. Perhaps I was hesitant about introducing the new metric into the scheduler, which led to this two-faced approach. I will try out the approach suggested and post the results very soon.
Thank you very much for such comprehensive and helpful feedback!
Regards Preeti U Murthy