This patch series improves load balancing behaviour for misfit tasks. The current code introduces the group type 'group_misfit_task' to indicate that a sched group has a misfit task, but several barriers still need to be cleared before the misfit task can actually be migrated onto a higher capacity CPU.

The first patch corrects task_fits_max() so it can properly filter out misfit tasks on low capacity CPUs. Without this patch, the function can end up always returning true, so the misfit task mechanism is never triggered at all.

The second patch fixes function group_smaller_cpu_capacity(), so we can make sure a sched group of type 'group_misfit_task' is not wrongly rolled back to type 'group_other', which would discard all misfit related info.

The third patch fixes nr_running accounting; without it the scheduler wrongly considers the destination CPU to have a running task and skips migrating a task onto it. The patch reports that the destination CPU has no running task when it is going into the idle state, so the misfit task can be migrated during this idle balance.

The fourth patch is a temporary patch for kernels that have not backported Vincent's series "sched: reflect sched_entity move into task_group's load" [1]. Without that series, it's possible that a CPU is not overutilized even though a misfit task has been enqueued on it; so we set sgs->group_misfit_task by checking rq->misfit_task instead of relying on whether the CPU is overutilized.

The fifth patch selects the busiest rq based on whether the rq has a misfit task; such an rq gets higher priority than the rq with the highest weighted load. This criterion is only enabled for energy aware scheduling.

The sixth patch aggressively kicks active load balance for a misfit task, so a higher capacity CPU has a good chance to pull the misfit task onto itself immediately.
[1] https://lkml.org/lkml/2016/10/17/223
Leo Yan (6):
  sched/fair: correct task_fits_max() for misfit task
  sched/fair: fix for group_smaller_cpu_capacity()
  sched/fair: fix nr_running accounting for new idle CPU
  sched/fair: fix to set sgs->group_misfit_task
  sched/fair: select busiest rq with misfit task
  sched/fair: kick active load balance for misfit task
 kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 46 insertions(+), 13 deletions(-)
-- 2.7.4
Function task_fits_max() checks whether the CPU's maximum capacity can meet the task's requirement. It obtains the CPU capacity using capacity_of(), which is the CPU's maximum capacity minus the bandwidth occupied by RT threads; so capacity_of() can be somewhat less than the CPU's original capacity when RT threads run on it. To compensate for RT threads, task_fits_max() uses a more relaxed condition to check whether the CPU can meet the task's requirement:
if (capacity * capacity_margin > max_capacity * 1024) return true;
This condition gives unexpected results when the two clusters have only a small capacity difference, since even the lower capacity cluster can easily meet the criterion. All CPUs then satisfy the condition, and the check is always useless for misfit task detection.
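As a standalone illustration (not kernel code: capacity_margin is assumed to be 1280 here, i.e. ~25% headroom, and the cluster capacities below are made-up examples), the check passes for any CPU whose capacity is at least 80% of the system maximum:

```c
#include <assert.h>
#include <stdbool.h>

#define CAPACITY_MARGIN       1280 /* assumed margin, ~1.25x headroom */
#define SCHED_CAPACITY_SCALE  1024

/* Standalone model of the old margin check in task_fits_max():
 * true whenever capacity > max_capacity / 1.25, i.e. >= 80% of max. */
static bool margin_check(unsigned long capacity, unsigned long max_capacity)
{
	return capacity * CAPACITY_MARGIN > max_capacity * SCHED_CAPACITY_SCALE;
}
```

With two close clusters (e.g. 880 vs 1024) every CPU passes the check and the detailed misfit test is never reached; with a classic big.LITTLE gap (e.g. 446 vs 1024) the check correctly fails for the LITTLE CPU.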
This patch goes back to using capacity_orig_of() to get the CPU capacity; if the CPU is the highest capacity CPU, task_fits_max() always returns true, otherwise it runs the detailed check. This also covers the case where the CPU has RT threads on it.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b1f65b..2ae55f6 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5074,15 +5074,12 @@ static inline bool __task_fits(struct task_struct *p, int cpu, int util)

 static inline bool task_fits_max(struct task_struct *p, int cpu)
 {
-	unsigned long capacity = capacity_of(cpu);
+	unsigned long capacity = capacity_orig_of(cpu);
 	unsigned long max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity.val;

 	if (capacity == max_capacity)
 		return true;

-	if (capacity * capacity_margin > max_capacity * 1024)
-		return true;
-
 	return __task_fits(p, cpu, 0);
 }
-- 2.7.4
Function group_smaller_cpu_capacity() checks whether one sched group has smaller capacity than another:
return sg->sgc->max_capacity + capacity_margin - SCHED_LOAD_SCALE < ref->sgc->max_capacity;
The value (capacity_margin - SCHED_LOAD_SCALE) is an absolute offset used in the comparison, so the check easily breaks when the two sched groups differ only slightly in capacity (e.g. a CA53.Fast + CA53.Slow system).

When this check gives the wrong answer, the sched group with misfit tasks wrongly has its flag cleared, so the misfit task gets no chance to migrate to a higher capacity CPU.

This patch directly compares the sched groups' max_capacity values, with a minor fix to assign max_capacity from the original capacity.
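The difference between the old and new checks can be sketched as standalone code (a simplified model with made-up capacities; capacity_margin = 1280 and SCHED_LOAD_SCALE = 1024 are the values assumed by this series):

```c
#include <assert.h>
#include <stdbool.h>

#define CAPACITY_MARGIN  1280
#define SCHED_LOAD_SCALE 1024

/* Old check: requires an absolute gap of (1280 - 1024) = 256 capacity
 * units between the two groups before one counts as "smaller". */
static bool old_smaller(unsigned long sg_max, unsigned long ref_max)
{
	return sg_max + CAPACITY_MARGIN - SCHED_LOAD_SCALE < ref_max;
}

/* New check: a direct comparison of the groups' maximum capacities. */
static bool new_smaller(unsigned long sg_max, unsigned long ref_max)
{
	return sg_max < ref_max;
}
```

For two close clusters (880 vs 1024, a 144-unit gap) the old check wrongly reports "not smaller", while the new one gets it right; for a large gap (446 vs 1024) both agree.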
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ae55f6..f5fb04f 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6923,6 +6923,8 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	raw_spin_unlock_irqrestore(&mcc->lock, flags);

 skip_unlock: __attribute__ ((unused));
+	sdg->sgc->max_capacity = capacity;
+
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;

@@ -6931,7 +6933,6 @@ skip_unlock: __attribute__ ((unused));

 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
-	sdg->sgc->max_capacity = capacity;
 }

 void update_group_capacity(struct sched_domain *sd, int cpu)
@@ -7103,8 +7104,7 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 static inline bool
 group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-	return sg->sgc->max_capacity + capacity_margin - SCHED_LOAD_SCALE <
-						ref->sgc->max_capacity;
+	return sg->sgc->max_capacity < ref->sgc->max_capacity;
 }

 static enum group_type group_classify(struct lb_env *env,
-- 
2.7.4
When a newly idle CPU executes idle balance, the idle thread has not actually been switched in yet. The current thread is a normal task which is about to give up the CPU, which is why the CPU is trying to pull a task onto itself.

But at this moment rq->h_nr_running still accounts for this normal thread; this makes the scheduler believe the CPU has one running task and adds it into the sched group's sum of running tasks.

In the end, group_has_capacity() compares the number of running tasks with the number of CPUs; unfortunately, if all other CPUs have genuinely running tasks, the group is considered to have no spare capacity and the scheduler skips migrating any misfit task from another sched group in the same sched domain.

This patch fixes nr_running accounting for a newly idle CPU: when inspecting the newly idle CPU, its running task is not accounted into the sched group.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5fb04f..6ebf7c7 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7154,9 +7154,22 @@ static inline void update_sg_lb_stats(struct lb_env *env,

 		sgs->group_load += load;
 		sgs->group_util += cpu_util(i);
-		sgs->sum_nr_running += rq->cfs.h_nr_running;

-		nr_running = rq->nr_running;
+		/*
+		 * If destination CPU is one new idle CPU, that means current
+		 * task is occupying CPU so h_nr_running = 1 but in fact this
+		 * task is going to release CPU for idle balance.
+		 *
+		 * Here should not account this task into running number, so
+		 * give more chance for task migration onto this idle CPU.
+		 */
+		if (env->idle == CPU_NEWLY_IDLE && env->dst_cpu == i)
+			nr_running = 0;
+		else {
+			sgs->sum_nr_running += rq->cfs.h_nr_running;
+			nr_running = rq->nr_running;
+		}
+
 		if (nr_running > 1)
 			*overload = true;
-- 2.7.4
On 22 December 2016 at 16:58, Leo Yan leo.yan@linaro.org wrote:
When a new idle CPU executes idle balance, the idle swap thread has not been switched in actually. The current thread is a normal task and this task is going to not occupy the CPU anymore so the CPU is seeking to pull task onto it.
But at this moment rq->h_nr_running still adds accounts for this normal thread; this gives scheduler misunderstanding the CPU has one running task on it and finally adds it into sum running number of schedule group.
Are you sure of the point above? I'm pretty sure that in the mainline scheduler the task has already been dequeued and cfs->h_nr_running and rq->nr_running have already been decreased when newly idle load balance is called, so they are zero.
At the end, function group_has_capacity() compare the running task number with CPU number, and unfortunately if all other CPUs have real running tasks then the group is considered as no spare 'capacity' and skip migrate any misfit task from another schedule group in the same schedule domain.
This patch is to fix nu_running accounting for new idle CPU, when checks the new idle CPU it doesn't account the running number into schedule group.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f5fb04f..6ebf7c7 100755 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7154,9 +7154,22 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load; sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
nr_running = rq->nr_running;
/*
* If destination CPU is one new idle CPU, that means current
* task is occupying CPU so h_nr_running = 1 but in fact this
* task is going to release CPU for idle balance.
*
* Here should not account this task into running number, so
* give more chance for task migration onto this idle CPU.
*/
if (env->idle == CPU_NEWLY_IDLE && env->dst_cpu == i)
nr_running = 0;
else {
sgs->sum_nr_running += rq->cfs.h_nr_running;
nr_running = rq->nr_running;
}
if (nr_running > 1) *overload = true;
-- 2.7.4
Hi Vincent,
On Thu, Dec 22, 2016 at 05:56:47PM +0100, Vincent Guittot wrote:
On 22 December 2016 at 16:58, Leo Yan leo.yan@linaro.org wrote:
When a new idle CPU executes idle balance, the idle swap thread has not been switched in actually. The current thread is a normal task and this task is going to not occupy the CPU anymore so the CPU is seeking to pull task onto it.
But at this moment rq->h_nr_running still adds accounts for this normal thread; this gives scheduler misunderstanding the CPU has one running task on it and finally adds it into sum running number of schedule group.
Are you sure of the point above ? I'm pretty sure that in the mainline scheduler the task has already been dequeued and cfs->h_nr_running and rq->nr_running have been decreased when newly idle load balance is called so their are null
Ah, you are right :) I reviewed the code and verified in the trace log; I found I had misunderstood the code.
This patch is useless and we should drop it.
Thanks, Leo Yan
In the current code, two conditions must be met before sgs->group_misfit_task is set:

- Condition 1: the CPU is overutilized;
- Condition 2: rq->misfit_task is set.

But there is a corner case: the CPU's utilization signal has not yet risen over the tipping point, while a big task has just been migrated onto this CPU, so the rq->misfit_task flag is set. In this case we miss setting sgs->group_misfit_task, because condition 1 can take a long time to become true.

This patch directly checks whether rq->misfit_task is set and, if so, directly sets sgs->group_misfit_task, giving the misfit task more chance to migrate.
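The classification change can be sketched as standalone code (a simplified model, not the kernel implementation; 446 is a made-up LITTLE CPU capacity):

```c
#include <assert.h>
#include <stdbool.h>

/* Before the patch: the misfit capacity is only recorded when the
 * CPU is also overutilized. */
static unsigned long misfit_before(bool overutilized, bool rq_misfit,
				   unsigned long cpu_capacity)
{
	if (overutilized && rq_misfit)
		return cpu_capacity;
	return 0;
}

/* After the patch: rq->misfit_task alone is enough. */
static unsigned long misfit_after(bool rq_misfit, unsigned long cpu_capacity)
{
	return rq_misfit ? cpu_capacity : 0;
}
```

In the corner case (misfit task just migrated in, utilization signal not yet over the tipping point) the old logic returns 0 and the group is never classified as group_misfit_task.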
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ebf7c7..ed9fbed 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7184,11 +7184,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;

-		if (cpu_overutilized(i)) {
+		if (cpu_overutilized(i))
 			*overutilized = true;
-			if (!sgs->group_misfit_task && rq->misfit_task)
-				sgs->group_misfit_task = capacity_of(i);
-		}
+
+		if (!sgs->group_misfit_task && rq->misfit_task)
+			sgs->group_misfit_task = capacity_of(i);
 	}

 	/* Adjust by relative CPU capacity of the group */
-- 
2.7.4
Current code selects the busiest rq mostly from the weighted load point of view; it's therefore possible to select the CPU with the largest weighted load even when that load is contributed by several small tasks, while another CPU with a misfit task gets no chance to migrate its task to a higher capacity CPU.

This patch adds one more check when selecting the busiest rq: if an rq has a single misfit task and it's possible to migrate that task from a lower capacity CPU to a higher capacity CPU, that rq is preferred.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed9fbed..1cf0e37 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7777,6 +7777,24 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;

 		/*
+		 * After enable energy awared scheduling, it has higher
+		 * priority to migrate misfit task rather than from most
+		 * loaded CPU; E.g. one CPU with single misfit task and
+		 * other CPUs with multiple lower load tasks, we should
+		 * firstly make sure the misfit task can be migrated onto
+		 * higher capacity CPU.
+		 */
+		if (energy_aware() &&
+		    capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) &&
+		    rq->cfs.h_nr_running == 1 && rq->misfit_task &&
+		    env->busiest_group_type == group_misfit_task) {
+			busiest_load = wl;
+			busiest_capacity = capacity;
+			busiest = rq;
+			break;
+		}
+
+		/*
 		 * For the load comparisons with the other cpu's, consider
 		 * the weighted_cpuload() scaled with the cpu capacity, so
 		 * that the load can be moved away from the cpu that is
-- 
2.7.4
On Thu, Dec 22, 2016 at 11:58:50PM +0800, Leo Yan wrote:
Current code select busiest rq mostly consider only from weighted load this point of view; so it's possible to select one CPU with most weighted load value but this load are contributes by some small load tasks, on the other hand there have one another CPU with misfit task but it may have no chance to migrate task to higher capacity CPU.
This patch is to add one more checking for selection busiest rq if find only one misfit task on the rq and it's possible to migrate task from lower capacity CPU to higher capacity CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ed9fbed..1cf0e37 100755 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7777,6 +7777,24 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
/*
* After enable energy awared scheduling, it has higher
* priority to migrate misfit task rather than from most
* loaded CPU; E.g. one CPU with single misfit task and
* other CPUs with multiple lower load tasks, we should
* firstly make sure the misfit task can be migrated onto
* higher capacity CPU.
*/
if (energy_aware() &&
capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) &&
Do you want to check capacity_of() as well as capacity_orig_of()?
Thanks, Joonwoo
On Mon, Jan 23, 2017 at 06:48:54PM -0800, Joonwoo Park wrote:
On Thu, Dec 22, 2016 at 11:58:50PM +0800, Leo Yan wrote:
Current code select busiest rq mostly consider only from weighted load this point of view; so it's possible to select one CPU with most weighted load value but this load are contributes by some small load tasks, on the other hand there have one another CPU with misfit task but it may have no chance to migrate task to higher capacity CPU.
This patch is to add one more checking for selection busiest rq if find only one misfit task on the rq and it's possible to migrate task from lower capacity CPU to higher capacity CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ed9fbed..1cf0e37 100755 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7777,6 +7777,24 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
/*
* After enable energy awared scheduling, it has higher
* priority to migrate misfit task rather than from most
* loaded CPU; E.g. one CPU with single misfit task and
* other CPUs with multiple lower load tasks, we should
* firstly make sure the misfit task can be migrated onto
* higher capacity CPU.
*/
if (energy_aware() &&
capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) &&
Do you want to check capacity_of() as well as capacity_orig_of()?
Here capacity_orig_of() is used to check that the src CPU and dst CPU have different 'original' capacities.

If we checked capacity_of(), the task could be migrated to a CPU within the same cluster: e.g. the src CPU is a LITTLE CPU with one RT thread and one misfit task, and the dst CPU is a LITTLE CPU with no RT thread. Then 'capacity_of(src_cpu) < capacity_of(dst_cpu)' holds and the misfit task is migrated to another LITTLE CPU, which is pointless for a misfit task.

Please let me know if I missed anything here.
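The same-cluster example can be sketched with made-up numbers (two LITTLE CPUs with original capacity 446, one of them losing capacity to an RT thread so its capacity_of() drops to 300; big CPU original capacity 1024 — all values illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* The "can the dst CPU offer more?" comparison used in the patch,
 * applied either to capacity_of() or capacity_orig_of() values. */
static bool src_is_weaker(unsigned long src_cap, unsigned long dst_cap)
{
	return src_cap < dst_cap;
}
```

With capacity_of() values (300 vs 446) the comparison passes and would allow a pointless LITTLE-to-LITTLE migration; with capacity_orig_of() values it only passes for a genuinely higher capacity destination (446 vs 1024), not within the cluster (446 vs 446).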
Thanks, Leo Yan
On 01/23/2017 11:42 PM, Leo Yan wrote:
On Mon, Jan 23, 2017 at 06:48:54PM -0800, Joonwoo Park wrote:
On Thu, Dec 22, 2016 at 11:58:50PM +0800, Leo Yan wrote:
Current code select busiest rq mostly consider only from weighted load this point of view; so it's possible to select one CPU with most weighted load value but this load are contributes by some small load tasks, on the other hand there have one another CPU with misfit task but it may have no chance to migrate task to higher capacity CPU.
This patch is to add one more checking for selection busiest rq if find only one misfit task on the rq and it's possible to migrate task from lower capacity CPU to higher capacity CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ed9fbed..1cf0e37 100755 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7777,6 +7777,24 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
/*
* After enable energy awared scheduling, it has higher
* priority to migrate misfit task rather than from most
* loaded CPU; E.g. one CPU with single misfit task and
* other CPUs with multiple lower load tasks, we should
* firstly make sure the misfit task can be migrated onto
* higher capacity CPU.
*/
if (energy_aware() &&
capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) &&
Do you want to check capacity_of() as well as capacity_orig_of()?
Here using capacity_orig_of() to distinguish the the src CPU and dst CPU have different 'original' capacity.
If check capacity_of(), this gives chance to migrate the task to the CPU within the same cluster, e.g. the src CPU is LITTLE CPU and it has one rt thread and one misfit task, and dst CPU is LITTLE CPU and it has no rt thread. So finally 'capacity_of(src_cpu) < capacity_of(dst_cpu)' is valid and migrate misfit task to another LITTLE CPU. This is pointless for misfit task.
Sorry for being late.
I meant to have both capacity_orig_of() and capacity_of(), thus:

if (energy_aware() &&
    capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) &&
    capacity_of(i) < capacity_of(env->dst_cpu))

capacity_orig_of() will take care of unnecessary migrations between CPUs in the same cluster. However, I think there are cases where the big cluster CPU, dst_cpu, has less capacity than the little CPU i — for example due to an RT task on the big cluster CPU as you mentioned above. Also, if the big cluster CPU's fmax gets mitigated for reasons like thermal, max capacity inversion could happen, I believe.

I tried rt-app with the json below; it looks like this is indeed happening.
Thanks, Joonwoo
---8<---
{
    "tasks" : {
        "ThreadA_other" : {
            "run" : 1000000, "sleep" : 1000, "loop" : 1000,
            "cpus" : [0,1,2,3], "policy" : "SCHED_OTHER",
        },
        "ThreadA" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [4], },
        "ThreadB" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [5], },
        "ThreadC" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [6], },
        "ThreadD" : { "run" : 1000000, "sleep" : 100, "loop" : 1000, "cpus" : [7], },
    },
    "global" : {
        "default_policy" : "SCHED_FIFO",
        "duration" : 5,
        "ftrace" : false,
        "gnuplot" : false,
        "logdir" : "/data/",
        "log_basename" : "rt-app",
        "lock_pages" : true,
        "frag" : 1,
        "calibration" : "CPU0",
    }
}
---
+ Leo,

Ugh.. mutt is acting up with me so I'm using thunderbird. I keep forgetting to reply-to-all instead of reply-to-list :)
On 02/02/2017 03:28 PM, Joonwoo Park wrote:
On 01/23/2017 11:42 PM, Leo Yan wrote:
On Mon, Jan 23, 2017 at 06:48:54PM -0800, Joonwoo Park wrote:
On Thu, Dec 22, 2016 at 11:58:50PM +0800, Leo Yan wrote:
Current code select busiest rq mostly consider only from weighted load this point of view; so it's possible to select one CPU with most weighted load value but this load are contributes by some small load tasks, on the other hand there have one another CPU with misfit task but it may have no chance to migrate task to higher capacity CPU.
This patch is to add one more checking for selection busiest rq if find only one misfit task on the rq and it's possible to migrate task from lower capacity CPU to higher capacity CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ed9fbed..1cf0e37 100755 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7777,6 +7777,24 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
/*
* After enable energy awared scheduling, it has higher
* priority to migrate misfit task rather than from most
* loaded CPU; E.g. one CPU with single misfit task and
* other CPUs with multiple lower load tasks, we should
* firstly make sure the misfit task can be migrated onto
* higher capacity CPU.
*/
if (energy_aware() &&
capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) &&
Do you want to check capacity_of() as well as capacity_orig_of()?
Here using capacity_orig_of() to distinguish the the src CPU and dst CPU have different 'original' capacity.
If check capacity_of(), this gives chance to migrate the task to the CPU within the same cluster, e.g. the src CPU is LITTLE CPU and it has one rt thread and one misfit task, and dst CPU is LITTLE CPU and it has no rt thread. So finally 'capacity_of(src_cpu) < capacity_of(dst_cpu)' is valid and migrate misfit task to another LITTLE CPU. This is pointless for misfit task.
Sorry for being late.
I meant to have both capacity_orig_of() as well as capacity_of() thus to be :
if (energy_aware() && capacity_orig_of(i) < capacity_orig_of(env->dst_cpu) && capacity_of(i) < capacity_of(env->dst_cpu))
capacity_orig_of() will take care of unnecessary migrations between same cluster CPUs. However I think there are cases cluster CPU, dst_cpu has less capacity than little CPU, i. For example due to rt task on the big cluster CPU as like you mentioned above. Also if big cluster CPU's fmax got mitigated for reasons like thermal, the max capacity inversion could happen I believe.
I tried rt-app with below json, looks like it's indeed happening.
Thanks, Joonwoo
---8<--- { "tasks" : { "ThreadA_other" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [0,1,2,3], "policy" : "SCHED_OTHER", }, "ThreadA" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [4], }, "ThreadB" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [5], }, "ThreadC" : { "run" : 1000000, "sleep" : 1000, "loop" : 1000, "cpus" : [6], }, "ThreadD" : { "run" : 1000000, "sleep" : 100, "loop" : 1000, "cpus" : [7], }, }, "global" : { "default_policy" : "SCHED_FIFO", "duration" : 5, "ftrace" : false, "gnuplot" : false, "logdir" : "/data/", "log_basename" : "rt-app", "lock_pages" : true, "frag" : 1, "calibration" : "CPU0", } }
Current code checks whether the destination CPU is overutilized before kicking active load balance, and bails out without kicking it if the destination CPU is overutilized.

This patch kicks active balance more aggressively: if the source CPU has a single misfit task and the destination CPU is doing idle balance, active balance is kicked so the misfit task can quickly migrate to a higher capacity CPU.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cf0e37..d0742a7 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7859,6 +7859,11 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}

+	if ((env->idle != CPU_NOT_IDLE) &&
+	    (capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
+	    env->src_rq->cfs.h_nr_running == 1 &&
+	    env->src_rq->misfit_task)
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
-- 2.7.4
On Thu, Dec 22, 2016 at 11:58:45PM +0800, Leo Yan wrote:
This patch series is to improve load balance with more proper behaviour for misfit task. Current code introduces type 'group_misfit_task' to indicate one schedule group has misfit task, but before the misfit task can be really migrated onto higher capacity CPU there still have some barriers we need clear up.
Add testing results for performance and power on Juno:

              EASv5.2      EASv5.2+Opt   Percentage
  Energy      1.8527174    1.9091532     +3.04%
  Linpack_MT  415.7738     426.6782      +2.62%
  Linpack_ST  223.5884     228.0928      +2.01%
The first patch is to correct task_fits_max() so it can properly filter out misfit task on low capacity CPU. If without this patch, in system it's possible this function can always return true so the 'misfit' task mechanism will totally not be triggered.
The second patch is to fix function group_smaller_cpu_capacity(), so we can make sure the schedule group with type 'group_misfit_task' will not wrongly be roll back to type 'group_other'. This will let all misfit related info be abondoned.
The third patch is to fix nr_running accounting, if without this patch the scheduler will wronly consider the destination CPU has running task and skip migrate task on it. This patch is to give correct info like the destination CPU has no running task on it when the CPU is going into idle state, so should migrate misfit task by utilizing this time balance.
The forth patch is a temperary patch if we have not backported Vincent's patches "sched: reflect sched_entity move into task_group's load" [1], If without this patch series, it's possible that the CPU is not overutilized but the CPU has one misfit task has been enqueued on it. So we set sgs->group_misfit_task by checking rq->misfit_task but not rely on cpu is overutilized or not.
The fifth patch is to select busiest rq if the rq has misfit task, we let this kind rq has higher priority than the rq with highest weighted load. This criteria is only enabled for energy aware scheduling.
The sixth patch is to aggressively kick active load balance for misfit task, so it has quite high chance for higher capacity CPU to immediately pull misfit task on it.
[1] https://lkml.org/lkml/2016/10/17/223
Leo Yan (6): sched/fair: correct task_fits_max() for misfit task sched/fair: fix for group_smaller_cpu_capacity() sched/fair: fix nr_running accounting for new idle CPU sched/fair: fix to set sgs->group_misfit_task sched/fair: select busiest rq with misfit task sched/fair: kick active load balance for misfit task
kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 46 insertions(+), 13 deletions(-)
-- 2.7.4