The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an over-utilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is over-utilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share the flag. In case of an overutilized cpu, the flag gets set at the level 1 sched_domain. The flag at the parent sched_domain level gets set in either of the following two scenarios:
1. There is a misfit task on one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
This implementation can still have corner scenarios with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 70% utilized tasks. Ideally this is a case for load balance to happen in a parent sched domain. But neither is the total group utilization high enough for a load balance to be triggered in the parent domain, nor is there a cpu with a single overutilized task so that a load balance is triggered in a parent domain. But again, this could be a purely academic scenario, as during task wake up these tasks will be placed more appropriately.
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
V2->V3:
- Rebased on latest kernel.
- The previous check for misfit task is replaced with the newly introduced rq->misfit_task flag.
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is triggered in a parent sched domain with asymmetric cpu capacities.
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 137 +++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |   8 +--
 4 files changed, 117 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 3137750..ae44044 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -88,6 +88,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	bool		overutilized;
 };

 struct sched_domain {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9ac67c..34bdfeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4791,6 +4791,29 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		return sd->shared->overutilized;
+	else
+		return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = false;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and

@@ -4800,6 +4823,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4843,9 +4867,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		add_nr_running(rq, 1);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new && !is_sd_overutilized(sd) &&
+		    cpu_overutilized(rq->cpu))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
 	}
 	hrtick_update(rq);
 }

@@ -6276,8 +6303,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	unsigned long max_spare = 0;
 	struct sched_domain *sd;

-	rcu_read_lock();
-
+	/* The rcu lock is/should be held in the caller function */
 	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));

 	if (!sd)

@@ -6315,8 +6341,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	}

 unlock:
-	rcu_read_unlock();
-
 	if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
 		return prev_cpu;

@@ -6350,10 +6374,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}

-	if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
-		return select_energy_cpu_brute(p, prev_cpu);
-
 	rcu_read_lock();
+	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+	if (energy_aware() &&
+	    !is_sd_overutilized(sd)) {
+		new_cpu = select_energy_cpu_brute(p, prev_cpu);
+		goto unlock;
+	}
+
+	sd = NULL;
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;

@@ -6418,6 +6448,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 		/* while loop will break here if sd == NULL */
 	}
+
+unlock:
 	rcu_read_unlock();
 	return new_cpu;

@@ -7478,6 +7510,7 @@ struct sd_lb_stats {
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */

 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */

@@ -7497,6 +7530,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,

@@ -7792,7 +7826,7 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *overutilized, bool *misfit_task)
 {
 	unsigned long load;
 	int i, nr_running;

@@ -7831,8 +7865,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		    !sgs->group_misfit_task && rq->misfit_task)
 			sgs->group_misfit_task = capacity_of(i);

-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a
+			 * misfit task.
+			 */
+			if (rq->misfit_task)
+				*misfit_task = true;
+		}
 	}

 	/* Adjust by relative CPU capacity of the group */

@@ -7974,12 +8016,12 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
+	struct sched_domain *child = env->sd->child, *sd;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, overutilized = false, misfit_task = false;
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;

@@ -8001,7 +8043,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-				   &overload, &overutilized);
+				   &overload, &overutilized,
+				   &misfit_task);

 		if (local_group)
 			goto next_group;

@@ -8032,6 +8075,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;

 		sg = sg->next;
 	} while (sg != env->sd->groups);

@@ -8045,14 +8089,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
+	}
-	/* Update over-utilization (tipping point, U >= 0) indicator */
-	if (env->dst_rq->rd->overutilized != overutilized)
-		env->dst_rq->rd->overutilized = overutilized;
-	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized)
-			env->dst_rq->rd->overutilized = true;
+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
+
+	/*
+	 * If there is a misfit task in one cpu in this sched_domain
+	 * it is likely that the imbalance cannot be sorted out among
+	 * the cpus in this sched_domain. In this case set the
+	 * overutilized flag at the parent sched_domain.
+	 */
+	if (misfit_task) {
+
+		sd = env->sd->parent;
+
+		/*
+		 * In case of a misfit task, load balance at the parent
+		 * sched domain level will make sense only if the cpus
+		 * have a different capacity. If cpus at a domain level have
+		 * the same capacity, the misfit task cannot be well
+		 * accommodated in any of the cpus and there is no point in
+		 * trying a load balance at this level.
+		 */
+		while (sd) {
+			if (sd->flags & SD_ASYM_CPUCAPACITY) {
+				set_sd_overutilized(sd);
+				break;
+			}
+			sd = sd->parent;
+		}
+	}
+
+	/*
+	 * If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+		set_sd_overutilized(env->sd->parent);
 }
 /**

@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;

@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.

@@ -9466,6 +9548,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;

 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);

@@ -9477,8 +9560,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
rq->misfit_task = !task_fits_capacity(curr, capacity_of(rq->cpu));
-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
-		rq->rd->overutilized = true;
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+	if (!is_sd_overutilized(sd) &&
+	    cpu_overutilized(task_cpu(curr)))
+		set_sd_overutilized(sd);
+	rcu_read_unlock();
 }

 /*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d27d5b..1604ef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -585,9 +585,6 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	bool overload;

-	/* Indicate one or more cpus over-utilized (tipping point) */
-	bool overutilized;
-
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 263e549..e5ba6fc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,11 +1040,11 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * For all levels sharing cache; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
sd->private = sdd;
--
2.1.4
On 06/23/2017 03:37 PM, Thara Gopinath wrote:
The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an over-utilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is over-utilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share the flag. In case of an overutilized cpu, the flag gets set at the level 1 sched_domain. The flag at the parent sched_domain level gets set in either of the following two scenarios.
- There is a misfit task in one of the cpu's in this sched_domain.
- The total utilization of the domain is greater than the domain capacity
The flag is cleared if no cpu in a sched domain is overutilized.
This implementation can still have corner scenarios with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 70% utilized tasks. Ideally this is a case for load balance to happen
I mentioned in my v2 review (Feb 2017) that this is only valid for #cpu in a cluster >=7.
in a parent sched domain. But neither is the total group utilization high enough for a load balance to be triggered in the parent domain, nor is there a cpu with a single overutilized task so that a load balance is triggered in a parent domain. But again, this could be a purely academic scenario, as during task wake up these tasks will be placed more appropriately.
test cases and test results:
Do you have test results for this patch running on a big.LITTLE platform? It would be really helpful to, let's say, have a set of synthetic workloads (rt-app based) like:

(1) Create an over-utilized scenario for one cpu
(2) Create an over-utilized scenario for one cluster
(3) Create a misfit scenario
(4) Create a combined (over-utilized and misfit) scenario
(5) Create an anti-test pattern (where the over-utilization is not detectable)
...
With a couple of extra trace events (some of them we already have in our product kernel, e.g. indicating over-utilization/non over-utilization sections) you could compare the test results between vanilla eas/next integration (20170522_1454) and the one with your patch in terms of how much time the system/a cluster was over-utilized or not.
I guess these results are necessary before we can take the next step, i.e. integrating this patch into the EAS product kernel.
Signed-off-by: Thara Gopinath thara.gopinath@linaro.org
V2->V3:
- Rebased on latest kernel.
- The previous check for misfit task is replaced with the newly introduced rq->misfit_task flag.
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is triggered in a parent sched domain with asymmetric cpu capacities.
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 137 +++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |   8 +--
 4 files changed, 117 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 3137750..ae44044 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -88,6 +88,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	bool		overutilized;
 };

 struct sched_domain {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9ac67c..34bdfeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4791,6 +4791,29 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		return sd->shared->overutilized;
+	else
+		return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = false;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and

@@ -4800,6 +4823,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4843,9 +4867,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		add_nr_running(rq, 1);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new && !is_sd_overutilized(sd) &&
Already raised in v2: IMHO, this is problematic. is_sd_overutilized() returns false if the sd is not overutilized or sd == NULL.
Shouldn't we change all of these occurrences where we fetch the sd from rcu_dereference() to:
if (... && sd && is_sd_overutilized(sd) && ..)
This would allow to get rid of the if() condition in [set|clear|is]_sd_overutilized() as well.
Nitpick: Could we rename is_sd_overutilized() into sd_overutilized() to be in line with cpu_overutilized()?
+		    cpu_overutilized(rq->cpu))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
 	}
 	hrtick_update(rq);
 }
@@ -6276,8 +6303,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	unsigned long max_spare = 0;
 	struct sched_domain *sd;

-	rcu_read_lock();
-
+	/* The rcu lock is/should be held in the caller function */
 	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));

 	if (!sd)

@@ -6315,8 +6341,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	}

 unlock:
-	rcu_read_unlock();
-
 	if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
 		return prev_cpu;
@@ -6350,10 +6374,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}

-	if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
-		return select_energy_cpu_brute(p, prev_cpu);
-
 	rcu_read_lock();
+	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+	if (energy_aware() &&
+	    !is_sd_overutilized(sd)) {
+		new_cpu = select_energy_cpu_brute(p, prev_cpu);
+		goto unlock;
+	}
+
+	sd = NULL;
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;

@@ -6418,6 +6448,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 		/* while loop will break here if sd == NULL */
 	}
+
+unlock:
 	rcu_read_unlock();
 	return new_cpu;

@@ -7478,6 +7510,7 @@ struct sd_lb_stats {
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */

 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */

@@ -7497,6 +7530,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,
@@ -7792,7 +7826,7 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *overutilized, bool *misfit_task)
 {
 	unsigned long load;
 	int i, nr_running;

@@ -7831,8 +7865,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		    !sgs->group_misfit_task && rq->misfit_task)
 			sgs->group_misfit_task = capacity_of(i);

-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a
+			 * misfit task.
+			 */
This comment doesn't apply anymore since you changed
'if (rq->cfs.h_nr_running == 1)' to 'rq->misfit_task':
+			if (rq->misfit_task)
+				*misfit_task = true;
+		}
 	}
/* Adjust by relative CPU capacity of the group */
@@ -7974,12 +8016,12 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
+	struct sched_domain *child = env->sd->child, *sd;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, overutilized = false, misfit_task = false;

 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -8001,7 +8043,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-				   &overload, &overutilized);
+				   &overload, &overutilized,
+				   &misfit_task);

 		if (local_group)
 			goto next_group;

@@ -8032,6 +8075,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;

 		sg = sg->next;
 	} while (sg != env->sd->groups);

@@ -8045,14 +8089,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
+	}
-	/* Update over-utilization (tipping point, U >= 0) indicator */
-	if (env->dst_rq->rd->overutilized != overutilized)
-		env->dst_rq->rd->overutilized = overutilized;
-	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized)
-			env->dst_rq->rd->overutilized = true;
+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
+
+	/*
+	 * If there is a misfit task in one cpu in this sched_domain
+	 * it is likely that the imbalance cannot be sorted out among
+	 * the cpus in this sched_domain. In this case set the
+	 * overutilized flag at the parent sched_domain.
+	 */
+	if (misfit_task) {
+
+		sd = env->sd->parent;
+
+		/*
+		 * In case of a misfit task, load balance at the parent
+		 * sched domain level will make sense only if the cpus
+		 * have a different capacity. If cpus at a domain level have
+		 * the same capacity, the misfit task cannot be well
+		 * accommodated in any of the cpus and there is no point in
+		 * trying a load balance at this level.
+		 */
+		while (sd) {
+			if (sd->flags & SD_ASYM_CPUCAPACITY) {
+				set_sd_overutilized(sd);
+				break;
+			}
+			sd = sd->parent;
+		}
+	}
+
+	/*
+	 * If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+		set_sd_overutilized(env->sd->parent);
 }
Like in v2, sched_domain_shared::overutilized is potentially set twice on DIE level, one time for misfit, one time for overutilized. I posted a possible code change during the v2 review.
 /**

@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;

@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.

@@ -9466,6 +9548,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;

 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -9477,8 +9560,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
rq->misfit_task = !task_fits_capacity(curr, capacity_of(rq->cpu));
-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
-		rq->rd->overutilized = true;
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+	if (!is_sd_overutilized(sd) &&
+	    cpu_overutilized(task_cpu(curr)))
+		set_sd_overutilized(sd);
+	rcu_read_unlock();
 }
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d27d5b..1604ef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -585,9 +585,6 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	bool overload;

-	/* Indicate one or more cpus over-utilized (tipping point) */
-	bool overutilized;
-
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 263e549..e5ba6fc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,11 +1040,11 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * For all levels sharing cache; connect a sched_domain_shared
 	 * instance.
 	 */
from v2: Nitpick: The original comment from Peter Z. doesn't make sense any more. We do this now regardless of SD_SHARE_PKG_RESOURCES being set or not.
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
sd->private = sdd;
On 06/30/2017 06:30 AM, Dietmar Eggemann wrote:

Hi Dietmar,
Sorry for the delay in posting the reply.
On 06/23/2017 03:37 PM, Thara Gopinath wrote:
The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an over-utilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is over-utilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share the flag. In case of an overutilized cpu, the flag gets set at the level 1 sched_domain. The flag at the parent sched_domain level gets set in either of the following two scenarios.
- There is a misfit task in one of the cpu's in this sched_domain.
- The total utilization of the domain is greater than the domain
capacity
The flag is cleared if no cpu in a sched domain is overutilized.
This implementation can still have corner scenarios with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 70% utilized tasks. Ideally this is a case for load balance to happen
I mentioned in my v2 review (Feb 2017) that this is only valid for #cpu in a cluster >=7.
Yes. But don't you think it is still worth mentioning?
in a parent sched domain. But neither is the total group utilization high enough for a load balance to be triggered in the parent domain, nor is there a cpu with a single overutilized task so that a load balance is triggered in a parent domain. But again, this could be a purely academic scenario, as during task wake up these tasks will be placed more appropriately.
test cases and test results:
Do you have test results for this patch running on a big.LITTLE platform? It would be really helpful to, let's say, have a set of synthetic workloads (rt-app based) like:

(1) Create an over-utilized scenario for one cpu
(2) Create an over-utilized scenario for one cluster
(3) Create a misfit scenario
(4) Create a combined (over-utilized and misfit) scenario
(5) Create an anti-test pattern (where the over-utilization is not detectable)
...
With a couple of extra trace events (some of them we already have in our product kernel, e.g. indicating over-utilization/non over-utilization sections) you could compare the test results between vanilla eas/next integration (20170522_1454) and the one with your patch in terms of how much time the system/a cluster was over-utilized or not.
I guess these results are necessary before we can take the next step, i.e. integrating this patch into the EAS product kernel.
I am working on the test cases. I will not be posting another version of this patch till we align on the testing, test cases and results.
Signed-off-by: Thara Gopinath thara.gopinath@linaro.org
V2->V3:
- Rebased on latest kernel.
- The previous check for misfit task is replaced with the newly introduced rq->misfit_task flag.
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is triggered in a parent sched domain with asymmetric cpu capacities.
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 137 +++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |   8 +--
 4 files changed, 117 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 3137750..ae44044 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -88,6 +88,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	bool		overutilized;
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9ac67c..34bdfeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4791,6 +4791,29 @@ static inline void hrtick_update(struct rq *rq)

 static bool cpu_overutilized(int cpu);

+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		return sd->shared->overutilized;
+	else
+		return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = false;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and

@@ -4800,6 +4823,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 	int task_new = !(flags & ENQUEUE_WAKEUP);

@@ -4843,9 +4867,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 	if (!se) {
 		add_nr_running(rq, 1);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new && !is_sd_overutilized(sd) &&
Already raised in v2: IMHO, this is problematic. is_sd_overutilized() returns false if the sd is not overutilized or sd == NULL.
Shouldn't we change all of these occurrences where we fetch the sd from rcu_dereference() to:
if (... && sd && is_sd_overutilized(sd) && ..)
I prefer the check being in a single place. But you are right, there is a problem in the is_sd_overutilized() fn. I can change this function from bool to int with multiple return values. We are only interested in !is_sd_overutilized(). So I can return -1 in case of !sd.
I do not see the problem with set_sd_overutilized and clear_sd_overutilized.
This would allow to get rid of the if() condition in [set|clear|is]_sd_overutilized() as well.
Nitpick: Could we rename is_sd_overutilized() into sd_overutilized() to be in line with cpu_overutilized()?
Sure.
+		    cpu_overutilized(rq->cpu))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
 	}
 	hrtick_update(rq);
 }
@@ -6276,8 +6303,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	unsigned long max_spare = 0;
 	struct sched_domain *sd;

-	rcu_read_lock();
-
+	/* The rcu lock is/should be held in the caller function */
 	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));

 	if (!sd)

@@ -6315,8 +6341,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	}

 unlock:
-	rcu_read_unlock();
-
 	if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
 		return prev_cpu;

@@ -6350,10 +6374,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}

-	if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
-		return select_energy_cpu_brute(p, prev_cpu);
-
 	rcu_read_lock();
+	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+	if (energy_aware() &&
+	    !is_sd_overutilized(sd)) {
+		new_cpu = select_energy_cpu_brute(p, prev_cpu);
+		goto unlock;
+	}
+
+	sd = NULL;
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;

@@ -6418,6 +6448,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 		/* while loop will break here if sd == NULL */
 	}
+unlock: rcu_read_unlock(); return new_cpu; @@ -7478,6 +7510,7 @@ struct sd_lb_stats { struct sched_group *local; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */
- unsigned long total_util; /* Total util of all groups in sd */ unsigned long avg_load; /* Average load across all groups in
sd */ struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ @@ -7497,6 +7530,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .local = NULL, .total_load = 0UL, .total_capacity = 0UL,
.total_util = 0UL, .busiest_stat = { .avg_load = 0UL, .sum_nr_running = 0,
@@ -7792,7 +7826,7 @@ group_type group_classify(struct sched_group *group,
 static inline void
 update_sg_lb_stats(struct lb_env *env, struct sched_group *group,
 		   int load_idx, int local_group, struct sg_lb_stats *sgs,
-		   bool *overload, bool *overutilized)
+		   bool *overload, bool *overutilized, bool *misfit_task)
 {
 	unsigned long load;
 	int i, nr_running;

@@ -7831,8 +7865,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	    !sgs->group_misfit_task && rq->misfit_task)
 		sgs->group_misfit_task = capacity_of(i);

-	if (cpu_overutilized(i))
+	if (cpu_overutilized(i)) {
 		*overutilized = true;
+		/*
+		 * If the cpu is overutilized and if there is only one
+		 * current task in cfs runqueue, it is potentially a misfit
+		 * task.
+		 */
This comment doesn't apply anymore since you changed 'if (rq->cfs.h_nr_running == 1)' to 'rq->misfit_task':

Will remove.
+		if (rq->misfit_task)
+			*misfit_task = true;
+	}

 	/* Adjust by relative CPU capacity of the group */
@@ -7974,12 +8016,12 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
+	struct sched_domain *child = env->sd->child, *sd;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, overutilized = false, misfit_task = false;

 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;

@@ -8001,7 +8043,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}

 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-				   &overload, &overutilized);
+				   &overload, &overutilized,
+				   &misfit_task);

 		if (local_group)
 			goto next_group;

@@ -8032,6 +8075,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;

 		sg = sg->next;
 	} while (sg != env->sd->groups);

@@ -8045,14 +8089,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
-
-		/* Update over-utilization (tipping point, U >= 0) indicator */
-		if (env->dst_rq->rd->overutilized != overutilized)
-			env->dst_rq->rd->overutilized = overutilized;
-	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized)
-			env->dst_rq->rd->overutilized = true;
 	}

+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
+
+	/*
+	 * If there is a misfit task in one cpu in this sched_domain
+	 * it is likely that the imbalance cannot be sorted out among
+	 * the cpus in this sched_domain. In this case set the
+	 * overutilized flag at the parent sched_domain.
+	 */
+	if (misfit_task) {
+		sd = env->sd->parent;
+
+		/*
+		 * In case of a misfit task, load balance at the parent
+		 * sched domain level will make sense only if the cpus
+		 * have a different capacity. If cpus at a domain level have
+		 * the same capacity, the misfit task cannot be well
+		 * accommodated in any of the cpus and there is no point in
+		 * trying a load balance at this level.
+		 */
+		while (sd) {
+			if (sd->flags & SD_ASYM_CPUCAPACITY) {
+				set_sd_overutilized(sd);
+				break;
+			}
+			sd = sd->parent;
+		}
+	}
+
+	/*
+	 * If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+		set_sd_overutilized(env->sd->parent);
 }
Like in v2, sched_domain_shared::overutilized is potentially set twice on DIE level, one time for misfit, one time for overutilized. I posted a possible code change during the v2 review.
I am not sure I understand the issue here.
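For anyone reading along, the tipping-point check under discussion compares utilization and capacity in the kernel's fixed-point units (per-cpu capacity 1024); with capacity_margin = 1280 (~20% headroom, the value used elsewhere in fair.c), a domain counts as over-utilized once total utilization exceeds roughly 80% of total capacity. A small sketch of the check in isolation, assumptions as stated:

```c
#include <stdbool.h>

static const unsigned long capacity_margin = 1280;	/* ~20% headroom */

/* The domain-level tipping-point test from update_sd_lb_stats(), factored out. */
static bool domain_overutilized(unsigned long total_capacity,
				unsigned long total_util)
{
	return total_capacity * 1024 < total_util * capacity_margin;
}
```

For a 4-cpu domain (total capacity 4096), the threshold sits at util > 3276 (80% of 4096): a total utilization of 3000 stays below it, 3300 trips it.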
@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.

@@ -9466,6 +9548,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;

 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);

@@ -9477,8 +9560,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		rq->misfit_task = !task_fits_capacity(curr, capacity_of(rq->cpu));

-		if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!is_sd_overutilized(sd) &&
+		    cpu_overutilized(task_cpu(curr)))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
 	}

 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d27d5b..1604ef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -585,9 +585,6 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	bool overload;

-	/* Indicate one or more cpus over-utilized (tipping point) */
-	bool overutilized;
-
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 263e549..e5ba6fc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,11 +1040,11 @@ sd_init(struct sched_domain_topology_level *tl,
 	/*
 	 * For all levels sharing cache; connect a sched_domain_shared
 	 * instance.
 	 */
from v2: Nitpick: The original comment from Peter Z. doesn't make sense any more. We do this now regardless of SD_SHARE_PKG_RESOURCES being set or not.
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}

 	sd->private = sdd;
-- Regards Thara
On 29/08/17 23:49, Thara Gopinath wrote:
On 06/30/2017 06:30 AM, Dietmar Eggemann wrote: Hi Dietmar,
Sorry for the delay in posting the reply.
On 06/23/2017 03:37 PM, Thara Gopinath wrote:
[...]
This implementation still can have corner scenarios with respect to misfit tasks. For example consider a sched group with n cpus and n+1 70%utilized tasks. Ideally this is a case for load balance to happen
I mentioned in my v2 review (Feb 2017) that this is only valid for #cpu in a cluster >=7.
Yes. But don't you think it is still worth mentioning?
I agree that it is still worth mentioning your example but you should add the part that it is only valid for #cpu in the cluster >=7.
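The >= 7 bound can be checked with the tipping-point arithmetic itself (a sketch only, assuming per-cpu capacity 1024, capacity_margin = 1280, and n + 1 tasks at just under 70% utilization each, i.e. 716/1024): the parent-level check amounts to 0.7(n + 1) > 0.8n, which only holds for n < 7.

```c
#include <stdbool.h>

static const unsigned long capacity_margin = 1280;	/* ~20% headroom */

/*
 * n cpus of capacity 1024 each, running n + 1 tasks of ~70% utilization
 * (716/1024 each): does the domain-level tipping-point check fire?
 */
static bool group_check_fires(unsigned long n)
{
	unsigned long total_capacity = n * 1024;
	unsigned long total_util = (n + 1) * 716;

	return total_capacity * 1024 < total_util * capacity_margin;
}
```

With 6 cpus the group-level check still fires, so the corner case only opens up at 7 or more cpus per cluster, matching the review comment above.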
[...]
Do you have test results for this patch running on a big-Little platform? It would be really helpful to let's say have a set of synthetic workloads (rt-app based) like:
(1) Create a over-utilized scenario for one cpu (2) Create over-utilized scenario for one cluster (3) Create a misfit scenario (4) Create a combined (over-utilized and misfit) scenario (5) Create an anti-test pattern (where the over-utilization is not detectable) ...
With a couple of extra trace events (some of them we already have in our product kernel, e.g. indicating over-utilization/non over-utilization sections) you could compare the test results between vanilla eas/next integration (20170522_1454) and the one with your patch in terms of how much time the system/a cluster was over-utilized or not.
I guess these results are necessary before we can take the next step, i.e. integrating this patch into the EAS product kernel.
I am working on the test cases. I will not be posting another version of this patch till we align on the testing, test cases and results.
Agreed. It would be helpful to post these test results with the new patch version then.
[...]
Already raised in v2: IMHO, this is problematic. is_sd_overutilized() returns false if the sd is not overutilized or sd == NULL.
Shouldn't we change all of these occurrences where we fetch the sd from rcu_dereference() to:
if (... && sd && is_sd_overutilized(sd) && ..)
I prefer the check being in a single place. But you are right there is a problem in the is_sd_overutilized fn. I can change this function from bool to int with multiple return value. We are only interested in !is_sd_overutilized(). So I can return -1 in case of !sd.
As long as you can distinguish between !sd and !sd_not_overutilized that's fine with me.
I do not see the problem with set_sd_overutilized and clear_sd_overutilized.
There is no issue with these functions. I just meant that if you test for sd in the condition you can get rid of the if (sd) in those functions.
[...]
On 23/06/17 15:37, Thara Gopinath wrote:
The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an over-utilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is over-utilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag introduced is placed in this structure so that all the sched domains at the same level share the flag. In case of an over-utilized cpu, the flag gets set at the level-1 sched_domain. The flag at the parent sched_domain level gets set in either of the two following scenarios.
- There is a misfit task in one of the cpus in this sched_domain.
- The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is over-utilized.
This implementation can still have corner scenarios with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 70%-utilized tasks. Ideally this is a case for load balance to happen in a parent sched domain. But neither is the total group utilization high enough for a load balance to be triggered in the parent domain, nor is there a cpu with a single over-utilized task so that a load balance is triggered in the parent domain. But again, this could be a purely academic scenario, as during task wake-up these tasks will be placed more appropriately.
Signed-off-by: Thara Gopinath thara.gopinath@linaro.org
V2->V3:
- Rebased on latest kernel.
- The previous check for a misfit task is replaced with the newly introduced rq->misfit_task flag.
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of a misfit task, it is ensured that a load balance is triggered in a parent sched domain with asymmetric cpu capacities.
include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++-------- kernel/sched/sched.h | 3 - kernel/sched/topology.c | 8 +-- 4 files changed, 117 insertions(+), 32 deletions(-)
[...]
@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env) */ update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
goto out_balanced;
if (energy_aware()) {
if (!is_sd_overutilized(env->sd))
goto out_balanced;
}
local = &sds.local_stat; busiest = &sds.busiest_stat;
@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock(); for_each_domain(cpu, sd) {
if (energy_aware()) {
if (!is_sd_overutilized(sd))
continue;
}
While doing the integration of the sd over-utilization thing into the next EAS version I realized that this continue statement is not in the version in which we use rd->overutilized.
Not sure if we already talked about why you added it here? Eventually, the 'goto out_balanced' would trigger in load_balance() -> find_busiest_group().
[...]
Hi Thara,
On 23/06/17 15:37, Thara Gopinath wrote:
[...]
This is what I had to do to apply this patch to the next EAS integration. Have a look and tell me if you spot any issues.
Thanks,
-- Dietmar
-- >8 --
From 1500a85a733af8590c1545928eb589e73a67de57 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann dietmar.eggemann@arm.com Date: Thu, 5 Oct 2017 16:52:24 +0100 Subject: [PATCH] Tweaking [PATCH V3] Per Sched domain over utilization into the next eas_int
These are the changes I did to integrate this into eas_int:
(1) Incorporating my v3 review comments. (2) Adaptations in update_sd_lb_stats() due to new mainline base. (3) Move overutilized check behind 'decay the newidle max times' in rebalance_domains()
Signed-off-by: Dietmar Eggemann dietmar.eggemann@arm.com --- kernel/sched/fair.c | 61 ++++++++++++++++++------------------------------- kernel/sched/topology.c | 4 ---- 2 files changed, 22 insertions(+), 43 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cedda74f1f7..aa1388cc673d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4882,27 +4882,19 @@ static inline void hrtick_update(struct rq *rq)

 static bool cpu_overutilized(int cpu);

-static bool
-is_sd_overutilized(struct sched_domain *sd)
+static bool sd_overutilized(struct sched_domain *sd)
 {
-	if (sd)
-		return sd->shared->overutilized;
-	else
-		return false;
+	return sd->shared->overutilized;
 }

-static void
-set_sd_overutilized(struct sched_domain *sd)
+static void set_sd_overutilized(struct sched_domain *sd)
 {
-	if (sd)
-		sd->shared->overutilized = true;
+	sd->shared->overutilized = true;
 }

-static void
-clear_sd_overutilized(struct sched_domain *sd)
+static void clear_sd_overutilized(struct sched_domain *sd)
 {
-	if (sd)
-		sd->shared->overutilized = false;
+	sd->shared->overutilized = false;
 }

 /*

@@ -4960,8 +4952,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		add_nr_running(rq, 1);
 		rcu_read_lock();
 		sd = rcu_dereference(rq->sd);
-		if (!task_new && !is_sd_overutilized(sd) &&
-		    cpu_overutilized(rq->cpu))
+		if (!task_new && sd && !sd_overutilized(sd) &&
+		    cpu_overutilized(rq->cpu))
 			set_sd_overutilized(sd);
 		rcu_read_unlock();
 	}

@@ -6301,7 +6293,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	unsigned long max_spare = 0;
 	struct sched_domain *sd;

-	/* The rcu lock is/should be held in the caller function */
 	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
 	if (!sd)

@@ -6374,8 +6365,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	rcu_read_lock();
 	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
-	if (energy_aware() &&
-	    !is_sd_overutilized(sd)) {
+	if (energy_aware() && sd && !sd_overutilized(sd)) {
 		new_cpu = select_energy_cpu_brute(p, prev_cpu);
 		goto unlock;
 	}

@@ -7908,11 +7898,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (cpu_overutilized(i)) {
 		*overutilized = true;
-		/*
-		 * If the cpu is overutilized and if there is only one
-		 * current task in cfs runqueue, it is potentially a misfit
-		 * task.
-		 */
+
 		if (rq->misfit_task)
 			*misfit_task = true;
 	}

@@ -8165,13 +8151,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}
 	}

-	/* If the domain util is greater than domain capacity, load balancing
-	 * needs to be done at the next sched domain level as well
+	/*
+	 * If the domain util is greater than domain capacity, load balancing
+	 * needs to be done at the next sched domain level as well.
 	 */
-	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+	if (lb_sd_parent(env->sd) &&
+	    sds->total_capacity * 1024 < sds->total_util * capacity_margin)
 		set_sd_overutilized(env->sd->parent);

-	if (!shared)
+	if (!(env->sd->flags & SD_SHARE_PKG_RESOURCES))
 		return;

 	/*

@@ -8412,10 +8400,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware()) {
-		if (!is_sd_overutilized(env->sd))
-			goto out_balanced;
-	}
+	if (energy_aware() && !sd_overutilized(env->sd))
+		goto out_balanced;

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;

@@ -9309,11 +9295,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		if (energy_aware()) {
-			if (!is_sd_overutilized(sd))
-				continue;
-		}
-
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.

@@ -9326,6 +9307,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 		max_cost += sd->max_newidle_lb_cost;

+		if (energy_aware() && !sd_overutilized(sd))
+			continue;
+
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;

@@ -9630,8 +9614,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	rcu_read_lock();
 	sd = rcu_dereference(rq->sd);
-	if (!is_sd_overutilized(sd) &&
-	    cpu_overutilized(task_cpu(curr)))
+	if (sd && !sd_overutilized(sd) && cpu_overutilized(task_cpu(curr)))
 		set_sd_overutilized(sd);
 	rcu_read_unlock();
 }

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 42a3ede19a5c..7fe59856a34f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1274,10 +1274,6 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 1;
 	}

-	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
-	 * instance.
-	 */
 	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 	atomic_inc(&sd->shared->ref);
--
2.11.0
Hello Dietmar,
On 10/05/2017 02:45 PM, Dietmar Eggemann wrote:
Hi Thara,
On 23/06/17 15:37, Thara Gopinath wrote:
[...]
This is what I had to do to apply this patch to the next EAS integration. Have a look and tell me if you spot any issues.
Thanks,
-- Dietmar
-- >8 --
From 1500a85a733af8590c1545928eb589e73a67de57 Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann dietmar.eggemann@arm.com Date: Thu, 5 Oct 2017 16:52:24 +0100 Subject: [PATCH] Tweaking [PATCH V3] Per Sched domain over utilization into the next eas_int
These are the changes I did to integrate this into eas_int:
(1) Incorporating my v3 review comments. (2) Adaptations in update_sd_lb_stats() due to new mainline base.
I did not understand this change. But I guess you have some extra stuff in the integration branch that I cannot see. Otherwise I am okay with the changes.
Regards Thara
(3) Move overutilized check behind 'decay the newidle max times' in rebalance_domains()
[...]
On 06/10/17 21:36, Thara Gopinath wrote:
Hello Dietmar,
On 10/05/2017 02:45 PM, Dietmar Eggemann wrote:
Hi Thara,
On 23/06/17 15:37, Thara Gopinath wrote:
[...]
This is what I had to do to apply this patch to the next EAS integration. Have a look and tell me if you spot any issues.
Thanks,
-- Dietmar
-- >8 --
From 1500a85a733af8590c1545928eb589e73a67de57 Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann dietmar.eggemann@arm.com Date: Thu, 5 Oct 2017 16:52:24 +0100 Subject: [PATCH] Tweaking [PATCH V3] Per Sched domain over utilization into the next eas_int
These are the changes I did to integrate this into eas_int:
(1) Incorporating my v3 review comments. (2) Adaptations in update_sd_lb_stats() due to new mainline base.
I did not understand this change. But I guess you have some extra stuff in the integration branch that I cannot see. Otherwise I am okay with the changes.
Thanks! Yes it's related to Peterz patch "sched/fair: Fix wake_affine() for !NUMA_BALANCING" which is currently part of tip/sched/core. We do use tip/sched/core for our eas integrations.
-- Dietmar
On 10/09/2017 08:33 AM, Dietmar Eggemann wrote:
On 06/10/17 21:36, Thara Gopinath wrote:
Hello Dietmar,
On 10/05/2017 02:45 PM, Dietmar Eggemann wrote:
Hi Thara,
On 23/06/17 15:37, Thara Gopinath wrote:
[...]
[...]
These are the changes I did to integrate this into eas_int:
(1) Incorporating my v3 review comments. (2) Adaptations in update_sd_lb_stats() due to new mainline base.
I did not understand this change. But I guess you have some extra stuff in the integration branch that I cannot see. Otherwise I am okay with the changes.
Thanks! Yes it's related to Peterz patch "sched/fair: Fix wake_affine() for !NUMA_BALANCING" which is currently part of tip/sched/core. We do use tip/sched/core for our eas integrations.
Thanks! I will try to access tip/sched/core and take a look at it.
-- Dietmar
-- Regards Thara