The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces an over-utilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is over-utilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share it. When a cpu becomes overutilized, the flag is set at the first-level sched_domain. The flag at the parent sched_domain level is set in either of the two following scenarios:

1. There is a misfit task on one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
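The flag hierarchy described above can be illustrated with a small userspace model. This is a sketch only: the two-level topology, the structure and function names are assumptions for illustration, not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical model: one shared structure per sched-domain level, so
 * every domain instance at that level observes the same flag value. */
#define NR_LEVELS 2		/* e.g. first level and its parent */

struct sd_shared_model {
	bool overutilized;
};

static struct sd_shared_model shared[NR_LEVELS];

/* An overutilized cpu sets the flag at the first-level domain. */
static void mark_cpu_overutilized(void)
{
	shared[0].overutilized = true;
}

/* The parent flag is set on a misfit task, or when the domain's total
 * utilization exceeds the domain's capacity (the two scenarios above). */
static void update_parent(int level, bool misfit,
			  unsigned long util, unsigned long capacity)
{
	if (level + 1 >= NR_LEVELS)
		return;		/* no parent above the top level */
	if (misfit || util > capacity)
		shared[level + 1].overutilized = true;
}
```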
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   |   7 ++-
 kernel/sched/fair.c   | 120 +++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c5122e..971842a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1112,6 +1112,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	bool		overutilized;
 };
 struct sched_domain {

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31a466f..e0a8758 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * For all levels sharing cache; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489f6d3..485f597 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		return sd->shared->overutilized;
+	else
+		return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = false;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		add_nr_running(rq, 1);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new && !is_sd_overutilized(sd) &&
+		    cpu_overutilized(rq->cpu))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
 	}
 	hrtick_update(rq);
 }

@@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	unsigned long max_spare = 0;
 	struct sched_domain *sd;

-	rcu_read_lock();
-
+	/* The rcu lock is/should be held in the caller function */
 	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));

 	if (!sd)

@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
 	}

 unlock:
-	rcu_read_unlock();
-
 	if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
 		return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}

-	if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
-		return select_energy_cpu_brute(p, prev_cpu);
-
 	rcu_read_lock();
+	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+	if (energy_aware() &&
+	    !is_sd_overutilized(sd)) {
+		new_cpu = select_energy_cpu_brute(p, prev_cpu);
+		goto unlock;
+	}
+
+	sd = NULL;
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;

@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 		/* while loop will break here if sd == NULL */
 	}
+
+unlock:
 	rcu_read_unlock();
 	return new_cpu;

@@ -7366,6 +7399,7 @@ struct sd_lb_stats {
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */

@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,

@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *overutilized, bool *misfit_task)
 {
 	unsigned long load;
 	int i, nr_running;

@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
 			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a
+			 * misfit task.
+			 */
+			if (rq->cfs.h_nr_running == 1)
+				*misfit_task = true;
+		}
 	}
 	/* Adjust by relative CPU capacity of the group */

@@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, overutilized = false, misfit_task = false;

 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;

@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}

 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload, &overutilized);
+						&overload, &overutilized,
+						&misfit_task);

 		if (local_group)
 			goto next_group;

@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;

 		sg = sg->next;
 	} while (sg != env->sd->groups);

@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
-
-		/* Update over-utilization (tipping point, U >= 0) indicator */
-		if (env->dst_rq->rd->overutilized != overutilized)
-			env->dst_rq->rd->overutilized = overutilized;
-	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized)
-			env->dst_rq->rd->overutilized = true;
 	}
+
+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
+
+	/*
+	 * If there is a misfit task in one cpu in this sched_domain
+	 * it is likely that the imbalance cannot be sorted out among
+	 * the cpus in this sched_domain. In this case set the
+	 * overutilized flag at the parent sched_domain.
+	 */
+	if (misfit_task)
+		set_sd_overutilized(env->sd->parent);
+
+	/*
+	 * If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+		set_sd_overutilized(env->sd->parent);
 }
/**

@@ -8122,8 +8180,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;

@@ -8981,6 +9041,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.

@@ -9280,6 +9345,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;

 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);

@@ -9289,8 +9355,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);

-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
-		rq->rd->overutilized = true;
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+	if (!is_sd_overutilized(sd) &&
+	    cpu_overutilized(task_cpu(curr)))
+		set_sd_overutilized(sd);
+	rcu_read_unlock();
 }
 	/*

--
2.1.4
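The domain-level tipping point added to update_sd_lb_stats() in the patch above reduces to one scaled integer comparison. A userspace sketch follows; it assumes capacity_margin = 1280 (the roughly 20% margin value used by cpu_overutilized() in this tree), and the function name is made up for illustration:

```c
#include <stdbool.h>

/* SCHED_CAPACITY_SCALE is the fixed-point scale used by the scheduler. */
#define SCHED_CAPACITY_SCALE 1024UL

/* Assumed value; gives util > ~80% of capacity as the tipping point. */
static const unsigned long capacity_margin = 1280;

/* Models: sds->total_capacity * 1024 < sds->total_util * capacity_margin.
 * Kept in integer math, exactly as the patch writes it. */
static bool sd_util_exceeds_capacity(unsigned long total_capacity,
				     unsigned long total_util)
{
	return total_capacity * SCHED_CAPACITY_SCALE <
	       total_util * capacity_margin;
}
```

With these constants, a domain of capacity 1024 tips once its summed utilization crosses 1024 * 1024 / 1280 = 819.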
Hi Thara,
On Wed, Jan 25, 2017 at 09:17:08AM -0500, Thara Gopinath wrote:
The current implementation of overutilization, aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces over utilization flag per sched domain level instead of a single flag system wide. Load balancing is done at the sched domain where any of the cpu is over utilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag introduced is placed in this structure so that all the sched domains the same level share the flag. In case of an overutilized cpu, the flag gets set at level1 sched_domain. The flag at the parent sched_domain level gets set in either of the two following scenarios.
- There is a misfit task in one of the cpu's in this sched_domain.
- The total utilization of the domain is greater than the domain capacity
The flag is cleared if no cpu in a sched domain is overutilized.
Signed-off-by: Thara Gopinath thara.gopinath@linaro.org
include/linux/sched.h | 1 + kernel/sched/core.c | 7 ++- kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++----------- 3 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c5122e..971842a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1112,6 +1112,7 @@ struct sched_domain_shared { atomic_t ref; atomic_t nr_busy_cpus; int has_idle_cores;
- bool overutilized;
};
struct sched_domain { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 31a466f..e0a8758 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl, * For all levels sharing cache; connect a sched_domain_shared * instance. */
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- if (sd->flags & SD_SHARE_PKG_RESOURCES) atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
This is based on the 'shared' sched domain approach Dietmar mentioned, right?
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 489f6d3..485f597 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool +is_sd_overutilized(struct sched_domain *sd) +{
- if (sd)
return sd->shared->overutilized;
- else
return false;
+}
+static void +set_sd_overutilized(struct sched_domain *sd) +{
- if (sd)
sd->shared->overutilized = true;
+}
+static void +clear_sd_overutilized(struct sched_domain *sd) +{
- if (sd)
sd->shared->overutilized = false;
+}
/*
- The enqueue_task method is called before nr_running is
- increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq;
- struct sched_domain *sd; struct sched_entity *se = &p->se; int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) { add_nr_running(rq, 1);
if (!task_new && !rq->rd->overutilized &&
cpu_overutilized(rq->cpu))
rq->rd->overutilized = true;
rcu_read_lock();
sd = rcu_dereference(rq->sd);
if (!task_new && !is_sd_overutilized(sd) &&
cpu_overutilized(rq->cpu))
set_sd_overutilized(sd);
rcu_read_unlock();
}
hrtick_update(rq);
} @@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu) unsigned long max_spare = 0; struct sched_domain *sd;
- rcu_read_lock();
/* The rcu lock is/should be held in the caller function */ sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu) }
unlock:
- rcu_read_unlock();
- if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu)) return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); }
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
return select_energy_cpu_brute(p, prev_cpu);
- rcu_read_lock();
- sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
- if (energy_aware() &&
!is_sd_overutilized(sd)) {
new_cpu = select_energy_cpu_brute(p, prev_cpu);
goto unlock;
- }
- sd = NULL;
- for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) break;
@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f } /* while loop will break here if sd == NULL */ }
+unlock: rcu_read_unlock();
return new_cpu; @@ -7366,6 +7399,7 @@ struct sd_lb_stats { struct sched_group *local; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long total_util; /* Total util of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .local = NULL, .total_load = 0UL, .total_capacity = 0UL,
.total_util = 0UL,
.busiest_stat = {
	.avg_load = 0UL,
	.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group, static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs,
bool *overload, bool *overutilized)
bool *overload, bool *overutilized, bool *misfit_task)
{ unsigned long load; int i, nr_running; @@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, if (!nr_running && idle_cpu(i)) sgs->idle_cpus++;
if (cpu_overutilized(i))
if (cpu_overutilized(i)) { *overutilized = true;
/*
* If the cpu is overutilized and if there is only one
* current task in cfs runqueue, it is potentially a misfit
* task.
*/
if (rq->cfs.h_nr_running == 1)
*misfit_task = true;
Can we also check rq->misfit? E.g. if one big task is enqueued onto a rq, rq->misfit is set, but the CPU utilization will take a long time to cross the 'overutilized' threshold; so if we check rq->misfit we can quickly know there is a misfit task on that rq.
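The heuristic under discussion, extended with the suggested rq->misfit check, can be sketched as a userspace model. The struct below only mimics the relevant rq fields and is an assumption for illustration, not the kernel layout:

```c
#include <stdbool.h>

/* Hypothetical flattened view of the rq state the check consults. */
struct rq_model {
	bool overutilized;		/* result of cpu_overutilized(i) */
	unsigned int h_nr_running;	/* rq->cfs.h_nr_running */
	bool misfit;			/* rq->misfit, set at enqueue */
};

/* A cpu hosts a potential misfit task either when its misfit flag is
 * already raised (fast path, before utilization ramps up) or when it
 * is overutilized with a single runnable cfs task (the patch's test). */
static bool detect_misfit(const struct rq_model *rq)
{
	if (rq->misfit)
		return true;
	return rq->overutilized && rq->h_nr_running == 1;
}
```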
}
}
/* Adjust by relative CPU capacity of the group */
@@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sched_group *sg = env->sd->groups; struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd }
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
&overload, &overutilized);
&overload, &overutilized,
&misfit_task);
if (local_group) goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity;
sds->total_util += sgs->group_util;
sg = sg->next; } while (sg != env->sd->groups);
@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* update overload indicator if we are at root domain */ if (env->dst_rq->rd->overload != overload) env->dst_rq->rd->overload = overload;
/* Update over-utilization (tipping point, U >= 0) indicator */
if (env->dst_rq->rd->overutilized != overutilized)
env->dst_rq->rd->overutilized = overutilized;
- } else {
if (!env->dst_rq->rd->overutilized && overutilized)
env->dst_rq->rd->overutilized = true;
So can we use one patch to remove 'rd->overutilized'?
}
- if (overutilized)
set_sd_overutilized(env->sd);
- else
clear_sd_overutilized(env->sd);
If only one CPU is overutilized, is it possible here to set the 'overutilized' flag for the 'shared' sched domain at the second level, and finally introduce lb between two clusters?
- /*
* If there is a misfit task in one cpu in this sched_domain
* it is likely that the imbalance cannot be sorted out among
* the cpu's in this sched_domain. In this case set the
* overutilized flag at the parent sched_domain.
*/
- if (misfit_task)
set_sd_overutilized(env->sd->parent);
I have the same question as the comment above. If 'env->sd' is the second-level sched domain, shouldn't this use 'env->sd' rather than 'env->sd->parent'?
- /* If the domain util is greater that domain capacity, load balancing
* needs to be done at the next sched domain level as well
*/
- if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
set_sd_overutilized(env->sd->parent);
}
/** @@ -8122,8 +8180,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env) */ update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
goto out_balanced;
if (energy_aware()) {
if (!is_sd_overutilized(env->sd))
goto out_balanced;
}
local = &sds.local_stat; busiest = &sds.busiest_stat;
@@ -8981,6 +9041,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock(); for_each_domain(cpu, sd) {
if (energy_aware()) {
if (!is_sd_overutilized(sd))
continue;
}
- /*
- Decay the newidle max times here because this is a regular
- visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9345,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) { struct cfs_rq *cfs_rq; struct sched_entity *se = &curr->se;
struct sched_domain *sd;
for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se);
@@ -9289,8 +9355,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
rq->rd->overutilized = true;
- rcu_read_lock();
- sd = rcu_dereference(rq->sd);
- if (!is_sd_overutilized(sd) &&
cpu_overutilized(task_cpu(curr)))
set_sd_overutilized(sd);
- rcu_read_unlock();
}
/*
2.1.4
Hello Leo,
Thanks for the review.
On 01/25/2017 06:19 PM, Leo Yan wrote:
Hi Thara,
On Wed, Jan 25, 2017 at 09:17:08AM -0500, Thara Gopinath wrote:
The current implementation of overutilization, aborts energy aware scheduling if any cpu in the system is over-utilized. This patch introduces over utilization flag per sched domain level instead of a single flag system wide. Load balancing is done at the sched domain where any of the cpu is over utilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag introduced is placed in this structure so that all the sched domains the same level share the flag. In case of an overutilized cpu, the flag gets set at level1 sched_domain. The flag at the parent sched_domain level gets set in either of the two following scenarios.
- There is a misfit task in one of the cpu's in this sched_domain.
- The total utilization of the domain is greater than the domain capacity
The flag is cleared if no cpu in a sched domain is overutilized.
Signed-off-by: Thara Gopinath thara.gopinath@linaro.org
include/linux/sched.h | 1 + kernel/sched/core.c | 7 ++- kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++----------- 3 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 1c5122e..971842a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1112,6 +1112,7 @@ struct sched_domain_shared { atomic_t ref; atomic_t nr_busy_cpus; int has_idle_cores;
- bool overutilized;
};
struct sched_domain { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 31a466f..e0a8758 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl, * For all levels sharing cache; connect a sched_domain_shared * instance. */
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- if (sd->flags & SD_SHARE_PKG_RESOURCES) atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
This is based on Dietmar meantioned the 'shared' sched domain, right?
Yes right.
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 489f6d3..485f597 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool +is_sd_overutilized(struct sched_domain *sd) +{
- if (sd)
return sd->shared->overutilized;
- else
return false;
+}
+static void +set_sd_overutilized(struct sched_domain *sd) +{
- if (sd)
sd->shared->overutilized = true;
+}
+static void +clear_sd_overutilized(struct sched_domain *sd) +{
- if (sd)
sd->shared->overutilized = false;
+}
/*
- The enqueue_task method is called before nr_running is
- increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq;
- struct sched_domain *sd; struct sched_entity *se = &p->se; int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) { add_nr_running(rq, 1);
if (!task_new && !rq->rd->overutilized &&
cpu_overutilized(rq->cpu))
rq->rd->overutilized = true;
rcu_read_lock();
sd = rcu_dereference(rq->sd);
if (!task_new && !is_sd_overutilized(sd) &&
cpu_overutilized(rq->cpu))
set_sd_overutilized(sd);
rcu_read_unlock();
}
hrtick_update(rq);
} @@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu) unsigned long max_spare = 0; struct sched_domain *sd;
- rcu_read_lock();
/* The rcu lock is/should be held in the caller function */ sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu) }
unlock:
- rcu_read_unlock();
- if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu)) return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); }
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
return select_energy_cpu_brute(p, prev_cpu);
- rcu_read_lock();
- sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
- if (energy_aware() &&
!is_sd_overutilized(sd)) {
new_cpu = select_energy_cpu_brute(p, prev_cpu);
goto unlock;
- }
- sd = NULL;
- for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) break;
@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f } /* while loop will break here if sd == NULL */ }
+unlock: rcu_read_unlock();
return new_cpu; @@ -7366,6 +7399,7 @@ struct sd_lb_stats { struct sched_group *local; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long total_util; /* Total util of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .local = NULL, .total_load = 0UL, .total_capacity = 0UL,
.total_util = 0UL,
.busiest_stat = {
	.avg_load = 0UL,
	.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group, static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs,
bool *overload, bool *overutilized)
bool *overload, bool *overutilized, bool *misfit_task)
{ unsigned long load; int i, nr_running; @@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, if (!nr_running && idle_cpu(i)) sgs->idle_cpus++;
if (cpu_overutilized(i))
if (cpu_overutilized(i)) { *overutilized = true;
/*
* If the cpu is overutilized and if there is only one
* current task in cfs runqueue, it is potentially a misfit
* task.
*/
if (rq->cfs.h_nr_running == 1)
*misfit_task = true;
Can we also check rq->misfit? E.g. if one big task is enqueued onto rq, the rq->misfit is set but the CPU utilization will take long time to cross 'overutilized', so if check rq->misfit we can quickly get to know there have misfit task on it.
I cannot think of why we cannot check rq->misfit. I will implement it in the next version.
}
}
/* Adjust by relative CPU capacity of the group */
@@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sched_group *sg = env->sd->groups; struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd }
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
&overload, &overutilized);
&overload, &overutilized,
&misfit_task);
if (local_group) goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity;
sds->total_util += sgs->group_util;
sg = sg->next; } while (sg != env->sd->groups);
@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* update overload indicator if we are at root domain */ if (env->dst_rq->rd->overload != overload) env->dst_rq->rd->overload = overload;
/* Update over-utilization (tipping point, U >= 0) indicator */
if (env->dst_rq->rd->overutilized != overutilized)
env->dst_rq->rd->overutilized = overutilized;
- } else {
if (!env->dst_rq->rd->overutilized && overutilized)
env->dst_rq->rd->overutilized = true;
So can we use one patch to remove 'rd->overutilized'?
Yes. When I send out the next version I will include another patch removing rd->overutilized.
}
- if (overutilized)
set_sd_overutilized(env->sd);
- else
clear_sd_overutilized(env->sd);
If only one CPU is overutilized, here is it possible to set 'overutilized' flag for 'shared' sched domain in second level, finally introduce lb between two clusters?
I do not understand your question here?
Regards
Thara
- /*
* If there is a misfit task in one cpu in this sched_domain
* it is likely that the imbalance cannot be sorted out among
* the cpu's in this sched_domain. In this case set the
* overutilized flag at the parent sched_domain.
*/
- if (misfit_task)
set_sd_overutilized(env->sd->parent);
Have same question with upper comment. If 'env->sd' is the second level sched domain, so should use 'env->sd' but not 'env->sd->parent'?
- /* If the domain util is greater that domain capacity, load balancing
* needs to be done at the next sched domain level as well
*/
- if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
set_sd_overutilized(env->sd->parent);
}
/** @@ -8122,8 +8180,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env) */ update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
goto out_balanced;
if (energy_aware()) {
if (!is_sd_overutilized(env->sd))
goto out_balanced;
}
local = &sds.local_stat; busiest = &sds.busiest_stat;
@@ -8981,6 +9041,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock(); for_each_domain(cpu, sd) {
if (energy_aware()) {
if (!is_sd_overutilized(sd))
continue;
}
- /*
- Decay the newidle max times here because this is a regular
- visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9345,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) { struct cfs_rq *cfs_rq; struct sched_entity *se = &curr->se;
struct sched_domain *sd;
for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se);
@@ -9289,8 +9355,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
rq->rd->overutilized = true;
- rcu_read_lock();
- sd = rcu_dereference(rq->sd);
- if (!is_sd_overutilized(sd) &&
cpu_overutilized(task_cpu(curr)))
set_sd_overutilized(sd);
- rcu_read_unlock();
}
/*
2.1.4
--
Regards
Thara
On Thu, Jan 26, 2017 at 10:05:17AM -0500, Thara Gopinath wrote:
[...]
/* Adjust by relative CPU capacity of the group */ @@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sched_group *sg = env->sd->groups; struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd }
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
&overload, &overutilized);
&overload, &overutilized,
&misfit_task);
if (local_group) goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
	/* Now, start updating sd_lb_stats */
	sds->total_load += sgs->group_load;
	sds->total_capacity += sgs->group_capacity;
+	sds->total_util += sgs->group_util;

	sg = sg->next;
} while (sg != env->sd->groups);
@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
	if (!env->sd->parent) {
		/* update overload indicator if we are at root domain */
		if (env->dst_rq->rd->overload != overload)
			env->dst_rq->rd->overload = overload;

		/* Update over-utilization (tipping point, U >= 0) indicator */
		if (env->dst_rq->rd->overutilized != overutilized)
			env->dst_rq->rd->overutilized = overutilized;
	} else {
		if (!env->dst_rq->rd->overutilized && overutilized)
			env->dst_rq->rd->overutilized = true;
So can we use one patch to remove 'rd->overutilized'?
Yes. When I send out the next version i will include another patch removing rd->overutilized.
	}

+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
If only one CPU is overutilized, here is it possible to set 'overutilized' flag for 'shared' sched domain in second level, finally introduce lb between two clusters?
I do not understand your question here ?
E.g., 'env->sd' refers to the sched domain spanning cluster0 (CPUA & CPUB) and cluster1 (CPUC & CPUD); if CPUA is overutilized, then the code above sets the flag on 'env->sd', so in the end lb will happen between cluster0 and cluster1. I think for this case it's better to set the overutilized flag only for the sched domain spanning CPUA and CPUB, so that lb happens within cluster0.
Not sure if this is same idea with you, for more specific I suggest code like below:
	if (env->sd->parent) {
		if (overutilized)
			set_sd_overutilized(env->sd);
		else
			clear_sd_overutilized(env->sd);
	}
Thanks, Leo Yan
On 01/25/2017 06:19 PM, Leo Yan wrote:
Hi Thara,
On Wed, Jan 25, 2017 at 09:17:08AM -0500, Thara Gopinath wrote:
The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is overutilized. This patch introduces an overutilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is overutilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share the flag. When a cpu is overutilized, the flag gets set at the first-level sched_domain. The flag at the parent sched_domain level gets set in either of the two following scenarios.
- There is a misfit task in one of the cpu's in this sched_domain.
- The total utilization of the domain is greater than the domain capacity
The flag is cleared if no cpu in a sched domain is overutilized.
Signed-off-by: Thara Gopinath thara.gopinath@linaro.org
 include/linux/sched.h |   1 +
 kernel/sched/core.c   |   7 ++-
 kernel/sched/fair.c   | 120 +++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c5122e..971842a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1112,6 +1112,7 @@ struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
	int		has_idle_cores;
+	bool		overutilized;
};
struct sched_domain {

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31a466f..e0a8758 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl,
	 * For all levels sharing cache; connect a sched_domain_shared
	 * instance.
	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
This is based on the 'shared' sched domain Dietmar mentioned, right?
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489f6d3..485f597 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)

 static bool cpu_overutilized(int cpu);

+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		return sd->shared->overutilized;
+	else
+		return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->shared->overutilized = false;
+}
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
  * then put the task into the rbtree:
  */
@@ -4744,6 +4768,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
	struct sched_entity *se = &p->se;
	int task_new = !(flags & ENQUEUE_WAKEUP);

@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

	if (!se) {
		add_nr_running(rq, 1);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu))
-			rq->rd->overutilized = true;
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new && !is_sd_overutilized(sd) &&
+		    cpu_overutilized(rq->cpu))
+			set_sd_overutilized(sd);
+		rcu_read_unlock();
	}

	hrtick_update(rq);
 }

@@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
	unsigned long max_spare = 0;
	struct sched_domain *sd;

-	rcu_read_lock();
-
+	/* The rcu lock is/should be held in the caller function */
	sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
	if (!sd)

@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
	}

unlock:
-	rcu_read_unlock();
-
	if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
		return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
	}

-	if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
-		return select_energy_cpu_brute(p, prev_cpu);
-
	rcu_read_lock();
+	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+	if (energy_aware() &&
+	    !is_sd_overutilized(sd)) {
+		new_cpu = select_energy_cpu_brute(p, prev_cpu);
+		goto unlock;
+	}
+
	sd = NULL;
	for_each_domain(cpu, tmp) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			break;

@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
		}
		/* while loop will break here if sd == NULL */
	}

+unlock:
	rcu_read_unlock();
	return new_cpu;

@@ -7366,6 +7399,7 @@ struct sd_lb_stats {
	struct sched_group *local;	/* Local group in this sd */
	unsigned long total_load;	/* Total load of all groups in sd */
	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
	unsigned long avg_load;	/* Average load across all groups in sd */

	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */

@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
		.local = NULL,
		.total_load = 0UL,
		.total_capacity = 0UL,
+		.total_util = 0UL,
		.busiest_stat = {
			.avg_load = 0UL,
			.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
			struct sched_group *group, int load_idx,
			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *overutilized, bool *misfit_task)
 {
	unsigned long load;
	int i, nr_running;

@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
		if (!nr_running && idle_cpu(i))
			sgs->idle_cpus++;

-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a
+			 * misfit task.
+			 */
+			if (rq->cfs.h_nr_running == 1)
+				*misfit_task = true;
+		}
Can we also check rq->misfit? E.g., if one big task is enqueued onto the rq, rq->misfit is set, but the CPU utilization will take a long time to cross the 'overutilized' threshold; by checking rq->misfit we can quickly learn that there is a misfit task on it.
Hi Leo,
I am reworking the patch to post a V2. I am using the 4.9-rc6 kernel and there is no pointer called "misfit" in rq. Which kernel version are you on?
Regards Thara
}
}
/* Adjust by relative CPU capacity of the group */
@@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
	struct sched_group *sg = env->sd->groups;
	struct sg_lb_stats tmp_sgs;
	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, overutilized = false, misfit_task = false;

	if (child && child->flags & SD_PREFER_SIBLING)
		prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
	}

	update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-			   &overload, &overutilized);
+			   &overload, &overutilized,
+			   &misfit_task);

	if (local_group)
		goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
	/* Now, start updating sd_lb_stats */
	sds->total_load += sgs->group_load;
	sds->total_capacity += sgs->group_capacity;
+	sds->total_util += sgs->group_util;

	sg = sg->next;
} while (sg != env->sd->groups);
@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
	if (!env->sd->parent) {
		/* update overload indicator if we are at root domain */
		if (env->dst_rq->rd->overload != overload)
			env->dst_rq->rd->overload = overload;

		/* Update over-utilization (tipping point, U >= 0) indicator */
		if (env->dst_rq->rd->overutilized != overutilized)
			env->dst_rq->rd->overutilized = overutilized;
	} else {
		if (!env->dst_rq->rd->overutilized && overutilized)
			env->dst_rq->rd->overutilized = true;
So can we use one patch to remove 'rd->overutilized'?
	}

+	if (overutilized)
+		set_sd_overutilized(env->sd);
+	else
+		clear_sd_overutilized(env->sd);
If only one CPU is overutilized, here is it possible to set 'overutilized' flag for 'shared' sched domain in second level, finally introduce lb between two clusters?
+	/*
+	 * If there is a misfit task in one cpu in this sched_domain
+	 * it is likely that the imbalance cannot be sorted out among
+	 * the cpu's in this sched_domain. In this case set the
+	 * overutilized flag at the parent sched_domain.
+	 */
+	if (misfit_task)
+		set_sd_overutilized(env->sd->parent);
Have same question with upper comment. If 'env->sd' is the second level sched domain, so should use 'env->sd' but not 'env->sd->parent'?
+	/*
+	 * If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+		set_sd_overutilized(env->sd->parent);
 }
@@ -8122,8 +8180,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
	 */
	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}

	local = &sds.local_stat;
	busiest = &sds.busiest_stat;
@@ -8981,6 +9041,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

	rcu_read_lock();
	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
+
		/*
		 * Decay the newidle max times here because this is a regular
		 * visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9345,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);

@@ -9289,8 +9355,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);

-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
-		rq->rd->overutilized = true;
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+	if (!is_sd_overutilized(sd) &&
+	    cpu_overutilized(task_cpu(curr)))
+		set_sd_overutilized(sd);
+	rcu_read_unlock();
}
/*
2.1.4
-- Regards Thara
On Wed, Feb 15, 2017 at 02:07:05PM -0500, Thara Gopinath wrote:
[...]
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
			struct sched_group *group, int load_idx,
			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *overutilized, bool *misfit_task)
 {
	unsigned long load;
	int i, nr_running;

@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
		if (!nr_running && idle_cpu(i))
			sgs->idle_cpus++;

-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a
+			 * misfit task.
+			 */
+			if (rq->cfs.h_nr_running == 1)
+				*misfit_task = true;
+		}
Can we also check rq->misfit? E.g., if one big task is enqueued onto the rq, rq->misfit is set, but the CPU utilization will take a long time to cross the 'overutilized' threshold; by checking rq->misfit we can quickly learn that there is a misfit task on it.
Hi Leo,
I am reworking the patch to post a V2. I am using the 4.9-rc6 kernel and there is no pointer called "misfit" in rq. Which kernel version are you on?
Usually I work on kernel v4.4, the Android common kernel is: https://android.googlesource.com/kernel/common/+/android-4.4
I can verify the patch for kernel v4.4 on Juno with ARM LT's branch: https://git.linaro.org/landing-teams/working/arm/kernel-release.git/log/?h=l...
Basically they are the same code base for EAS development.
Thanks, Leo Yan
On Wed, Jan 25, 2017 at 09:17:08AM -0500, Thara Gopinath wrote:
@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
		if (!nr_running && idle_cpu(i))
			sgs->idle_cpus++;

-		if (cpu_overutilized(i))
+		if (cpu_overutilized(i)) {
			*overutilized = true;
+			/*
+			 * If the cpu is overutilized and if there is only one
+			 * current task in cfs runqueue, it is potentially a
+			 * misfit task.
+			 */
+			if (rq->cfs.h_nr_running == 1)
+				*misfit_task = true;
+		}
I guess your assumption here is that if h_nr_running > 1, then group utilization compared to group capacity will set the balance flag on the parent domain?
IIUC, there may be a few corner cases that might slip through and not be balanced. For example if you have a domain with n cpus and n+1 70% utilization tasks. One cpu will have to handle two tasks and therefore be overutilized. It is likely to have h_nr_running == 2 and group utilizations will be below the threshold so we don't set the flag on the parent domain. So even if there are more cpus in the system at higher level domain, they won't bother checking if something should be pulled.
I'm not sure if the example is purely academic or it would be a real problem. I think it is hard, if not impossible, to do 'right' without going through all the tasks on the rqs which is too expensive (requires rq locking etc.).
It definitely deserves a comment somewhere :-)