On 23/06/17 15:37, Thara Gopinath wrote:
The current implementation of overutilization aborts energy aware scheduling if any cpu in the system is overutilized. This patch introduces an overutilization flag per sched domain level instead of a single system-wide flag. Load balancing is done at the sched domain where any of the cpus is overutilized. If energy aware scheduling is enabled and no cpu in a sched domain is overutilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure that is common across all the sched domains at a level. The new flag is placed in this structure so that all the sched domains at the same level share the flag. In case of an overutilized cpu, the flag gets set at the level 1 sched_domain. The flag at the parent sched_domain level gets set in either of the following two scenarios:
- There is a misfit task on one of the cpus in this sched_domain.
- The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
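The set and clear rules above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the struct and function names (sd_model, sd_shared_model, mark_cpu_overutilized, update_parent_flag, clear_flag) are hypothetical and are not taken from the patch, and capacities are plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the structure shared by all domains at one level. */
struct sd_shared_model {
	bool overutilized;		/* one flag per sched domain level */
};

/* Hypothetical model of one cpu's view of a sched domain. */
struct sd_model {
	struct sd_model *parent;	/* NULL at the top level */
	struct sd_shared_model *shared;	/* common to all domains at this level */
	unsigned long capacity;		/* summed cpu capacity of the domain */
};

static bool sd_overutilized(const struct sd_model *sd)
{
	assert(sd != NULL && sd->shared != NULL);
	return sd->shared->overutilized;
}

/* A cpu is overutilized: the flag is set at the level 1 domain. */
static void mark_cpu_overutilized(struct sd_model *level1)
{
	assert(level1 != NULL && level1->shared != NULL);
	level1->shared->overutilized = true;
}

/*
 * Parent rule from the description: set the parent's flag if this domain
 * holds a misfit task or its total utilization exceeds its capacity.
 */
static void update_parent_flag(struct sd_model *sd, bool has_misfit,
			       unsigned long total_util)
{
	assert(sd != NULL && sd->parent != NULL && sd->parent->shared != NULL);
	if (has_misfit || total_util > sd->capacity)
		sd->parent->shared->overutilized = true;
}

/* No cpu in the domain is overutilized any more: clear the flag. */
static void clear_flag(struct sd_model *sd)
{
	assert(sd != NULL && sd->shared != NULL);
	sd->shared->overutilized = false;
}
```

In this model, two sd_model instances pointing at the same sd_shared_model see each other's flag updates, which is the property the shared structure is meant to provide.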
This implementation can still have corner cases with respect to misfit tasks. For example, consider a sched group with n cpus and n+1 tasks, each 70% utilized. Ideally, this is a case for load balancing to happen in a parent sched domain. But neither is the total group utilization high enough to trigger a load balance in the parent domain, nor is there a cpu with a single overutilized task that would trigger one. Then again, this could be a purely academic scenario, as these tasks will be placed more appropriately during task wakeup.
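The arithmetic behind this corner case can be checked with a short sketch. The assumptions here are mine, not the patch's: every cpu has capacity 1024, a "70% utilized" task contributes 70% of that, and both parent conditions are compared without any capacity margin. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define CPU_CAPACITY 1024UL	/* assumed: all cpus have full capacity */

/* Utilization of a task that keeps one cpu busy pct percent of the time. */
static unsigned long task_util(unsigned int pct)
{
	assert(pct <= 100);
	return CPU_CAPACITY * pct / 100;
}

/* Parent condition: total group utilization exceeds group capacity. */
static bool group_over_capacity(unsigned int ncpus, unsigned int ntasks,
				unsigned int pct)
{
	return ntasks * task_util(pct) > ncpus * CPU_CAPACITY;
}

/* Parent condition: a single task alone does not fit on one cpu. */
static bool has_misfit_task(unsigned int pct)
{
	return task_util(pct) > CPU_CAPACITY;
}

/* Pigeonhole: with more tasks than cpus, some cpu runs at least two. */
static unsigned long busiest_cpu_demand(unsigned int ncpus,
					unsigned int ntasks, unsigned int pct)
{
	unsigned int per_cpu = (ntasks + ncpus - 1) / ncpus;

	return per_cpu * task_util(pct);
}
```

With n = 4 and five such tasks, the group demand is 3580 against a capacity of 4096 and no single task is a misfit, so neither parent condition holds, yet one cpu is asked for 1432 against its 1024. In this simplified model the gap only shows up from n = 3; with n = 2 the three tasks already exceed the group capacity.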
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
V2->V3:
- Rebased on latest kernel.
- The previous check for misfit task is replaced with the newly introduced rq->misfit_task flag.
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is triggered in a parent sched domain with asymmetric cpu capacities.
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 137 +++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |   8 +--
 4 files changed, 117 insertions(+), 32 deletions(-)
[...]
@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd))
+			goto out_balanced;
+	}

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		if (energy_aware()) {
+			if (!is_sd_overutilized(sd))
+				continue;
+		}
While doing the integration of the sd over-utilization thing into the next EAS version I realized that this continue statement is not in the version in which we use rd->overutilized.
Not sure if we already talked about why you added it here? Eventually, the 'goto out_balanced' would trigger in load_balance() -> find_busiest_group().
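To make the question concrete, here is a rough userspace model of the two paths. It assumes energy_aware() is true and leaves out everything else the real loop does; dom_model, find_busiest_group_model and rebalance_domains_model are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sched domain in the loop. */
struct dom_model {
	bool overutilized;
	int balance_attempts;	/* times a busiest group was actually picked */
	int fbg_calls;		/* times the find_busiest_group model ran */
};

/* Models the patched find_busiest_group(): bail out if not overutilized. */
static bool find_busiest_group_model(struct dom_model *d)
{
	assert(d != NULL);
	d->fbg_calls++;
	if (!d->overutilized)
		return false;	/* the 'goto out_balanced' path */
	d->balance_attempts++;
	return true;
}

/* Models the for_each_domain() loop, with or without the early 'continue'. */
static void rebalance_domains_model(struct dom_model *doms, size_t n,
				    bool early_continue)
{
	assert(doms != NULL);
	for (size_t i = 0; i < n; i++) {
		if (early_continue && !doms[i].overutilized)
			continue;
		find_busiest_group_model(&doms[i]);
	}
}
```

In this model both variants balance exactly the same domains; the early continue only avoids the call into the find_busiest_group model for domains that are not overutilized. Whether the skipped work matters in the real rebalance_domains() is the open question.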
[...]