On 28 June 2016 at 17:52, Leo Yan <leo.yan@linaro.org> wrote:
On Tue, Jun 28, 2016 at 04:49:29PM +0200, Vincent Guittot wrote:
On 28 June 2016 at 15:58, Leo Yan <leo.yan@linaro.org> wrote:
On Mon, Jun 27, 2016 at 10:56:57PM +0200, Vincent Guittot wrote:
On 23 June 2016 at 15:43, Leo Yan <leo.yan@linaro.org> wrote:
Current code calculates avg_load as below:

	sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity;
Let's look at the scenario below for cluster-level average load calculation:
The little cluster has 4 CPUs and 2 tasks running with 100% utilization (nice = 0); if the little cluster's per-CPU capacity is 400, then the little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.
On the other hand, the big cluster has 4 CPUs and 4 tasks running with 100% utilization (nice = 0); if the big cluster's per-CPU capacity is 1024, then the big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.
So finally the scheduler considers the little cluster to have the higher load, which obviously doesn't make sense: the big cluster has all 4 CPUs running, while the little cluster actually has 2 CPUs idle.
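The same arithmetic as a throwaway userspace sketch, for anyone who wants to check the numbers (avg_load() here is just a local helper mirroring the formula above, not the kernel function):

	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE	1024UL

	/* group_load scaled by group capacity, as in the formula above */
	static unsigned long avg_load(unsigned long group_load,
				      unsigned long group_capacity)
	{
		return group_load * SCHED_CAPACITY_SCALE / group_capacity;
	}

	int main(void)
	{
		/* little: 2 tasks * 1024 load, 4 CPUs * 400 capacity -> 1310 */
		printf("little avg_load = %lu\n", avg_load(2 * 1024, 4 * 400));
		/* big: 4 tasks * 1024 load, 4 CPUs * 1024 capacity -> 1024 */
		printf("big avg_load    = %lu\n", avg_load(4 * 1024, 4 * 1024));
		return 0;
	}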
This makes perfect sense: it's all about how much compute capacity has to be shared by the load, so even if 2 little CPUs are idle, the tasks on the big cluster have more compute capacity available than those on the little cluster.
I'm not sure if I have done enough homework for this :) My understanding of the load value is that it is the task's requirement for CPU compute capacity, with the range [0..1024]. So when it reaches 1024
The load value is an enhancement of the task's weight that was used before in the scheduler. The task's weight is now scaled by the runnable time of the task, so a task with a short runnable time will not make a CPU look heavily loaded just because of this short high-prio task. This load is then used to ensure a fair share of the CPUs' compute capacity between the tasks. The range is not [0..1024] but [0..88761].
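A simplified illustration, ignoring PELT's geometric decay (task_load_sketch() is a made-up helper, not kernel code; 1024 and 88761 are the weights of a nice 0 and a nice -20 task):

	/* roughly: load ~= weight * (runnable time / total time) */
	static unsigned long task_load_sketch(unsigned long weight,
					      unsigned long runnable_pct)
	{
		return weight * runnable_pct / 100;
	}

	/*
	 * task_load_sketch(1024, 50)   -> 512    nice 0, runnable half the time
	 * task_load_sketch(88761, 100) -> 88761  nice -20, always runnable:
	 *                                        the upper bound quoted above
	 */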
I will digest this info by reading the code offline. But I recognize that I wrongly understood "load" when I connected your description with an earlier email from Yuyang [1].
[1] http://article.gmane.org/gmane.linux.kernel/2036154
it means all CPU capacity has been consumed. avg_load is the average
Only the utilization of the CPU means that all CPU capacity has been consumed, not the load, which is used for fair time sharing between tasks.
value across all CPUs in the sched_group.
And avg_load is a signal for CPU capacity consumption but not for a task. Please feel free to correct me if I misunderstand this.
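If I now read the code right, the two signals live side by side per entity; this excerpt paraphrases the two relevant fields of struct sched_avg (the field names are real, the struct name and comments here are mine):

	struct sched_avg_excerpt {
		unsigned long load_avg;	/* weight-scaled fair-share signal; a
					   nice -20 always-running task
					   contributes up to 88761 */
		unsigned long util_avg;	/* capacity-scaled; saturates at
					   SCHED_CAPACITY_SCALE (1024) */
	};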
The example I gave is not quite typical. Let's look at a more practical example:
Big cluster has 4 CPUs with 6 runnable tasks, group_load = 3547, so big cluster avg_load = (3547 * 1024) / 4096 = 886;
Little cluster has 4 CPUs with 2 runnable tasks, group_load = 1639, so little cluster avg_load = (1639 * 1024) / 1603 = 1046.
So how do we calculate the imbalance load between these two clusters? In the current code, even though the big cluster is overloaded, it will not migrate tasks to the little cluster:
In this case we use the fact that the big cluster is overloaded but the little cluster is not, plus the load_per_task value. In the mainline kernel this is not handled correctly, but I thought this was handled by the EAS code; there is the code below in calculate_imbalance() that matches your condition:
	/*
	 * Busiest group is overloaded, local is not, use the spare
	 * cycles to maximize throughput
	 */
	if (busiest->group_type == group_overloaded &&
	    local->group_type <= group_misfit_task) {
		env->imbalance = busiest->load_per_task;
		return;
	}
But this code will _NOT_ be reached in the flow below:
	find_busiest_group() {
		[...]
		if (local->avg_load >= busiest->avg_load)	/* --> will directly return */
			goto out_balanced;
		[...]

	force_balance:
		env->busiest_group_type = busiest->group_type;
		calculate_imbalance(env, &sds);
		return sds.busiest;

	out_balanced:
		env->imbalance = 0;
		return NULL;
	}
So I think you are suggesting something like the code below:
	find_busiest_group() {
		[...]
		/*
		 * Force balance when busiest group is overloaded and
		 * local group is not imbalanced or overloaded.
		 */
		if (energy_aware() &&
		    (busiest->group_type == group_overloaded &&
		     local->group_type <= group_misfit_task))
			goto force_balance;
There is already something quite similar in find_busiest_group():

	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;
Replacing env->idle == CPU_NEWLY_IDLE with env->idle == CPU_IDLE should do the job.
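Spelled out as a diff against the hunk above (the diff is mine, just restating the sentence):

	-	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
	+	if (env->idle == CPU_IDLE && group_has_capacity(env, local) &&
		    busiest->group_no_capacity)
			goto force_balance;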
		if (local->avg_load >= busiest->avg_load)
			goto out_balanced;
		[...]

	force_balance:
		env->busiest_group_type = busiest->group_type;
		calculate_imbalance(env, &sds);
		return sds.busiest;

	out_balanced:
		env->imbalance = 0;
		return NULL;
	}
Thanks, Leo Yan