Hi Morten,
On 12/16/2016 04:18 AM, Morten Rasmussen wrote:
On Thu, Dec 15, 2016 at 03:37:12PM -0500, Thara Gopinath wrote:
Hello Morten,
Thanks for the review. Sorry for the delay in reply.
No problem.
On 12/09/2016 12:06 PM, Morten Rasmussen wrote:
On Thu, Dec 08, 2016 at 11:42:50AM -0500, Thara Gopinath wrote:
On 12/08/2016 05:24 AM, Leo Yan wrote:
On Wed, Dec 07, 2016 at 05:22:37PM -0500, Thara Gopinath wrote:
+	/* If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin) {
+		/* If already at the highest domain nothing can be done */
+		if (env->sd->parent)
+			set_sd_overutilized(env->sd->parent,
+					    env->dst_rq->rd);
So usually this will only set the root domain's flag after the whole sched domain's util is greater than the domain capacity. If a CPU has one "misfit" task, the scheduler will not reach this condition, so the root domain's flag will not be set, which delays migrating the task.
Hmm, yes. You are correct. We may have to handle misfit tasks separately.
I don't quite understand why we have to set the flag on the parent sched_domain here. It should be set anyway when update_sd_lb_stats() is called for the parent sched_domain in the next iteration of for_each_domain() in rebalance_domains(). But I see that we bail out early there, which I don't quite understand either. More on that in a separate reply ;-)
We bail out in rebalance_domains() if no group is overloaded in a sched domain. Setting the flag here ensures that the domain is not skipped in rebalance_domains().
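A toy model may make this skip logic easier to follow. This is not kernel code: the level array, helper names, and unconditional escalation rule below are invented for illustration. It only shows how setting the parent's flag from the child level keeps the next for_each_domain() iteration from being skipped.

```c
#include <assert.h>

/*
 * Toy model (not kernel code) of the skip behaviour discussed here:
 * each domain level is only balanced if its overutilized flag is set,
 * so setting the parent's flag from the child level keeps the parent
 * iteration from bailing out early. All names are invented.
 */
#define MAX_LEVELS 3

static int overutilized[MAX_LEVELS];

static void set_overutilized(int level)
{
	if (level < MAX_LEVELS)
		overutilized[level] = 1;
}

/* Returns how many domain levels were actually balanced. */
static int rebalance_domains_model(void)
{
	int level, balanced = 0;

	for (level = 0; level < MAX_LEVELS; level++) {
		if (!overutilized[level])
			continue; /* bail out early: nothing to do here */
		balanced++;
		/* In this toy model the child always escalates to the parent. */
		set_overutilized(level + 1);
	}
	return balanced;
}
```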
Okay. I think the idea of setting a flag on those domains that can safely be skipped is good; my concern is the conditions for setting it. Comparing total_util to total_capacity won't reveal 'misfit' tasks on big.LITTLE, as Leo pointed out. I think your proposal should work fine on SMP platforms, but for big.LITTLE we need to set the flag in additional cases to deal with the 'misfit' tasks.
Yes I agree with this one. I will try to incorporate this in the next version.
I agree with Leo that total_util doesn't flag some quite important cases as overutilized.
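To illustrate the gap Leo describes, here is a standalone sketch (not kernel code; the struct and helper names are made up, and the 1024/capacity_margin convention is borrowed from the kernel) in which the domain-wide utilization sum looks fine while one little CPU holds a task it cannot fit:

```c
#include <assert.h>

/*
 * Standalone model (not kernel code): why a domain-wide utilization sum
 * misses "misfit" tasks on big.LITTLE. Capacities follow the kernel
 * convention of SCHED_CAPACITY_SCALE = 1024; the names are invented.
 */
#define CAPACITY_MARGIN 1280 /* ~25% headroom, like capacity_margin */

struct cpu_stat {
	unsigned long capacity; /* max capacity of this CPU */
	unsigned long util;     /* current utilization of this CPU */
};

/* Domain-wide check, as in the patch under discussion. */
static int domain_overutilized(const struct cpu_stat *cpus, int n)
{
	unsigned long total_cap = 0, total_util = 0;
	int i;

	for (i = 0; i < n; i++) {
		total_cap += cpus[i].capacity;
		total_util += cpus[i].util;
	}
	return total_cap * 1024 < total_util * CAPACITY_MARGIN;
}

/* Per-CPU check: flags a single task that does not fit its CPU. */
static int has_misfit_cpu(const struct cpu_stat *cpus, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (cpus[i].capacity * 1024 < cpus[i].util * CAPACITY_MARGIN)
			return 1;
	return 0;
}
```

With two idle big CPUs (capacity 1024) and a little CPU (capacity 512) running a util-500 task, the domain sum passes while the per-CPU check correctly flags the misfit.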
+	}
 }
@@ -7932,8 +7983,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	/* Is this check really required here?? Revisit */
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd, env->dst_rq->rd))
+			goto out_balanced;
+	}

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
@@ -8000,6 +8054,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
+
+	/* Is this the correct place to clear this flag? Should access
+	 * to the flag be locked? Revisit.
+	 */
+	clear_sd_overutilized(env->sd, env->dst_rq->rd);
+
Is there a specific sequence for clearing the overutilized flag for the root domain and SD level 2? For example, first clear the root domain flag and then clear the SD level 2 flag?
We don't need a specific sequence. As and when each sched domain gets balanced, the corresponding flag will get cleared.
IIUC, it isn't quite right to clear the flag here. The sched_domain may have more than two groups; for example, a 4+4 big.LITTLE topology has four groups at the lowest level. Load-balance may resolve over-utilization in one sched_group, but one or more other groups might still be over-utilized.
I don't understand this. IIUC, load balancing happens between the sched groups of a domain. Irrespective of how many groups there are, a call to load_balance() will balance the load between all the groups in the domain (affinity being an exception). Am I missing something here?
You are right that load_balance() will eventually balance between all groups, but only after it has been called by a cpu in each group (minus one to be precise). load_balance() is pull-based, when it is called it will only pull tasks to the destination cpu, which is the cpu where the function call happens in most cases (nohz_idle_balance() is the exception). So you need to try to pull in all directions before things are balanced.
My point is that pulling in one direction between two groups doesn't guarantee that all over-utilization has been resolved. For example, if two groups are over-utilized, it will take two calls to load_balance() from non-over-utilized cpu(s) to resolve the problem, as each call only pulls from one group. Clearing the over-utilized flag after the first load_balance() is therefore somewhat misleading.
They won't be balanced this time anyway, and the flag might be re-enabled at the next load balance, so it may not be a problem; it just makes the behaviour a bit harder to understand.
Even in the case where we only have two sched_groups, load balancing isn't guaranteed to result in no groups being over-utilized. The source group may be so heavily loaded that it over-utilizes the destination group too after the load has been balanced.
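A small standalone model (not kernel code; the group count, capacities, utilization numbers, and function names are invented) of the pull-based behaviour described above: with two over-utilized groups, one load_balance()-style pull towards an idle group resolves only one of them, so a second call is needed:

```c
#include <assert.h>

/*
 * Toy model (not kernel code): each load_balance() call pulls tasks
 * from only the busiest other group towards the destination group, so
 * resolving two over-utilized groups takes two calls. All numbers and
 * names are invented for this sketch.
 */
#define NGROUPS 3
#define GROUP_CAPACITY 100

static int group_util[NGROUPS] = { 150, 150, 0 }; /* two over-utilized */

static int group_overutilized(int g)
{
	return group_util[g] > GROUP_CAPACITY;
}

/* One pull towards @dst: take the excess from the busiest other group. */
static void load_balance_pull(int dst)
{
	int g, busiest = -1;

	for (g = 0; g < NGROUPS; g++)
		if (g != dst &&
		    (busiest < 0 || group_util[g] > group_util[busiest]))
			busiest = g;

	if (group_util[busiest] > GROUP_CAPACITY) {
		int excess = group_util[busiest] - GROUP_CAPACITY;

		group_util[busiest] -= excess;
		group_util[dst] += excess;
	}
}

static int any_group_overutilized(void)
{
	int g;

	for (g = 0; g < NGROUPS; g++)
		if (group_overutilized(g))
			return 1;
	return 0;
}
```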
This I agree with. This is not the correct place. I can move this under out_balanced in the load_balance() function. Any other recommendation on where this flag can be conclusively cleared?
If we want to keep the flag up until all over-utilization has been addressed, we can only clear it if all child domains have cleared theirs.
Maybe you can count the number of over-utilized child domains in update_sd_lb_stats(). If it is >1 then we know that we didn't fix all imbalances this time and we should leave the flag up. Would that work?
I have been thinking about this. I think clearing the flag in update_sd_lb_stats() if the group is not overutilized should be okay for now. If the group is overutilized, the flag will remain set and the next cpu in the group can attempt a load balance. If the group is not overutilized, we bail out anyway in find_busiest_group(). Do you think that makes sense?
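As a sketch of the counting idea above, here is a standalone model (not kernel code; the names, capacities, and the exact clearing threshold are assumptions for illustration) that counts over-utilized groups while gathering the domain stats and leaves the flag up when more than one remains, since a single pull cannot fix several groups at once:

```c
#include <assert.h>

/*
 * Toy model (not kernel code) of the suggestion in this thread: count
 * the over-utilized child groups while gathering domain statistics and
 * only clear the domain's flag when at most one group is still
 * over-utilized. All names are invented for this sketch.
 */
#define GROUP_CAPACITY 100

static int sd_overutilized_flag = 1;

/* Stand-in for the counting proposed for update_sd_lb_stats(). */
static int count_overutilized_groups(const int *group_util, int n)
{
	int g, count = 0;

	for (g = 0; g < n; g++)
		if (group_util[g] > GROUP_CAPACITY)
			count++;
	return count;
}

static void maybe_clear_sd_overutilized(const int *group_util, int n)
{
	/* More than one over-utilized group: this balance can't fix all. */
	if (count_overutilized_groups(group_util, n) <= 1)
		sd_overutilized_flag = 0;
}
```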
Morten
--
Regards
Thara