Hi Morten,
On 12/16/2016 04:18 AM, Morten Rasmussen wrote:
On Thu, Dec 15, 2016 at 03:37:12PM -0500, Thara Gopinath wrote:
Hello Morten,
Thanks for the review. Sorry for the delay in reply.
No problem.
On 12/09/2016 12:06 PM, Morten Rasmussen wrote:
On Thu, Dec 08, 2016 at 11:42:50AM -0500, Thara Gopinath wrote:
On 12/08/2016 05:24 AM, Leo Yan wrote:
On Wed, Dec 07, 2016 at 05:22:37PM -0500, Thara Gopinath wrote:
+	/* If the domain util is greater than the domain capacity, load
+	 * balancing needs to be done at the next sched domain level as well.
+	 */
+	if (sds->total_capacity * 1024 < sds->total_util * capacity_margin) {
+		/* If already at the highest domain nothing can be done */
+		if (env->sd->parent)
+			set_sd_overutilized(env->sd->parent,
+					    env->dst_rq->rd);
So usually this will only set the root domain's flag after the whole sched domain's util is greater than the domain capacity. If a CPU has one "misfit" task, the scheduler will not reach this condition, so the root domain's flag will not be set, which delays migrating the task.
Hmm, yes. You are correct. We may have to handle misfit tasks separately.
I don't quite understand why we have to set the flag on the parent sched_domain here. It should be set anyway when update_sd_lb_stats() is called for the parent sched_domain in the next iteration of for_each_domain() in rebalance_domains(). But I see that we bail out early there, which I don't quite understand either. More on that in a separate reply ;-)
We bail out in rebalance_domains() if no group is overloaded in a sched domain. Setting the flag here ensures that the domain is not skipped in rebalance_domains().
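A toy model may make this skip logic easier to follow. This is not kernel code: the level array, helper names, and unconditional escalation rule below are invented for illustration. It only shows how setting the parent's flag from the child level keeps the next for_each_domain() iteration from being skipped.

```c
#include <assert.h>

/*
 * Toy model (not kernel code) of the skip behaviour discussed here:
 * each domain level is only balanced if its overutilized flag is set,
 * so setting the parent's flag from the child level keeps the parent
 * iteration from bailing out early. All names are invented.
 */
#define MAX_LEVELS 3

static int overutilized[MAX_LEVELS];

static void set_overutilized(int level)
{
	if (level < MAX_LEVELS)
		overutilized[level] = 1;
}

/* Returns how many domain levels were actually balanced. */
static int rebalance_domains_model(void)
{
	int level, balanced = 0;

	for (level = 0; level < MAX_LEVELS; level++) {
		if (!overutilized[level])
			continue; /* bail out early: nothing to do here */
		balanced++;
		/* In this toy model the child always escalates to the parent. */
		set_overutilized(level + 1);
	}
	return balanced;
}
```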
Okay. I think the idea of setting a flag on those domains that can safely be skipped is good; my concern is the conditions for setting it. Comparing total_util to total_capacity won't reveal 'misfit' tasks on big.LITTLE, as Leo pointed out. I think your proposal should work fine on SMP platforms, but for big.LITTLE we need to set the flag in additional cases to deal with the 'misfit' tasks.
Yes I agree with this one. I will try to incorporate this in the next version.
I agree with Leo that total_util doesn't flag some quite important cases as overutilized.
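To illustrate the gap Leo describes, here is a standalone sketch (not kernel code; the struct and helper names are made up, and the 1024/capacity_margin convention is borrowed from the kernel) in which the domain-wide utilization sum looks fine while one little CPU holds a task it cannot fit:

```c
#include <assert.h>

/*
 * Standalone model (not kernel code): why a domain-wide utilization sum
 * misses "misfit" tasks on big.LITTLE. Capacities follow the kernel
 * convention of SCHED_CAPACITY_SCALE = 1024; the names are invented.
 */
#define CAPACITY_MARGIN 1280 /* ~25% headroom, like capacity_margin */

struct cpu_stat {
	unsigned long capacity; /* max capacity of this CPU */
	unsigned long util;     /* current utilization of this CPU */
};

/* Domain-wide check, as in the patch under discussion. */
static int domain_overutilized(const struct cpu_stat *cpus, int n)
{
	unsigned long total_cap = 0, total_util = 0;
	int i;

	for (i = 0; i < n; i++) {
		total_cap += cpus[i].capacity;
		total_util += cpus[i].util;
	}
	return total_cap * 1024 < total_util * CAPACITY_MARGIN;
}

/* Per-CPU check: flags a single task that does not fit its CPU. */
static int has_misfit_cpu(const struct cpu_stat *cpus, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (cpus[i].capacity * 1024 < cpus[i].util * CAPACITY_MARGIN)
			return 1;
	return 0;
}
```

With two idle big CPUs (capacity 1024) and a little CPU (capacity 512) running a util-500 task, the domain sum passes while the per-CPU check correctly flags the misfit.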
+	}
 }
@@ -7932,8 +7983,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);

-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	/* Is this check really required here?? Revisit */
+	if (energy_aware()) {
+		if (!is_sd_overutilized(env->sd, env->dst_rq->rd))
+			goto out_balanced;
+	}

 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
@@ -8000,6 +8054,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
+
+	/* Is this the correct place to clear this flag? Should access
+	 * to the flag be locked? Revisit.
+	 */
+	clear_sd_overutilized(env->sd, env->dst_rq->rd);
+
Is there a specific sequence for clearing the overutilized flag for the root domain and SD level 2? For example, first clear the root domain flag and then clear the SD level 2 flag?
We don't need a specific sequence. As and when each sched domain gets balanced, the corresponding flag will get cleared.
IIUC, it isn't quite right to clear the flag here. The sched_domain may have more than two groups; for example, a 4+4 big.LITTLE topology has four groups at the lowest level. Load-balance may resolve over-utilization in one sched_group, but one or more other groups might still be over-utilized.
I don't understand this. IIUC, load balancing happens between the sched groups of a domain. Irrespective of how many groups there are, a call to load_balance() will balance the load between all the groups in the domain (affinity being an exception). Am I missing something here?
You are right that load_balance() will eventually balance between all groups, but only after it has been called by a cpu in each group (minus one to be precise). load_balance() is pull-based, when it is called it will only pull tasks to the destination cpu, which is the cpu where the function call happens in most cases (nohz_idle_balance() is the exception). So you need to try to pull in all directions before things are balanced.
My point is that pulling in one direction between two groups doesn't guarantee that all over-utilization has been resolved. For example, if two groups are over-utilized, it will take two calls to load_balance() from non-over-utilized cpu(s) to resolve the problem, as each call only pulls from one group. Clearing the over-utilized flag after the first load_balance() is therefore somewhat misleading.
They won't be balanced this time anyway, and the flag might be re-enabled at the next load balance, so it may not be a problem; it just makes the behaviour a bit harder to understand.
Even in the case where we only have two sched_groups, load balancing isn't guaranteed to result in no groups being over-utilized. The source group may be so heavily loaded that it over-utilizes the destination group too after the load has been balanced.
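A small standalone model (not kernel code; the group count, capacities, utilization numbers, and function names are invented) of the pull-based behaviour described above: with two over-utilized groups, one load_balance()-style pull towards an idle group resolves only one of them, so a second call is needed:

```c
#include <assert.h>

/*
 * Toy model (not kernel code): each load_balance() call pulls tasks
 * from only the busiest other group towards the destination group, so
 * resolving two over-utilized groups takes two calls. All numbers and
 * names are invented for this sketch.
 */
#define NGROUPS 3
#define GROUP_CAPACITY 100

static int group_util[NGROUPS] = { 150, 150, 0 }; /* two over-utilized */

static int group_overutilized(int g)
{
	return group_util[g] > GROUP_CAPACITY;
}

/* One pull towards @dst: take the excess from the busiest other group. */
static void load_balance_pull(int dst)
{
	int g, busiest = -1;

	for (g = 0; g < NGROUPS; g++)
		if (g != dst &&
		    (busiest < 0 || group_util[g] > group_util[busiest]))
			busiest = g;

	if (group_util[busiest] > GROUP_CAPACITY) {
		int excess = group_util[busiest] - GROUP_CAPACITY;

		group_util[busiest] -= excess;
		group_util[dst] += excess;
	}
}

static int any_group_overutilized(void)
{
	int g;

	for (g = 0; g < NGROUPS; g++)
		if (group_overutilized(g))
			return 1;
	return 0;
}
```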
This I agree with. This is not the correct place. I can move this under out_balanced in the load_balance() function. Any other recommendation on where this flag can be conclusively cleared?
If we want to keep the flag up until all over-utilization has been addressed, we can only clear it if all child domains have cleared theirs.
Maybe you can count the number of over-utilized child domains in update_sd_lb_stats(). If it is >1 then we know that we didn't fix all imbalances this time and we should leave the flag up. Would that work?
I have been thinking about this. I think clearing the flag in update_sd_lb_stats() if the group is not overutilized should be okay for now. If the group is overutilized, the flag will remain set and the next cpu in the group can attempt a load balance. If the group is not overutilized, we bail out anyway in find_busiest_group(). Do you think that makes sense?
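As a sketch of the counting idea above, here is a standalone model (not kernel code; the names, capacities, and the exact clearing threshold are assumptions for illustration) that counts over-utilized groups while gathering the domain stats and leaves the flag up when more than one remains, since a single pull cannot fix several groups at once:

```c
#include <assert.h>

/*
 * Toy model (not kernel code) of the suggestion in this thread: count
 * the over-utilized child groups while gathering domain statistics and
 * only clear the domain's flag when at most one group is still
 * over-utilized. All names are invented for this sketch.
 */
#define GROUP_CAPACITY 100

static int sd_overutilized_flag = 1;

/* Stand-in for the counting proposed for update_sd_lb_stats(). */
static int count_overutilized_groups(const int *group_util, int n)
{
	int g, count = 0;

	for (g = 0; g < n; g++)
		if (group_util[g] > GROUP_CAPACITY)
			count++;
	return count;
}

static void maybe_clear_sd_overutilized(const int *group_util, int n)
{
	/* More than one over-utilized group: this balance can't fix all. */
	if (count_overutilized_groups(group_util, n) <= 1)
		sd_overutilized_flag = 0;
}
```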
Morten
--
Regards
Thara