On Mon, Dec 19, 2016 at 09:16:58PM -0500, Thara Gopinath wrote:
On 12/19/2016 07:42 PM, Leo Yan wrote:
On Mon, Dec 19, 2016 at 10:17:29AM -0500, Thara Gopinath wrote:
[...]
But what is missing is handling of misfit tasks. Can we not handle a misfit task as a separate condition in update_sd_lb? I.e., in the above example, if either CPU A or CPU B has a misfit task, set the overutilized flag for the next-level SD, which is equivalent to setting the flag in the RD in this case.
Agreed, we can do this for misfit tasks :)
IIUC, the idea of your patch is to first use the SD level 2 flag to represent the "inner" overutilized state, and then later, in the load balance flow, to check whether the rd->overutilized flag needs to be set for the "outer" overutilized state. So for the 'misfit' case, we need to wait until the load balance flow to check it and set the rd->overutilized flag.
rd->overutilized is like the overutilized flag at any sched group level, but for the highest sched_domain, which does not have a parent. I am not sure I understand "inner" and "outer" overutilized properly. Say a CPU in a system has four levels of sched domain: level1, level2, level3 and level4. What my patch proposes is as follows. If a load balance has to happen for this CPU at level1, the flag will be set at the first sched group in level2. Similarly, if a load balance has to happen at level2, the flag will be set at the first sched group in level3. Following this, if a load balance has to happen at the highest level, i.e. level4, the flag will be set at the rd.
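A minimal sketch of the propagation described above, as a toy user-space model rather than the actual patch. All names (toy_domain, toy_group, toy_root_domain, toy_set_overutilized, toy_init) are hypothetical and are not kernel APIs; the only rule taken from the description is "flag the first sched group of the parent level, or the rd when there is no parent".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_LEVELS 4

/* Hypothetical stand-ins for sched_group, sched_domain and root_domain. */
struct toy_group {
	bool overutilized;
};

struct toy_domain {
	struct toy_domain *parent;	/* NULL for the highest level */
	struct toy_group first_group;	/* first sched group of this level */
};

struct toy_root_domain {
	bool overutilized;
};

/*
 * A load balance is needed at level 'sd': flag the first group of the
 * parent level, or the root domain if 'sd' is already the top level.
 */
static void toy_set_overutilized(struct toy_domain *sd,
				 struct toy_root_domain *rd)
{
	assert(sd != NULL && rd != NULL);

	if (sd->parent)
		sd->parent->first_group.overutilized = true;
	else
		rd->overutilized = true;
}

/* Chain level1 -> level2 -> level3 -> level4 and clear all flags. */
static void toy_init(struct toy_domain sd[TOY_NR_LEVELS],
		     struct toy_root_domain *rd)
{
	for (int i = 0; i < TOY_NR_LEVELS; i++) {
		sd[i].parent = (i + 1 < TOY_NR_LEVELS) ? &sd[i + 1] : NULL;
		sd[i].first_group.overutilized = false;
	}
	rd->overutilized = false;
}
```

With sd[0] as level1, calling toy_set_overutilized(&sd[0], &rd) only touches the level2 group; the rd flag is set only when the call is made for sd[3], the level without a parent.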
E.g. in the case above, after the rd->overutilized flag is set, the scheduler cannot distinguish which specific schedule group the load balance requirement is _coming_ from. The rd->overutilized flag is an overall flag indicating that load balancing should happen within level 4, but we lose information such as which schedule group in level 4 has a performance issue, so that the scheduler could help it.
I recognize that we have quite different understandings of how to use the 'overutilized' flag. One method is to use the "overutilized" flag to indicate that one specific schedule domain is over-utilized and so needs load balancing, but from these flags we cannot know which schedule groups within the SD have a performance bottleneck.
Another method is to use the "overutilized" flag to indicate that one specific schedule group has a performance bottleneck, so any schedule group can set the "overutilized" flag for itself. The scheduler can then easily know which schedule groups have a bottleneck ('who' the LB requirement comes from) and should migrate tasks out of them. I personally think this gives us more chances to do subtle optimizations with this information; for example, if we know "overutilized" happens in the LITTLE cluster, we can use a different strategy from when "overutilized" happens in the big cluster.
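To make the second method concrete, here is a rough sketch in which every group carries its own flag and the balancer can ask 'who' is overutilized. It is only an illustration under assumed names (toy_sched_group, toy_cluster, toy_overutilized_mask, toy_cluster_overutilized are all hypothetical), and it says nothing about what strategy to apply per cluster.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cluster type for a big.LITTLE style system. */
enum toy_cluster { TOY_LITTLE, TOY_BIG };

/* Hypothetical sched group that owns its own overutilized flag. */
struct toy_sched_group {
	enum toy_cluster cluster;
	bool overutilized;
};

/* Return a bitmask with bit i set when groups[i] is overutilized. */
static unsigned int toy_overutilized_mask(const struct toy_sched_group *groups,
					  size_t nr)
{
	unsigned int mask = 0;

	assert(nr <= 8 * sizeof(mask));
	for (size_t i = 0; i < nr; i++)
		if (groups[i].overutilized)
			mask |= 1u << i;
	return mask;
}

/* True if any group belonging to 'cluster' has flagged itself. */
static bool toy_cluster_overutilized(const struct toy_sched_group *groups,
				     size_t nr, enum toy_cluster cluster)
{
	for (size_t i = 0; i < nr; i++)
		if (groups[i].cluster == cluster && groups[i].overutilized)
			return true;
	return false;
}
```

Unlike a single domain-wide flag, the mask keeps the identity of the bottlenecked groups, so a caller could branch on toy_cluster_overutilized(groups, nr, TOY_LITTLE) versus TOY_BIG.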
This is why I suggest using 'discrete' flags at the corresponding SD level to represent the outer 'overutilized' state, so that we can set the flag right away for the outer 'overutilized' case rather than delaying it until the load balance flow.
Instead of directly setting the flag at the highest level, should we not try to balance the load out at a lower level, if possible?
For a 'misfit' task, we don't need to do load balancing at SD level 1; for other cases, we can first do load balancing at SD level 1.
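A small sketch of that rule, assuming SD level 1 spans only CPUs of the same capacity (so balancing there cannot help a misfit task). The struct and function names (toy_cpu_state, toy_first_balance_level) are hypothetical and only model the decision, not the real load balance path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU summary used only for this illustration. */
struct toy_cpu_state {
	bool has_misfit_task;	/* a task too big for this CPU's capacity */
	bool overutilized;	/* CPU utilization is above its threshold */
};

/*
 * Return the lowest SD level (1-based) at which load balancing should
 * start for this CPU, or 0 if no load balance is required.
 */
static int toy_first_balance_level(const struct toy_cpu_state *cpu)
{
	assert(cpu != NULL);

	if (cpu->has_misfit_task)
		return 2;	/* skip level 1, go to the next level up */
	if (cpu->overutilized)
		return 1;	/* try to balance at the lowest level first */
	return 0;
}
```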
Thanks, Leo Yan