Hi everyone, I have been looking at how different workloads react when the per entity load tracking metric is integrated into the load balancer, and at what the possible reasons for this might be.
I had posted the integration patch earlier: https://lkml.org/lkml/2012/11/15/391
Essentially what I am doing is:
1. I have disabled CONFIG_FAIR_GROUP_SCHED to keep the analysis simple.
2. I have replaced cfs_rq->load.weight in weighted_cpuload() with cfs.runnable_load_avg, the active load tracking metric.
3. I have replaced se.load.weight in task_h_load() with se.avg.load_avg_contrib, the per entity load tracking metric.
4. The load balancer ends up using these metrics.
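For concreteness, here is a minimal sketch of what steps 2 and 3 amount to against a 3.6/3.8-era kernel/sched/fair.c with CONFIG_FAIR_GROUP_SCHED disabled. The field names runnable_load_avg and avg.load_avg_contrib are assumed from the load-tracking patch set of that time; the actual integration is the patch linked above.

/* Load of a cpu as seen by the load balancer. */
static unsigned long weighted_cpuload(const int cpu)
{
	/* was: return cpu_rq(cpu)->load.weight; */
	return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/* Load contribution of a single task (no FAIR_GROUP_SCHED). */
static unsigned long task_h_load(struct task_struct *p)
{
	/* was: return p->se.load.weight; */
	return p->se.avg.load_avg_contrib;
}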
After running experiments on several workloads I found that, with the above integration, their performance neither improved nor deteriorated, and this observation was consistent.
Ideally the performance should have improved, considering that the metric tracks load more accurately.
Let me explain with a simple example why we should ideally see a performance improvement. Consider one 80% task and two 40% tasks.
With integration:
-----------------
          40%
  80%     40%
  cpu1    cpu2
The above is the scenario right after the tasks fork. This is a perfectly balanced system, hence no further load balancing, and the load is properly distributed across the cpus.
Without integration:
--------------------
  40%                                   40%
  80%     40%           OR       80%    40%
  cpu1    cpu2                   cpu1   cpu2
This is because all the tasks are viewed as having the same load. The load balancer could ping-pong tasks between these two situations.
When I performed this experiment, however, I did not see an improvement in performance in the former case. On further observation I found that the following was actually happening.
With integration
----------------
  Initially               40% task sleeps        40% task wakes up and
                                                 select_idle_sibling()
                                                 decides to wake it up on cpu1

          40%                                      40%
  80%     40%      ->     80%     40%      ->     80%     40%
  cpu1    cpu2            cpu1    cpu2            cpu1    cpu2
This makes load balancing trigger movement of the 40% task from cpu1 back to cpu2. Hence the stability that the load balancer was trying to achieve is gone, and the culprit boils down to select_idle_sibling(). How is it the culprit, and how is it hindering the performance of the workloads?
*What is the way ahead with the per entity load tracking metric in the load balancer then?*
In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105, he mentions the following:
"It is my intuition that the greatest carnage here is actually caused by wake-up load-balancing getting in the way of periodic in establishing a steady state. I suspect more mileage would result from reducing the interference wake-up load-balancing has with steady state."
"The whole point of using blocked load is so that you can converge on a steady state where you don't NEED to move tasks. What disrupts this is we naturally prefer idle cpus on wake-up balance to reduce wake-up latency. I think the better answer is making these two processes load balancing() and select_idle_sibling() more co-operative."
I had not realised how this would happen until I saw it happening in the above experiment.
Based on what Paul explained above, let us use the runnable load + the blocked load to calculate the load on a cfs runqueue, rather than just the runnable load (which is what I am doing now), and see the consequence.
  Initially               40% task sleeps

          40%
  80%     40%      ->     80%     40%
  cpu1    cpu2            cpu1    cpu2
So initially the load on cpu1 is, say, 80, and on cpu2 it is also 80: balanced. Now when the 40% task sleeps, the total load on cpu2 = runnable load + blocked load, which is still 80.
As a consequence, during periodic load balancing the load is not moved from cpu1 to cpu2 when the 40% task sleeps (cpu2's load is seen as 80, not 40), so the above scenario remains the same. What happens on wake up?
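As a rough illustration of this step (a sketch only, assuming the blocked_load_avg field that the load-tracking patches maintain per cfs_rq; this is not the eventual patch), the cpu load seen by the balancer would become:

static unsigned long weighted_cpuload(const int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;

	/*
	 * Count the load of sleeping (blocked) tasks too, so a cpu whose
	 * task just went to sleep does not suddenly look empty to the
	 * periodic balancer.
	 */
	return cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
}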
Here comes the point of making load balancing and wake-up balancing (select_idle_sibling()) co-operative. How about we always schedule the woken-up task on prev_cpu? This seems sensible, considering that load balancing now counts the blocked load as part of the load of cpu2.
If we do that, we end up scheduling the 40% task back on cpu2, back to the scenario which load balancing intended. Hence a steady state is maintained, no matter what, unless other tasks show up.
Note that treating prev_cpu as the default cpu to run the woken-up task on is possible only because we use the blocked load for load balancing purposes.
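A minimal sketch of the wake-up side of this co-operation (a hypothetical helper, not an existing kernel function), where the wakee simply stays on prev_cpu whenever it is allowed to run there:

static int wake_to_prev_cpu(struct task_struct *p)
{
	int prev_cpu = task_cpu(p);

	/*
	 * With blocked load accounted by the balancer, prev_cpu is where
	 * the steady state expects this task; only fall back to the
	 * waking cpu if prev_cpu is not allowed.
	 */
	if (cpumask_test_cpu(prev_cpu, tsk_cpus_allowed(p)))
		return prev_cpu;

	return smp_processor_id();
}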
The above steps of using the blocked load and selecting prev_cpu as the target for the woken-up task seem to me to be the next step. This could allow load balancing with the per entity load tracking metric to behave as it is supposed to, without anything else disrupting it, and here I expect a performance improvement.
Please do let me know your suggestions. This will greatly help in taking the right steps from here on towards the correct integration.
Thank you
Regards Preeti U Murthy
On Wed, 2013-01-02 at 09:52 +0530, Preeti U Murthy wrote:
Hi everyone, I have been looking at how different workloads react when the per entity load tracking metric is integrated into the load balancer, and at what the possible reasons for this might be.
I had posted the integration patch earlier: https://lkml.org/lkml/2012/11/15/391
Essentially what I am doing is:
1. I have disabled CONFIG_FAIR_GROUP_SCHED to keep the analysis simple.
2. I have replaced cfs_rq->load.weight in weighted_cpuload() with cfs.runnable_load_avg, the active load tracking metric.
3. I have replaced se.load.weight in task_h_load() with se.avg.load_avg_contrib, the per entity load tracking metric.
4. The load balancer ends up using these metrics.
After running experiments on several workloads I found that, with the above integration, their performance neither improved nor deteriorated, and this observation was consistent.
Ideally the performance should have improved, considering that the metric tracks load more accurately.
Let me explain with a simple example why we should ideally see a performance improvement. Consider one 80% task and two 40% tasks.
With integration:
-----------------
          40%
  80%     40%
  cpu1    cpu2
The above is the scenario right after the tasks fork. This is a perfectly balanced system, hence no further load balancing, and the load is properly distributed across the cpus.
Without integration:
--------------------
  40%                                   40%
  80%     40%           OR       80%    40%
  cpu1    cpu2                   cpu1   cpu2
This is because all the tasks are viewed as having the same load. The load balancer could ping-pong tasks between these two situations.
When I performed this experiment, however, I did not see an improvement in performance in the former case. On further observation I found that the following was actually happening.
With integration
----------------
  Initially               40% task sleeps        40% task wakes up and
                                                 select_idle_sibling()
                                                 decides to wake it up on cpu1

          40%                                      40%
  80%     40%      ->     80%     40%      ->     80%     40%
  cpu1    cpu2            cpu1    cpu2            cpu1    cpu2

This makes load balancing trigger movement of the 40% task from cpu1 back to cpu2. Hence the stability that the load balancer was trying to achieve is gone, and the culprit boils down to select_idle_sibling(). How is it the culprit, and how is it hindering the performance of the workloads?
*What is the way ahead with the per entity load tracking metric in the load balancer then?*
select_idle_sibling() is all about dynamic, lowering latency and cranking up cores during ramp-up to boost throughput. If you want the system to achieve a stable state with periodic balancing, you need to turn select_idle_sibling() the heck off. Once you've gotten the box mostly committed, it's just an overhead/bounce source anyway.
In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105, he mentions the following:
"It is my intuition that the greatest carnage here is actually caused by wake-up load-balancing getting in the way of periodic in establishing a steady state. I suspect more mileage would result from reducing the interference wake-up load-balancing has with steady state."
"The whole point of using blocked load is so that you can converge on a steady state where you don't NEED to move tasks. What disrupts this is we naturally prefer idle cpus on wake-up balance to reduce wake-up latency. I think the better answer is making these two processes load balancing() and select_idle_sibling() more co-operative."
The down-side of steady state seeking via load tracking being that you want to take N% average load tasks, and stack them on top of each other, which does nothing good for those tasks when they overlap in execution. Long term balance looks all pretty, but if one or more of them could have slipped into an idle shared cache, it's still a latency hit and utilization loss. You are at odds with select_idle_sibling()'s mission. It cares about the here and now, while load tracking cares about fuzzy long-term averages.
I had not realised how this would happen until I saw it happening in the above experiment.
select_idle_sibling()'s job is to be annoying as hell.. and it does that very well :)
Based on what Paul explained above, let us use the runnable load + the blocked load to calculate the load on a cfs runqueue, rather than just the runnable load (which is what I am doing now), and see the consequence.

  Initially               40% task sleeps

          40%
  80%     40%      ->     80%     40%
  cpu1    cpu2            cpu1    cpu2

So initially the load on cpu1 is, say, 80, and on cpu2 it is also 80: balanced. Now when the 40% task sleeps, the total load on cpu2 = runnable load + blocked load, which is still 80.
As a consequence, during periodic load balancing the load is not moved from cpu1 to cpu2 when the 40% task sleeps (cpu2's load is seen as 80, not 40), so the above scenario remains the same. What happens on wake up?
Here comes the point of making load balancing and wake-up balancing (select_idle_sibling()) co-operative. How about we always schedule the woken-up task on prev_cpu? This seems sensible, considering that load balancing now counts the blocked load as part of the load of cpu2.
Once committed, that works fine.. until load fluctuates. Don't plug the holes opportunistically, you lose throughput.
If we do that, we end up scheduling the 40% task back on cpu2, back to the scenario which load balancing intended. Hence a steady state is maintained, no matter what, unless other tasks show up.
Note that treating prev_cpu as the default cpu to run the woken-up task on is possible only because we use the blocked load for load balancing purposes.
The above steps of using the blocked load and selecting prev_cpu as the target for the woken-up task seem to me to be the next step. This could allow load balancing with the per entity load tracking metric to behave as it is supposed to, without anything else disrupting it, and here I expect a performance improvement.
Please do let me know your suggestions. This will greatly help in taking the right steps from here on towards the correct integration.
Again, I think you want a knob to turn select_idle_sibling() off. Load tracking's goal is to smooth, select_idle_sibling()'s goal is to peak.
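Such a knob could, for example, be a scheduler feature bit; the sketch below is purely illustrative, no such feature existed at the time.

/* kernel/sched/features.h */
SCHED_FEAT(WAKE_IDLE_SIBLING, true)

/* select_task_rq_fair(), wake-affine path */
	if (affine_sd) {
		if (sched_feat(WAKE_IDLE_SIBLING))
			new_cpu = select_idle_sibling(p, prev_cpu);
		else
			new_cpu = prev_cpu;	/* let periodic balance rule */
		goto unlock;
	}

With CONFIG_SCHED_DEBUG such a bit could then be flipped at runtime through /sys/kernel/debug/sched_features, which would make A/B testing the two policies cheap.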
-Mike
Hi Mike,
Thank you very much for your feedback. Considering your suggestions, I have posted a proposed solution below to prevent select_idle_sibling() from becoming a disadvantage to normal load balancing, and rather to aid it.
**This patch is *without* the enablement of the per entity load tracking metric.**
This is with an intention to correct the existing select_idle_sibling() mess before going ahead.
-------------------BEGIN PATCH--------------------------------------------------------
Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
The function of select_idle_sibling() is to place the woken-up task in the vicinity of the waking cpu or of the previous cpu, depending on what wake_affine() says, and only in an idle group. If an idle group is not found, the fallback cpu is the waking cpu or the previous cpu accordingly.

This results in the runqueue of the waking cpu or the previous cpu getting overloaded when the system is committed, which is a latency hit to these tasks.

What is required is that the newly woken-up task be placed close to the waking cpu or to the previous cpu, whichever is best, so as to avoid a latency hit and cache coldness respectively. This is achieved by letting wake_affine() decide which cache domain the task should be placed in.

Once this is decided, instead of searching for a completely idle group, let us search for the idlest group. This will anyway return a completely idle group if one exists, its mechanism falling back to what select_idle_sibling() was doing; but if that fails, find_idlest_group() continues the search for a relatively more idle group.

The argument could be that we wish to avoid migrating the newly woken-up task to any other group unless that group is completely idle. But in this case we begin by choosing a sched domain within which a migration would be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to co-operate with this.

This patch is based on the tip tree without enabling the per entity load tracking. This is with an intention to clear up the select_idle_sibling() mess before introducing the metric.
---
 include/linux/topology.h |    4 ++-
 kernel/sched/fair.c      |   61 +++++----------------------------------------
 2 files changed, 9 insertions(+), 56 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..eeb309e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
@@ -127,7 +127,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..c33eda7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,58 +3303,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
-/*
- * Try and locate an idle CPU in the sched_domain.
- */
-static int select_idle_sibling(struct task_struct *p, int target)
-{
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
-	int i;
-
-	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
-	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
-
-	/*
-	 * Otherwise, iterate the domains and find an elegible idle cpu.
-	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
-	return target;
-}
-
 #ifdef CONFIG_SCHED_NUMA
 static inline bool pick_numa_rand(int n)
 {
@@ -3469,8 +3417,13 @@ find_sd:
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (prev_cpu == task_cpu(p) && idle_cpu(prev_cpu) ||
+			prev_cpu == smp_processor_id() && idle_cpu(prev_cpu)) {
+			new_cpu = prev_cpu;
+			goto unlock;
+		} else {
+			sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		}
 	}
pick_idlest:
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for your feedback. Considering your suggestions, I have posted a proposed solution below to prevent select_idle_sibling() from becoming a disadvantage to normal load balancing, and rather to aid it.
**This patch is *without* the enablement of the per entity load tracking metric.**
This is with an intention to correct the existing select_idle_sibling() mess before going ahead.
Bench the traversal cost; BALANCE_WAKE used to be _way_ too expensive to live with for fast movers. It was turned on briefly, only to be turned off again. Maybe that's changed, cool if so.
But before you do that: picking any idle or somewhat more idle cpu in an entire large L3 package on every wakeup rapes fast moving communicating tasks at L2. You can't do that at high frequency and live, so any integration has to prevent high frequency bounce.. but at the same time, react quickly to ramp, pull communicating tasks together quickly.. and not harm 1:N loads that currently benefit massively from full package high frequency bounce :)
-------------------BEGIN PATCH--------------------------------------------------------
Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
The function of select_idle_sibling() is to place the woken-up task in the vicinity of the waking cpu or of the previous cpu, depending on what wake_affine() says, and only in an idle group. If an idle group is not found, the fallback cpu is the waking cpu or the previous cpu accordingly.

This results in the runqueue of the waking cpu or the previous cpu getting overloaded when the system is committed, which is a latency hit to these tasks.

What is required is that the newly woken-up task be placed close to the waking cpu or to the previous cpu, whichever is best, so as to avoid a latency hit and cache coldness respectively. This is achieved by letting wake_affine() decide which cache domain the task should be placed in.

Once this is decided, instead of searching for a completely idle group, let us search for the idlest group. This will anyway return a completely idle group if one exists, its mechanism falling back to what select_idle_sibling() was doing; but if that fails, find_idlest_group() continues the search for a relatively more idle group.

The argument could be that we wish to avoid migrating the newly woken-up task to any other group unless that group is completely idle. But in this case we begin by choosing a sched domain within which a migration would be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to co-operate with this.

This patch is based on the tip tree without enabling the per entity load tracking. This is with an intention to clear up the select_idle_sibling() mess before introducing the metric.

 include/linux/topology.h |    4 ++-
 kernel/sched/fair.c      |   61 +++++----------------------------------------
 2 files changed, 9 insertions(+), 56 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..eeb309e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
@@ -127,7 +127,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..c33eda7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,58 +3303,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
-/*
- * Try and locate an idle CPU in the sched_domain.
- */
-static int select_idle_sibling(struct task_struct *p, int target)
-{
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
-	int i;
-
-	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
-	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
-
-	/*
-	 * Otherwise, iterate the domains and find an elegible idle cpu.
-	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
-	return target;
-}
-
 #ifdef CONFIG_SCHED_NUMA
 static inline bool pick_numa_rand(int n)
 {
@@ -3469,8 +3417,13 @@ find_sd:
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (prev_cpu == task_cpu(p) && idle_cpu(prev_cpu) ||
+			prev_cpu == smp_processor_id() && idle_cpu(prev_cpu)) {
+			new_cpu = prev_cpu;
+			goto unlock;
+		} else {
+			sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		}
 	}
pick_idlest:
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for your feedback. Considering your suggestions, I have posted a proposed solution below to prevent select_idle_sibling() from becoming a disadvantage to normal load balancing, and rather to aid it.
**This patch is *without* the enablement of the per entity load tracking metric.**
This is with an intention to correct the existing select_idle_sibling() mess before going ahead.
Well, on the bright side, this is much kinder to light load tbench on 4x10 core (+ht) box than current bounce happy select_idle_sibling(). At the heavy load end I'm losing heaping truckloads of throughput though, likely some migration rate limiting would fix that up.
select_idle_sibling() doesn't have much of a pretty face these days, but its evil hag face shows more prominently than ever, so making it dead is a great goal ;-)
-Mike
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
The function of select_idle_sibling() is to place the woken-up task in the vicinity of the waking cpu or of the previous cpu, depending on what wake_affine() says, and only in an idle group. If an idle group is not found, the fallback cpu is the waking cpu or the previous cpu accordingly.

This results in the runqueue of the waking cpu or the previous cpu getting overloaded when the system is committed, which is a latency hit to these tasks.

What is required is that the newly woken-up task be placed close to the waking cpu or to the previous cpu, whichever is best, so as to avoid a latency hit and cache coldness respectively. This is achieved by letting wake_affine() decide which cache domain the task should be placed in.

Once this is decided, instead of searching for a completely idle group, let us search for the idlest group. This will anyway return a completely idle group if one exists, its mechanism falling back to what select_idle_sibling() was doing; but if that fails, find_idlest_group() continues the search for a relatively more idle group.

The argument could be that we wish to avoid migrating the newly woken-up task to any other group unless that group is completely idle. But in this case we begin by choosing a sched domain within which a migration would be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to co-operate with this.
What if..
Fast movers currently suffer from traversing large package, mostly due to traversal order walking 1:1 buddies hand in hand across the whole package endlessly. With only one buddy pair running, it's horrific.
Even if you change the order to be friendlier, perturbation induces bouncing. More spots to bounce too equals more bouncing. Ergo, I cross coupled cpu pairs to eliminate that. If buddies are perturbed, having one and only one buddy cpu pulls them back together, so can't induce a bounce fest, only correct. That worked well, but had the down side that some loads really REALLY want maximum spread, so suffer when you remove migration options as I did. There's in_interrupt() consideration I'm not so sure of too, in that case, going the extra mile to find an idle hole to plug _may_ be worth some extra cost too.. dunno.
So wrt integration, what if a buddy cpu were made a FIRST choice of generic wake balancing vs the ONLY choice of select_idle_sibling() as I did? If buddy cpu is available, cool, perturbed pairs find each other and pair back up, if not, and you were here too recently, you stay with prev_cpu, avoid bounce and traversal at high frequency. All tasks can try the cheap buddy cpu first, all can try full domain as well, just not at insane rates. The heavier the short term load average (or such, with instant decay on longish idle ala idle balance throttle so you ramp well), the longer the 'forget eating full balance' interval becomes, with cutoff affecting the cheap but also not free cross coupled buddy cpu as well at some point. Looking for an idle cpu at hefty load is a waste of cycles at best, plugging micro-holes does nothing good even if you find one, forget wake balance entirely at some cutoff, let periodic balancing do its thing in peace.
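A rough sketch of that ordering (buddy first, then a rate-limited wider look), assuming a hypothetical per-cpu buddy map in the spirit of the reverted commit 37407ea7 and an arbitrary throttle interval; Mike's actual patches differ:

DECLARE_PER_CPU(int, buddy_cpu);		/* hypothetical: cross-coupled cpu */
DECLARE_PER_CPU(unsigned long, last_wide_scan);	/* hypothetical: throttle stamp */

static int wake_balance_target(struct task_struct *p, int prev_cpu)
{
	int buddy = per_cpu(buddy_cpu, prev_cpu);

	/* First choice: the one and only buddy, so perturbed pairs re-pair. */
	if (idle_cpu(buddy) && cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
		return buddy;

	/* Scanned the package too recently: stay put, avoid a bounce fest. */
	if (time_before(jiffies, per_cpu(last_wide_scan, prev_cpu) + HZ / 100))
		return prev_cpu;

	/* Otherwise allow an occasional full-domain look for an idle cpu. */
	per_cpu(last_wide_scan, prev_cpu) = jiffies;
	return select_idle_sibling(p, prev_cpu);
}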
Hrmph, that's too many words, but basically, I think whacking select_idle_sibling() integration into wake balance makes loads of sense, but needs a bit more to not end up just moving the problems to a different spot.
I still have a 2.6-rt problem I need to find time to squabble with, but maybe I'll soonish see if what you did plus what I did combined works out on that 4x10 core box where current is _so_ unbelievably horrible. Heck, it can't get any worse, and the restricted wake balance alone kinda sorta worked.
-Mike
On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:
I still have a 2.6-rt problem I need to find time to squabble with, but maybe I'll soonish see if what you did plus what I did combined works out on that 4x10 core box where current is _so_ unbelievably horrible. Heck, it can't get any worse, and the restricted wake balance alone kinda sorta worked.
Actually, I flunked copy/paste 101. Below (preeti) shows the real deal.
tbench, 3 runs, 30 secs/run                    revert = 37407ea7 reverted

clients                   1        5       10       20       40       80
3.6.0.virgin          27.83   139.50  1488.76  4172.93  6983.71  8301.73
                      29.23   139.98  1500.22  4162.92  6907.16  8231.13
                      30.00   141.43  1500.09  3975.50  6847.24  7983.98

3.6.0+revert         281.08  1404.76  2802.44  5019.49  7080.97  8592.80
                     282.38  1375.70  2747.23  4823.95  7052.15  8508.45
                     270.69  1375.53  2736.29  5243.05  7058.75  8806.72

3.6.0+preeti          26.43   126.62  1027.23  3350.06  7004.22  7561.83
                      26.67   128.66   922.57  3341.73  7045.05  7662.18
                      25.54   129.20  1015.02  3337.60  6591.32  7634.33

3.6.0+best_combined  280.48  1382.07  2730.27  4786.20  6477.28  7980.07
                     276.88  1392.50  2708.23  4741.25  6590.99  7992.11
                     278.92  1368.55  2735.49  4614.99  6573.38  7921.75

3.0.51-0.7.9-default 286.44  1415.37  2794.41  5284.39  7282.57 13670.80
Something is either wrong with 3.6 itself, or the config I'm using, as max throughput is nowhere near where it should be (see default). On the bright side, integrating the two does show some promise.
-Mike
Hi Mike,

Thank you very much for your inputs. Just a few thoughts so that we are clear about the problems seen so far in scheduler scalability, and about the direction in which we ought to move to correct them.

1. During fork or exec, the scheduler goes through find_idlest_group() and find_idlest_cpu() in select_task_rq_fair(), iterating through all domains. Why then was a similar approach not followed for wake-up balancing? What was so different about wake-ups (except that the woken-up task had to remain close to the prev/waking cpu) that we had to introduce select_idle_sibling() in the first place?

2. To the best of my knowledge, the concept of a buddy cpu was introduced in select_idle_sibling() so as to avoid the entire package traversal and restrict it to the buddy cpus alone. But even during fork or exec we iterate through all the sched domains, as I have mentioned above. Why did the buddy cpu solution not come to the rescue here as well?

3. So the real trade-off stands thus: avoid iterating through the entire package at the cost of being less aggressive in finding an idle cpu, or iterate through the package with the intention of finding the idlest cpu. To the best of my understanding the former is your approach, or commit 37407ea7, and the latter is what I tried to do. But as you have rightly pointed out, my approach will have scaling issues. In this light, what does your best_combined patch (below) look like? Do you introduce a cut-off value on the loads to decide which approach to take?

Meanwhile I will also try to run tbench and a few other benchmarks to find out why the results are as below. I will update you very soon on this.
Thank you
Regards Preeti U Murthy
On 01/06/2013 10:02 PM, Mike Galbraith wrote:
On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:
I still have a 2.6-rt problem I need to find time to squabble with, but maybe I'll soonish see if what you did plus what I did combined works out on that 4x10 core box where current is _so_ unbelievably horrible. Heck, it can't get any worse, and the restricted wake balance alone kinda sorta worked.
Actually, I flunked copy/paste 101. Below (preeti) shows the real deal.
tbench, 3 runs, 30 secs/run                    revert = 37407ea7 reverted

clients                   1        5       10       20       40       80
3.6.0.virgin          27.83   139.50  1488.76  4172.93  6983.71  8301.73
                      29.23   139.98  1500.22  4162.92  6907.16  8231.13
                      30.00   141.43  1500.09  3975.50  6847.24  7983.98

3.6.0+revert         281.08  1404.76  2802.44  5019.49  7080.97  8592.80
                     282.38  1375.70  2747.23  4823.95  7052.15  8508.45
                     270.69  1375.53  2736.29  5243.05  7058.75  8806.72

3.6.0+preeti          26.43   126.62  1027.23  3350.06  7004.22  7561.83
                      26.67   128.66   922.57  3341.73  7045.05  7662.18
                      25.54   129.20  1015.02  3337.60  6591.32  7634.33

3.6.0+best_combined  280.48  1382.07  2730.27  4786.20  6477.28  7980.07
                     276.88  1392.50  2708.23  4741.25  6590.99  7992.11
                     278.92  1368.55  2735.49  4614.99  6573.38  7921.75

3.0.51-0.7.9-default 286.44  1415.37  2794.41  5284.39  7282.57 13670.80
Something is either wrong with 3.6 itself, or the config I'm using, as max throughput is nowhere near where it should be (see default). On the bright side, integrating the two does show some promise.
-Mike
On Mon, 2013-01-07 at 10:59 +0530, Preeti U Murthy wrote:
Hi Mike,

Thank you very much for your inputs. Just a few thoughts so that we are clear about the problems seen so far in scheduler scalability, and about the direction in which we ought to move to correct them.

1. During fork or exec, the scheduler goes through find_idlest_group() and find_idlest_cpu() in select_task_rq_fair(), iterating through all domains. Why then was a similar approach not followed for wake-up balancing? What was so different about wake-ups (except that the woken-up task had to remain close to the prev/waking cpu) that we had to introduce select_idle_sibling() in the first place?
Cost. select_idle_sibling() was introduced specifically to recoup overlap via placing wakee in a shared L2 where there is no pain really, it's all gain. L3 ain't the same deal, but it still does a good job, even for fast movers (on Intel) if you're not bouncing them all over the place. Breakeven depends on architecture, but critical for fast movers on all is no bounce allowed. Wakeup is a whole different thing than fork/exec balancing, there, you just want to spread to seed your future, and it doesn't matter much where you send wakees. With wakeup of fast movers, you lose on every migration, so buddies really really need to find each other fast, and stay put, so you really do recoup overlap instead of just paying again and again. My old Q6600 kicks the snot out of that 10 core Westmere CPU in a common desktop box cross core scheduling scenario. That's sick.
2. To the best of my knowledge, the concept of a buddy cpu was introduced in select_idle_sibling() so as to avoid the entire package traversal and restrict it to the buddy cpus alone. But even during fork or exec we iterate through all the sched domains, as I have mentioned above. Why did the buddy cpu solution not come to the rescue here as well?
I'm looking (well, pondering) fork/exec. It may be a common case win too, but I kinda doubt it'll matter. It's the 1:1 waker/wakee relation that suffers mightily. The buddy was introduced specifically to bind 1:1 sw buddies (common) in hw as well. They are one, make it so. It was the simplest solution I could think of to both kill the evil, and cut cost to restore select_idle_sibling() to the optimization it was intended to be vs the pessimization it has turned into on large L3 CPUs. I took it back to its functional roots. Weld two modern cores together, create core2duo on steroids, optimization works again and is cheap again.
That the current behavior has its upsides too is.. most unfortunate :)
3. So the real trade-off stands thus: avoid iterating through the entire package at the cost of being less aggressive in finding an idle cpu, or iterate through the package with the intention of finding the idlest cpu. To the best of my understanding the former is your approach, or commit 37407ea7, and the latter is what I tried to do. But as you have rightly pointed out, my approach will have scaling issues. In this light, what does your best_combined patch (below) look like?
It doesn't exist anymore, morphed into ever less wonderful as the day went on. Integration needs to fiddle with find_* to cut some wasted cycles, and methinks reordered groups _may_ make it usable with no idle-buddy hint. Using the idle-buddy first looked promising, but bottom line is that at higher load, you use it for a modest fee, buddy was busy, so you then do the full scan paying a bigger fee on top of it, which of course crops the high end. Reorder should kill the low load horror, but you've still got the problem that as you approach saturation, the less value any search has. That, and the logic is not necessarily as light as it must be when targeting wakeups, where every scheduler cycle detracts substantially from fast/light task performance.
Do you introduce a cut-off value on the loads to decide which approach to take?
I fiddled with that, but then whacked it, because the result needs to be nice to 1:1 and 1:N which may need to spread at high frequency, so the cutoff would be harmful. Seems likely either a cutoff or 1:1 vs 1:N detection or _something_ will be needed to cut cost when it's flat not affordable. With 1:N pgbench (and ilk), there's no such thing as too expensive a search, but for fast/light network stuff there is.
-Mike
Hi Mike,
Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed-solution pieces in current scheduler scalability, the following is what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu for woken-up tasks to go to. "Correctly" means finding an idle cpu at the lowest cost possible.
2. The cost could be lowered either by optimizing the order of the search for an idle cpu, or by restricting the search to a few cpus alone.
3. The former has the problem that it does not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially hurt the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.

This is a somewhat tangled problem, but I think the solution can at best be smoothed into a flowchart.
          STEP 1                        STEP 2                     STEP 3
  _____________________          _________________           ________________
 |See if the idle buddy|   No   | Do we search the|   Yes   |Optimized search|
 |is free at all sched |------->| sched domains   |-------->|________________|
 |domains              |        | for an idle cpu |                  |
 |_____________________|        |_________________|                  |
           |                            |                            |
           | Yes                        | No: saturated system       |
           v                            v                            v
    Return cpu buddy             Return prev_cpu            Return target cpu
I don't know how well we can do STEP 2; that seems to be the biggest challenge. STEP 1 is the reverted commit 37407ea7, "sched: Improve scalability via 'CPU buddies', which withstand random perturbations", reverted for reasons which can hopefully be overcome by this flowchart.
I have tried to tackle STEP 3. STEP 3 will not prevent bouncing, but a good STEP 2 could tell us whether the bounce is worth it.
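Put as code, the three steps could be wired together roughly as in the sketch below; the buddy_cpu map and the saturation test are placeholders only, and the real STEP 3 is the patch that follows.

DECLARE_PER_CPU(int, buddy_cpu);	/* hypothetical map, as in STEP 1 / commit 37407ea7 */

static bool wake_search_worthwhile(int cpu)
{
	/*
	 * Placeholder for STEP 2: here "worth it" simply means this cpu's
	 * runqueue is not already overloaded; the real test is the open
	 * question above.
	 */
	return cpu_rq(cpu)->nr_running < 2;
}

static int wakeup_flow(struct task_struct *p, int prev_cpu)
{
	int buddy = per_cpu(buddy_cpu, prev_cpu);

	/* STEP 1: is the idle buddy free? */
	if (idle_cpu(buddy))
		return buddy;

	/* STEP 2: decide whether a wider search can pay off at all. */
	if (!wake_search_worthwhile(prev_cpu))
		return prev_cpu;		/* saturated system */

	/* STEP 3: optimized search for an idle cpu. */
	return select_idle_sibling(p, prev_cpu);
}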
STEP3 Patch is given below:
***********************START PATCH**************************************
sched: Reduce the overhead of select_idle_sibling
From: Preeti U Murthy preeti@linux.vnet.ibm.com
The traversal of the sched domains to find the idlest cpu is currently a costly operation in select_idle_sibling(), due to large L3 caches, so the traversal had better be worth the effort.

In the current approach, in select_idle_sibling() the sched domains are searched top to bottom, starting at the last level cache domain, and the search is for a completely idle *group*. It is less likely that you will find a completely idle group at a higher level sched domain than at a lower level one, which makes the first few iterations fruitless most of the time. And it is less likely to find an idle *group* than an idle *cpu*.

This patch aims at finding an idle cpu. Since the iterations go from bottom to top, stopping at the last level cache domain, it first tries to see if the task can be scheduled on a core which shares the L2 cache with the waking cpu/prev cpu, before probing the cpus at the higher level domains, trying to make the cost of migration lower than in the current approach.

Also it tries to return an idle cpu as soon as it finds one, without querying the status of the other cpus in the sched group. This could potentially be much faster, unless the worst case of not finding an idle cpu in any domain occurs.
---
 kernel/sched/fair.c |   33 +++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..6123bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3310,7 +3310,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
+	struct sched_domain *sd, *tmp;
 	struct sched_group *sg;
 	int i;
 
@@ -3332,24 +3332,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
+	/*
+	 * Iterate the domains bottom up, to cash in on the shared lower level
+	 * cache advantages.
+	 * Since this iteration is costly, return any idle cpu and don't wait
+	 * for a completely idle group.
+	 */
+	for_each_domain(target, tmp) {
+		sg = tmp->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
+				if (idle_cpu(i)) {
+					target = i;
+					goto done;
+				}
 			}
-
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
 next:
 			sg = sg->next;
-		} while (sg != sd->groups);
+		} while (sg != tmp->groups);
+
+		/* if you have covered the llc domain then break out, it is
+		 * not worth migrating on any higher level domain
+		 */
+		if (tmp == sd)
+			break;
 	}
 done:
 	return target;
Regards Preeti U Murthy
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed-solution pieces in current scheduler scalability, the following is what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu for woken-up tasks to go to. "Correctly" means finding an idle cpu at the lowest cost possible.
2. The cost could be lowered either by optimizing the order of the search for an idle cpu, or by restricting the search to a few cpus alone.
3. The former has the problem that it does not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially hurt the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.

This is a somewhat tangled problem, but I think the solution can at best be smoothed into a flowchart.
          STEP 1                        STEP 2                     STEP 3
  _____________________          _________________           ________________
 |See if the idle buddy|   No   | Do we search the|   Yes   |Optimized search|
 |is free at all sched |------->| sched domains   |-------->|________________|
 |domains              |        | for an idle cpu |                  |
 |_____________________|        |_________________|                  |
           |                            |                            |
           | Yes                        | No: saturated system       |
           v                            v                            v
    Return cpu buddy             Return prev_cpu            Return target cpu
I have re-written the patch as follows. hackbench/aim9 doesn't show a clean performance change; actually we can get some benefit, but it will be very slight. :) BTW, it still needs another patch applied before this one; this is just to show the logic.
===========
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization
The current logic in this function insists on waking up the task in a totally idle group; otherwise it falls back to the previous cpu.
The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), with an idle cpu in a smaller domain having higher priority.
It should help somewhat on burst wake-up benchmarks like aim7.
Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |   40 +++++++++++++++++++---------------------
 1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
-	for_each_lower_domain(sd) {
+	for_each_domain(prev_cpu, sd) {
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			prev_cpu = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+			int i;
+
+			/* no idle cpu in the group */
+			if (nr_busy == sg->group_weight)
+				continue;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+					tsk_cpus_allowed(p))
+				if (idle_cpu(i))
+					return i;
+		} while (sg = sg->next, sg != sd->groups);
+
+		/* only wake up task on the same cpu socket as prev cpu */
+		if (sd == per_cpu(sd_llc, prev_cpu))
+			break;
 	}
-done:
 	return prev_cpu;
 }
 
@@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	if (affine_sd) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, affine_sd, sync);
 		goto unlock;
 	}
On Wed, 16 Jan 2013 22:08:21 +0800, Alex Shi wrote:
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed-solution pieces in current scheduler scalability, the following is what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu for woken-up tasks to go to. "Correctly" means finding an idle cpu at the lowest cost possible.
2. The cost could be lowered either by optimizing the order of the search for an idle cpu, or by restricting the search to a few cpus alone.
3. The former has the problem that it does not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially hurt the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.

This is a somewhat tangled problem, but I think the solution can at best be smoothed into a flowchart.
          STEP 1                        STEP 2                     STEP 3
  _____________________          _________________           ________________
 |See if the idle buddy|   No   | Do we search the|   Yes   |Optimized search|
 |is free at all sched |------->| sched domains   |-------->|________________|
 |domains              |        | for an idle cpu |                  |
 |_____________________|        |_________________|                  |
           |                            |                            |
           | Yes                        | No: saturated system       |
           v                            v                            v
    Return cpu buddy             Return prev_cpu            Return target cpu
I have re-written the patch as follows. hackbench/aim9 doesn't show a clean performance change; actually we can get some benefit, but it will be very slight. :) BTW, it still needs another patch applied before this one; this is just to show the logic.
===========
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization
The current logic in this function insists on waking up the task in a totally idle group; otherwise it falls back to the previous cpu.
Or current cpu depending on result of wake_affine(), right?
The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), with an idle cpu in a smaller domain having higher priority.
But what about SMT domain?
I mean, it seems that the code prefers running a task on an idle cpu which is a sibling thread in the same core rather than running it on an idle cpu in another idle core. I guess we didn't do that before.
It should help somewhat on burst wake-up benchmarks like aim7.
Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>

 kernel/sched/fair.c |   40 +++++++++++++++++++---------------------
 1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)
Where are these arguments used?
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
-	for_each_lower_domain(sd) {
+	for_each_domain(prev_cpu, sd) {
Always start from the prev_cpu?
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			prev_cpu = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+			int i;
+
+			/* no idle cpu in the group */
+			if (nr_busy == sg->group_weight)
+				continue;
Maybe we can skip local group since it's a bottom-up search so we know there's no idle cpu in the lower domain from the prior iteration.
+			for_each_cpu_and(i, sched_group_cpus(sg),
+					tsk_cpus_allowed(p))
+				if (idle_cpu(i))
+					return i;
+		} while (sg = sg->next, sg != sd->groups);
+
+		/* only wake up task on the same cpu socket as prev cpu */
+		if (sd == per_cpu(sd_llc, prev_cpu))
+			break;
 	}
-done:
 	return prev_cpu;
 }
 
@@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	if (affine_sd) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, affine_sd, sync);
 		goto unlock;
 	}
Hi Namhyung,
I have re-written the patch as follows. hackbench/aim9 doesn't show a clean performance change; actually we can get some benefit, but it will be very slight. :) BTW, it still needs another patch applied before this one; this is just to show the logic.
===========
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization
The current logic in this function insists on waking up the task in a totally idle group; otherwise it falls back to the previous cpu.
Or current cpu depending on result of wake_affine(), right?
The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), with an idle cpu in a smaller domain having higher priority.
But what about SMT domain?
The previous approach also descended down to the SMT domain; here we start from the SMT domain.
You could check in /proc/schedstat which domains a cpu is a part of; the SMT domain happens to be domain0. As far as I know, for_each_lower_domain() will descend down to domain0.
I mean, it seems that the code prefers running a task on an idle cpu which is a sibling thread in the same core rather than running it on an idle cpu in another idle core. I guess we didn't do that before.
It should help somewhat on burst wake-up benchmarks like aim7.
Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>

 kernel/sched/fair.c |   40 +++++++++++++++++++---------------------
 1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)
Where are these arguments used?
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
-	for_each_lower_domain(sd) {
+	for_each_domain(prev_cpu, sd) {
Always start from the prev_cpu?
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			prev_cpu = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+			int i;
+
+			/* no idle cpu in the group */
+			if (nr_busy == sg->group_weight)
+				continue;
Maybe we can skip local group since it's a bottom-up search so we know there's no idle cpu in the lower domain from the prior iteration.
We could have done this with the current logic, because it checks for an *idle* group and the local group would definitely fail that test. But here we need to check the local group as well, because we are looking for an idle cpu.
Regards Preeti U Murthy
On 01/17/2013 01:17 PM, Namhyung Kim wrote:
On Wed, 16 Jan 2013 22:08:21 +0800, Alex Shi wrote:
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed-solution pieces in current scheduler scalability, the following is what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu for woken-up tasks to go to. "Correctly" means finding an idle cpu at the lowest cost possible.
2. The cost could be lowered either by optimizing the order of the search for an idle cpu, or by restricting the search to a few cpus alone.
3. The former has the problem that it does not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially hurt the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.

This is a somewhat tangled problem, but I think the solution can at best be smoothed into a flowchart.
          STEP 1                        STEP 2                     STEP 3
  _____________________          _________________           ________________
 |See if the idle buddy|   No   | Do we search the|   Yes   |Optimized search|
 |is free at all sched |------->| sched domains   |-------->|________________|
 |domains              |        | for an idle cpu |                  |
 |_____________________|        |_________________|                  |
           |                            |                            |
           | Yes                        | No: saturated system       |
           v                            v                            v
    Return cpu buddy             Return prev_cpu            Return target cpu
I have re-written the patch as follows. hackbench/aim9 doesn't show a clean performance change; actually we can get some benefit, but it will be very slight. :) BTW, it still needs another patch applied before this one; this is just to show the logic.
===========
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization
The current logic in this function insists on waking up the task in a totally idle group; otherwise it falls back to the previous cpu.
Or current cpu depending on result of wake_affine(), right?
The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), with an idle cpu in a smaller domain having higher priority.
But what about SMT domain?
I mean, it seems that the code prefers running a task on an idle cpu which is a sibling thread in the same core rather than running it on an idle cpu in another idle core. I guess we didn't do that before.
It should help somewhat on burst wake-up benchmarks like aim7.
Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>

 kernel/sched/fair.c |   40 +++++++++++++++++++---------------------
 1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)
Where are these arguments used?
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
-	for_each_lower_domain(sd) {
+	for_each_domain(prev_cpu, sd) {
Always start from the prev_cpu?
A previous patch checks wake_affine() and sets prev_cpu = cpu accordingly.
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			prev_cpu = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+			int i;
+
+			/* no idle cpu in the group */
+			if (nr_busy == sg->group_weight)
+				continue;
Maybe we can skip local group since it's a bottom-up search so we know there's no idle cpu in the lower domain from the prior iteration.
I made this change, but the results seem worse on my machines; I guess starting the idle-cpu search bottom-up is a bad idea. The following is the full version with the above change.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_PREFER_SIBLING			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..271b335 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3169,6 +3169,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		return 1;
 	}
+	/* bias toward prev cpu */
 	return 0;
 }
@@ -3252,53 +3253,56 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
+	struct sched_domain *sd, *llcsd;
 	struct sched_group *sg;
-	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (idle_cpu(cpu))
 		return cpu;
 
 	/*
 	 * If the task is going to be woken-up on the cpu where it previously
 	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
+	if (cpu != prev_cpu && idle_cpu(prev_cpu))
 		return prev_cpu;
 
+	if (cpu != prev_cpu && !wake_affine(affine_sd, p, sync))
+		cpu = prev_cpu;
+
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
+	llcsd = rcu_dereference(per_cpu(sd_llc, cpu));
+	for_each_domain(cpu, sd) {
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+			int i;
+
+			/* skip local group and if no idle cpu in group */
+			if (cpumask_test_cpu(cpu, sched_group_cpus(sg)) ||
+					nr_busy == sg->group_weight)
+				continue;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+					tsk_cpus_allowed(p))
+				if (idle_cpu(i))
+					return i;
+		} while (sg = sg->next, sg != sd->groups);
+
+		/* only wake up task on the same cpu socket as prev cpu */
+		if (sd == llcsd)
+			break;
 	}
-done:
-	return target;
+	return cpu;
 }
 
 /*
@@ -3351,10 +3355,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
-
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, affine_sd, sync);
 		goto unlock;
 	}
I also tried to keep the top-down seeking mode and to return any idle cpu instead of the first cpu of an idle group, but the result doesn't show any improvement on the hackbench benchmark.
===
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..fb85094 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3169,6 +3169,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		return 1;
 	}
+	/* bias toward prev cpu */
 	return 0;
 }

@@ -3252,7 +3253,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p,
+				struct sched_domain *affine_sd, int sync)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
@@ -3264,20 +3266,23 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (idle_cpu(cpu))
 		return cpu;

 	/*
 	 * If the task is going to be woken-up on the cpu where it previously
 	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
+	if (cpu != prev_cpu && idle_cpu(prev_cpu))
 		return prev_cpu;

+	if (cpu != prev_cpu && !wake_affine(affine_sd, p, sync))
+		cpu = prev_cpu;
+
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
@@ -3290,7 +3295,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					goto next;
 			}

-			target = cpumask_first_and(sched_group_cpus(sg),
+			cpu = cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
 			goto done;
 next:
@@ -3298,7 +3303,7 @@ next:
 		} while (sg != sd->groups);
 	}
 done:
-	return target;
+	return cpu;
 }

 /*
@@ -3351,10 +3356,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}

 	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
-
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, affine_sd, sync);
 		goto unlock;
 	}
Hi Alex,
On 01/16/2013 07:38 PM, Alex Shi wrote:
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed solution pieces in the current scheduler scalability discussion, the following is what I found:
1. select_idle_sibling() is needed as an agent to correctly find the right cpu for waking tasks to go to. "Correctly" means finding an idle cpu at the lowest possible cost.
2. "Cost could be lowered" either by optimizing the order of searching for an idle cpu or by restricting the search to a few cpus alone.
3. The former has the problem that it would not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially hurt fast-moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.
This is a somewhat tangled problem, but I think the solution can at best be smoothed into a flowchart.
STEP 1: See if the idle buddy is free in any of the sched domains.
        Yes -> return the cpu buddy.
        No  -> go to STEP 2.
STEP 2: Do we search the sched domains for an idle cpu?
        No (saturated system) -> return prev_cpu.
        Yes -> go to STEP 3.
STEP 3: Do an optimized search and return the target cpu found.
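To make the flow concrete, here is a rough C sketch of the three steps (purely illustrative; idle_buddy_of(), sched_domains_saturated() and optimized_idle_cpu_search() are hypothetical helpers, not existing kernel APIs):

/* STEP 1: if the idle buddy of prev_cpu is free, use it.
 * STEP 2: if the system is saturated, don't bother searching; stay on prev_cpu.
 * STEP 3: otherwise do an optimized search for an idle cpu.
 */
static int wakeup_target(struct task_struct *p, int prev_cpu)
{
	int buddy = idle_buddy_of(prev_cpu);		/* hypothetical */

	if (buddy >= 0 && idle_cpu(buddy))
		return buddy;

	if (sched_domains_saturated(prev_cpu))		/* hypothetical */
		return prev_cpu;

	return optimized_idle_cpu_search(p, prev_cpu);	/* hypothetical */
}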
I rewrote the patch as follows. hackbench/aim9 doesn't show a clear performance change. Actually we may get some benefit, but it will be very slight. :) BTW, it still needs another patch before this one can be applied; this is just to show the logic.
===========
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization
The current logic in this function insists on waking up the task in a totally idle group; otherwise it would rather fall back to the previous cpu.
As Namhyung pointed out, this could be the waking cpu as well.
The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), while an idle cpu in a smaller domain has higher priority.
Here is where the problem of large sockets comes in; select_idle_sibling() has its main weakness here. No doubt this patch could improve performance over the current logic, but it will still retain the major drawback of searching for an idle cpu across a large socket in the worst case.
It should help on burst wake-up benchmarks like aim7.
Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>

 kernel/sched/fair.c | 40 +++++++++++++++++++---------------------
 1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+				struct sched_domain *affine_sd, int sync)
As Namhyung pointed out, where is affine_sd being used?
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
int i;
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
- sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
- for_each_lower_domain(sd) {
- for_each_domain(prev_cpu, sd) {
Why is prev_cpu being used? Ideally it should be the target cpu (waking/prev) depending on what wake_affine() decides.
sg = sd->groups; do {
if (!cpumask_intersects(sched_group_cpus(sg),
tsk_cpus_allowed(p)))
goto next;
for_each_cpu(i, sched_group_cpus(sg)) {
if (!idle_cpu(i))
goto next;
}
prev_cpu = cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
goto done;
-next:
sg = sg->next;
} while (sg != sd->groups);
int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
int i;
/* no idle cpu in the group */
if (nr_busy == sg->group_weight)
continue;
for_each_cpu_and(i, sched_group_cpus(sg),
tsk_cpus_allowed(p))
if (idle_cpu(i))
return i;
} while (sg = sg->next, sg != sd->groups);
/* only wake up task on the same cpu socket as prev cpu */
if (sd == per_cpu(sd_llc, prev_cpu))
			break;
	}
-done:
	return prev_cpu;
The target cpu needs to be returned, right? I don't understand why prev_cpu is considered everywhere.
}
@@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) }
if (affine_sd) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, affine_sd, sync);
 		goto unlock;
 	}
To overcome the drawback of large sockets, I had suggested using blocked_load+runnable_load of PJT's metric and in select_idle_sibling() querying only the L2 cache domains.
https://lkml.org/lkml/2013/1/8/619
What do you think about this?
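For reference, a minimal sketch of the weighted_cpuload() variant this suggestion implies, assuming the runnable_load_avg and blocked_load_avg fields introduced by the per-entity load tracking patches (illustrative, not the posted patch):

static unsigned long weighted_cpuload(const int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* runnable + blocked: a recently-sleeping task still "occupies" its cpu */
	return rq->cfs.runnable_load_avg + rq->cfs.blocked_load_avg;
}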
Regards Preeti U Murthy
On 2 January 2013 05:22, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Here comes the point of making both load balancing and wake-up balancing (select_idle_sibling) co-operative. How about we always schedule the woken-up task on prev_cpu? This seems more sensible considering that load balancing counts the blocked load as part of the load of cpu2.
Hi Preeti,
I'm not sure that we want such a steady state at the core level, because we take advantage of migrating wake-up tasks between cores that share their cache, as Matthew demonstrated. But I agree that reaching such a steady state at the cluster and CPU level is interesting.
IMHO, you're right that taking the blocked load into consideration should minimize task migration between clusters, but it should not prevent fast task migration between cores that share their cache.
Vincent
If we do that, we end up scheduling the 40% task back on cpu2, back to the scenario that load balancing intended. Hence a steady state is maintained no matter what, unless other tasks show up.
Note that considering prev_cpu as the default cpu to run the woken-up task on is possible only because we use blocked load for load balancing purposes.
The above steps of using blocked load and selecting prev_cpu as the target for the woken-up task seem to me to be the next step. This could allow load balancing with the per entity load tracking metric to behave as it is supposed to, without anything else disrupting it. And here I expect a performance improvement.
Please do let me know your suggestions. This will greatly help in taking the right steps from here on towards the correct integration.
Thank you
Regards Preeti U Murthy
On 01/07/2013 09:18 PM, Vincent Guittot wrote:
On 2 January 2013 05:22, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Here comes the point of making both load balancing and wake up balance(select_idle_sibling) co operative. How about we always schedule the woken up task on the prev_cpu? This seems more sensible considering load balancing considers blocked load as being a part of the load of cpu2.
Hi Preeti,
I'm not sure that we want such steady state at cores level because we take advantage of migrating wake up tasks between cores that share their cache as Matthew demonstrated. But I agree that reaching such steady state at cluster and CPU level is interesting.
IMHO, you're right that taking the blocked load into consideration should minimize tasks migration between cluster but it should no prevent fast task migration between cores that share their cache
True, Vincent. But I think the one disadvantage, even at the cpu or cluster level, is that when we consider blocked load, we might prevent any more tasks from being scheduled on that cpu during periodic load balance if the blocked load is too high. This is very poor cpu utilization.
Also, we can consider steady states only if the waking tasks have a specific waking pattern. I am not sure if we can risk hoping that the blocked task would wake up soon, or would wake up at time 'x' and utilize that cpu.
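To illustrate the concern with the thread's own numbers: if cpu2 has one 40% task running and another 40% task blocked, its load is seen as roughly 80, the same as cpu1 running an 80% task, even though cpu2 is actually idle about half the time; periodic load balancing would then leave new work elsewhere although cpu2 has spare capacity right now.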
Vincent
Regards Preeti U Murthy
On 8 January 2013 07:06, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 01/07/2013 09:18 PM, Vincent Guittot wrote:
On 2 January 2013 05:22, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Here comes the point of making both load balancing and wake up balance(select_idle_sibling) co operative. How about we always schedule the woken up task on the prev_cpu? This seems more sensible considering load balancing considers blocked load as being a part of the load of cpu2.
Hi Preeti,
I'm not sure that we want such steady state at cores level because we take advantage of migrating wake up tasks between cores that share their cache as Matthew demonstrated. But I agree that reaching such steady state at cluster and CPU level is interesting.
IMHO, you're right that taking the blocked load into consideration should minimize tasks migration between cluster but it should no prevent fast task migration between cores that share their cache
True Vincent.But I think the one disadvantage even at cpu or cluster level is that when we consider blocked load, we might prevent any more tasks from being scheduled on that cpu during periodic load balance if the blocked load is too much.This is very poor cpu utilization
The blocked load of a cluster will be high if the blocked tasks have run recently. The contribution of a blocked task is divided by 2 every 32ms, which means that a high blocked load will be made up of recently running tasks, and long-sleeping tasks will not influence the load balancing. The load balance period is between 1 tick (10ms for idle load balance on ARM) and up to 256ms (for busy load balance), so a high blocked load should imply some tasks that have run recently; otherwise the blocked load will be small and will not have a large influence on the load balance.
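As a rough illustration of that decay (simplified; the real code decays per ~1ms period with a factor chosen so that 32 periods halve the contribution, and handles partial periods):

/* Illustrative only: halve a blocked task's load contribution for every
 * full 32ms it has been asleep.
 */
static u64 decayed_blocked_contrib(u64 contrib, unsigned int ms_asleep)
{
	unsigned int halvings = ms_asleep / 32;

	return halvings >= 64 ? 0 : contrib >> halvings;
}

So a task that has been asleep for about 96ms contributes only around 1/8 of its former load, which is why only recently-run tasks can keep the blocked load high.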
Also we can consider steady states if the waking tasks have a specific waking pattern.I am not sure if we can risk hoping that the blocked task would wake up soon or would wake up at time 'x' and utilize that cpu.
OK, so you don't consider using blocked load in load balancing any more?
regards, Vincent
Vincent
Regards Preeti U Murthy
True Vincent.But I think the one disadvantage even at cpu or cluster level is that when we consider blocked load, we might prevent any more tasks from being scheduled on that cpu during periodic load balance if the blocked load is too much.This is very poor cpu utilization
The blocked load of a cluster will be high if the blocked tasks have run recently. The contribution of a blocked task will be divided by 2 each 32ms, so it means that a high blocked load will be made of recent running tasks and the long sleeping tasks will not influence the load balancing. The load balance period is between 1 tick (10ms for idle load balance on ARM) and up to 256 ms (for busy load balance) so a high blocked load should imply some tasks that have run recently otherwise your blocked load will be small and will not have a large influence on your load balance
Makes a lot of sense.
Also we can consider steady states if the waking tasks have a specific waking pattern.I am not sure if we can risk hoping that the blocked task would wake up soon or would wake up at time 'x' and utilize that cpu.
Ok, so you don't consider to use blocked load in load balancing any more ?
Hmm, this has got me thinking. I thought the existing select_idle_sibling() problem of bouncing tasks all over the L3 package and taking time to find an idle buddy could be solved in isolation with PJT's metric, but that does not seem to be the case, considering the suggestions from you and Mike.
Currently there are so many approaches proposed to improve the scheduler that it is confusing how and which pieces fit well together. Let me lay them down. Please do help me put them together.
Jigsaw Piece 1: Use PJT's metric in load balancing, with blocked load + runnable load as part of the cpu load during load balancing.
Jigsaw Piece 2: select_idle_sibling() choosing the cpu to wake up tasks on.
Jigsaw Piece 3: The 'cpu buddy' concept to prevent bouncing of tasks.
Considering both your suggestions and Mike's, what do you think of the following puzzle and solution?
*Puzzle*: Waking tasks should not take too much time to find a cpu to run on, should not keep bouncing across too many cpus all over the package, and should try, as far as possible, not to create too much of an imbalance in the load distribution across the cpus.
*Solution:*
Place Jigsaw Piece 1 first: Use PJT's metric, with blocked load + runnable load as part of the cpu load, while load balancing. (As time passes, the blocked load becomes less significant on that cpu, hence load balancing will go on as usual.)
Place Jigsaw Piece 2 next: When tasks wake up, **use select_idle_sibling() only to see if you can migrate tasks between cores that share their cache**, IOW see if a cpu in the lowest level sched domain is idle. If it is, schedule on it, and migrate_task_rq_fair() will remove the load from prev_cpu; if not idle, return prev_cpu, which had already counted the blocked load as part of its overall load. Hence very little imbalance will be created. (A rough sketch of this follows below.)
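A rough sketch of the restricted wake-up path described in this step (illustrative only, not a posted patch): probe just the lowest-level sched domain of prev_cpu and otherwise fall back to prev_cpu, relying on the blocked load to keep the periodic balance stable:

static int select_idle_cache_sibling(struct task_struct *p, int prev_cpu)
{
	struct sched_domain *sd;
	int i;

	if (idle_cpu(prev_cpu))
		return prev_cpu;

	/*
	 * Lowest attached domain of prev_cpu, e.g. its SMT/core siblings.
	 * The caller (select_task_rq_fair) already holds rcu_read_lock().
	 */
	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
	if (sd) {
		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p))
			if (idle_cpu(i))
				return i;
	}

	return prev_cpu;
}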
*Possible End Picture*
Waking tasks will not take long to find a cpu, since we probe cpus at only one sched domain level. The bouncing of tasks will be restricted to the core level. An imbalance will not be created, as the blocked load is also considered while load balancing.
*Concerns*
1. Is the wake-up load balancing in this solution so much less aggressive that it harms throughput significantly?
2. Do we need Jigsaw Piece 3 at all?
Please do let me know what you all think. Thank you very much for your suggestions.
regards, Vincent
Regards Preeti U Murthy
On 01/09/2013 11:14 AM, Preeti U Murthy wrote:
True Vincent.But I think the one disadvantage even at cpu or cluster level is that when we consider blocked load, we might prevent any more tasks from being scheduled on that cpu during periodic load balance if the blocked load is too much.This is very poor cpu utilization
The blocked load of a cluster will be high if the blocked tasks have run recently. The contribution of a blocked task will be divided by 2 each 32ms, so it means that a high blocked load will be made of recent running tasks and the long sleeping tasks will not influence the load balancing. The load balance period is between 1 tick (10ms for idle load balance on ARM) and up to 256 ms (for busy load balance) so a high blocked load should imply some tasks that have run recently otherwise your blocked load will be small and will not have a large influence on your load balance
I just tried using cfs's runnable_load_avg + blocked_load_avg in weighted_cpuload() with my v3 patchset; aim9 shared-workfile testing shows the performance dropped 70% on the NHM EP machine. :(
Oops, the performance is still worse than just counting runnable_load_avg, but the drop is not so big: it dropped 30%, not 70%.
Hi Alex, Thank you very much for running the benchmark below on blocked_load + runnable_load :) Just a few queries.
How did you do the wake up balancing? Did you iterate over the L3 package looking for an idle cpu? Or did you just query the L2 package for an idle cpu?
I think that when you are using blocked_load + runnable_load it would be better to query just the L2 package, as Vincent had pointed out, because the rationale behind using blocked_load + runnable_load is to keep a steady state across cpus, unless we could reap the advantage of moving the blocked load to a sibling core when it wakes up.
And the drop in performance is relative to what?
1. Your v3 patchset with runnable_load_avg in weighted_cpu_load().
2. Your v3 patchset with runnable_load_avg + blocked_load_avg in weighted_cpu_load().
Are the above two what you are comparing? And in the above two versions have you included your [PATCH] sched: use instant load weight in burst regular load balance?
On 01/20/2013 09:22 PM, Alex Shi wrote:
The blocked load of a cluster will be high if the blocked tasks have run recently. The contribution of a blocked task will be divided by 2 each 32ms, so it means that a high blocked load will be made of recent running tasks and the long sleeping tasks will not influence the load balancing. The load balance period is between 1 tick (10ms for idle load balance on ARM) and up to 256 ms (for busy load balance) so a high blocked load should imply some tasks that have run recently otherwise your blocked load will be small and will not have a large influence on your load balance
Just tried using cfs's runnable_load_avg + blocked_load_avg in weighted_cpuload() with my v3 patchset, aim9 shared workfile testing show the performance dropped 70% more on the NHM EP machine. :(
Ops, the performance is still worse than just count runnable_load_avg. But dropping is not so big, it dropped 30%, not 70%.
Thank you
Regards Preeti U Murthy
On 01/21/2013 10:40 AM, Preeti U Murthy wrote:
Hi Alex, Thank you very much for running the below benchmark on blocked_load+runnable_load:) Just a few queries.
How did you do the wake up balancing? Did you iterate over the L3 package looking for an idle cpu? Or did you just query the L2 package for an idle cpu?
I just used the current select_idle_sibling() function, so it searches in the L3 package.
I think when you are using blocked_load+runnable_load it would be better if we just query the L2 package as Vincent had pointed out because the fundamental behind using blocked_load+runnable_load is to keep a steady state across cpus unless we could reap the advantage of moving the blocked load to a sibling core when it wakes up.
And the drop of performance is relative to what?
It is 2 vs 3.8-rc3.
1.Your v3 patchset with runnable_load_avg in weighted_cpu_load(). 2.Your v3 patchset with runnable_load_avg+blocked_load_avg in weighted_cpu_load().
Are the above two what you are comparing? And in the above two versions have you included your [PATCH] sched: use instant load weight in burst regular load balance?
No, that patch is not included.
Thank you
Regards Preeti U Murthy