 
            Hi Alex,
On 01/16/2013 07:38 PM, Alex Shi wrote:
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
Hi Mike,
Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed solution pieces in the current scheduler scalability,the following was what I found:
- select_idle_sibling() is needed as an agent to correctly find the right cpu for wake up tasks to go to."Correctly" would be to find an idle cpu at the lowest cost possible.
2."Cost could be lowered" either by optimizing the order of searching for an idle cpu or restricting the search to a few cpus alone. 3. The former has the problem that it would not prevent bouncing tasks all over the domain sharing an L3 cache,which could potentially affect the fast moving tasks. 4. The latter has the problem that it is not aggressive enough in finding an idle cpu.
This is some tangled problem,but I think the solution at best could be smoothed to a a flowchart.
STEP1 STEP2 STEP3
| | |See if the idle buddy|No _________________ Yes ________________ |is free at all sched |---->| Do we search the|----> |Optimized search| |domains | |sched domains | |________________| |_____________________| |for an idle cpu | | |Yes |_________________| |/ |/ |No: saturated Return target cpu Return |/ system cpu buddy Return prev_cpu
I re-written the patch as following. hackbench/aim9 doest show clean performance change. Actually we can get some profit. it also will be very slight. :) BTW, it still need another patch before apply this. Just to show the logical.
=========== From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001 From: Alex Shi alex.shi@intel.com Date: Fri, 11 Jan 2013 16:49:03 +0800 Subject: [PATCH 3/7] sched: select_idle_sibling optimization
Current logical in this function will insist to wake up the task in a totally idle group, otherwise it would rather back to previous cpu.
As Namhyung pointed out this could be the waking cpu as well.
The new logical will try to wake up the task on any idle cpu in the same cpu socket (in same sd_llc), while idle cpu in the smaller domain has higher priority.
Here is where the problem of large sockets come in.select_idle_sibling() has its main weakness here.No doubt that this patch could improve performance over the current logic,but will still retain the major drawback of searching for an idle cpu in a large socket in the worst case.
It should has some help on burst wake up benchmarks like aim7.
Original-patch-by: Preeti U Murthy preeti@linux.vnet.ibm.com Signed-off-by: Alex Shi alex.shi@intel.com
kernel/sched/fair.c | 40 +++++++++++++++++++--------------------- 1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e116215..fa40e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) /*
- Try and locate an idle CPU in the sched_domain.
*/ -static int select_idle_sibling(struct task_struct *p) +static int select_idle_sibling(struct task_struct *p,
struct sched_domain *affine_sd, int sync)
As Namhyung pointed out where is affine_sd being used?
{ int cpu = smp_processor_id(); int prev_cpu = task_cpu(p); struct sched_domain *sd; struct sched_group *sg;
int i;
/*
- If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p) /* * Otherwise, iterate the domains and find an elegible idle cpu. */
- sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
- for_each_lower_domain(sd) {
- for_each_domain(prev_cpu, sd) {
Why is prev_cpu being used? Ideally it should be the target cpu (waking/prev) depending on what wake_affine() decides.
sg = sd->groups; do {
if (!cpumask_intersects(sched_group_cpus(sg),
tsk_cpus_allowed(p)))
goto next;
for_each_cpu(i, sched_group_cpus(sg)) {
if (!idle_cpu(i))
goto next;
}
prev_cpu = cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
goto done;-next:
sg = sg->next;
} while (sg != sd->groups);
int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
int i;
/* no idle cpu in the group */
if (nr_busy == sg->group_weight)
continue;
for_each_cpu_and(i, sched_group_cpus(sg),
tsk_cpus_allowed(p))
if (idle_cpu(i))
return i;
} while (sg = sg->next, sg != sd->groups);
/* only wake up task on the same cpu socket as prev cpu */
if (sd == per_cpu(sd_llc, prev_cpu))}
break;-done: return prev_cpu;
target cpu needs to be returned right? I dont understand why prev_cpu is considered everywhere.
}
@@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) }
if (affine_sd) {
new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock; }
new_cpu = select_idle_sibling(p, affine_sd, sync);
To overcome the drawback of large sockets, I had suggested using blocked_load+runnable_load of PJT's metric and in select_idle_sibling() querying only the L2 cache domains.
https://lkml.org/lkml/2013/1/8/619
What do you think about this?
Regards Preeti U Murthy