Hi Mike,
Thank you very much for your feedback. Following your suggestions, I have posted a proposed solution that keeps select_idle_sibling() from working against normal load balancing and makes it aid load balancing instead.
**This patch does *not* enable the per-entity load tracking metric.**
The intention is to correct the existing select_idle_sibling() mess before going ahead.
-------------------BEGIN PATCH--------------------------------------------------------
Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
The function of select_idle_sibling() is to place the woken-up task in the vicinity of the waking cpu or on the previous cpu, depending on what wake_affine() says. The task is placed only in a completely idle group. If an idle group is not found, the fallback cpu is either the waking cpu or the previous cpu accordingly.
This results in the runqueue of the waking cpu or the previous cpu getting overloaded when the system is committed, which is a latency hit to these tasks.
What is required is that newly woken-up tasks be placed close to the waking cpu or the previous cpu, whichever is best, to avoid a latency hit and cache coldness respectively. This is achieved with wake_affine() deciding which cache domain the task should be placed on.
Once this is decided, instead of searching for a completely idle group, let us search for the idlest group. This will anyway return a completely idle group if one exists, so in that case the behaviour matches what select_idle_sibling() was doing. But if no such group exists, find_idlest_group() continues the search for a relatively more idle group.
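To illustrate the difference between the two searches, here is a rough user-space sketch. It is not kernel code: the toy_* names and the struct are made up for illustration, and the real find_idlest_group() also accounts for cpu power, the imbalance percentage and the cpus the task is allowed on, all of which this ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: a group is just an array of per-cpu loads. */
struct toy_group {
	const unsigned long *cpu_load;	/* 0 means the cpu is idle */
	size_t nr_cpus;
};

/* Old behaviour: first group whose cpus are all idle, or -1 if none. */
static int toy_find_idle_group(const struct toy_group *g, size_t nr_groups)
{
	size_t i, c;

	for (i = 0; i < nr_groups; i++) {
		for (c = 0; c < g[i].nr_cpus; c++)
			if (g[i].cpu_load[c] != 0)
				break;
		if (c == g[i].nr_cpus)
			return (int)i;
	}
	return -1;
}

/* Proposed behaviour: group with the lowest total load (assumes equal-sized groups). */
static int toy_find_idlest_group(const struct toy_group *g, size_t nr_groups)
{
	unsigned long load, min_load = (unsigned long)-1;
	int idlest = -1;
	size_t i, c;

	for (i = 0; i < nr_groups; i++) {
		load = 0;
		for (c = 0; c < g[i].nr_cpus; c++)
			load += g[i].cpu_load[c];
		if (load < min_load) {
			min_load = load;
			idlest = (int)i;
		}
	}
	return idlest;
}

/* Two groups, neither completely idle: only the second search finds a target. */
static void toy_example(void)
{
	static const unsigned long busy[] = { 3, 2 };
	static const unsigned long light[] = { 0, 1 };
	const struct toy_group g[] = { { busy, 2 }, { light, 2 } };

	assert(toy_find_idle_group(g, 2) == -1);
	assert(toy_find_idlest_group(g, 2) == 1);
}
```

When a completely idle group does exist, its total load is zero, so both toy searches agree on it. That is the sense in which the idlest-group search subsumes the old one.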
The argument could be that we wish to avoid migrating the newly woken-up task to any other group unless that group is completely idle. But in this case we begin by choosing a sched domain within which a migration is less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to cooperate with this.
This patch is based on the tip tree without enabling the per-entity load tracking. The intention is to clear up the select_idle_sibling() mess before introducing the metric.
---
 include/linux/topology.h |    4 ++-
 kernel/sched/fair.c      |   61 +++++----------------------------------------
 2 files changed, 9 insertions(+), 56 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..eeb309e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
@@ -127,7 +127,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..c33eda7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,58 +3303,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
-/*
- * Try and locate an idle CPU in the sched_domain.
- */
-static int select_idle_sibling(struct task_struct *p, int target)
-{
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
-	int i;
-
-	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
-	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
-
-	/*
-	 * Otherwise, iterate the domains and find an elegible idle cpu.
-	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
-	return target;
-}
-
 #ifdef CONFIG_SCHED_NUMA
 static inline bool pick_numa_rand(int n)
 {
@@ -3469,8 +3417,13 @@ find_sd:
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (prev_cpu == task_cpu(p) && idle_cpu(prev_cpu) ||
+			prev_cpu == smp_processor_id() && idle_cpu(prev_cpu)) {
+			new_cpu = prev_cpu;
+			goto unlock;
+		} else {
+			sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		}
 	}
 
 pick_idlest: