Hi Viresh,
Here is a patch that introduces global load balancing on top of the existing HMP patch set. It depends on the HMP patches already present in your task-placement-v2 branch. It can also be applied on top of the HMP sysfs patches if needed; the conflict resolution should be trivial.
Could you include it in the MP branch for the 12.12 release? Testing with sysbench and coremark shows significant performance improvements for parallel workloads, as all cpus can now be used for cpu-intensive tasks.
Thanks, Morten
Morten Rasmussen (1):
  sched: Basic global balancing support for HMP
 kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 97 insertions(+), 4 deletions(-)
This patch introduces an extra check at task up-migration to prevent overloading the cpus in the faster hmp_domain while the slower hmp_domain is not fully utilized. It also introduces a periodic balance check that can down-migrate tasks when the faster domain is oversubscribed and the slower one is under-utilized.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 97 insertions(+), 4 deletions(-)
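A note on the thresholds used in the code below: they are all expressed relative to NICE_0_LOAD. As a rough reference only (assuming NICE_0_LOAD resolves to 1024, i.e. no extra load resolution; the small program and its macro are purely illustrative and not part of the patch), this sketch shows how the constants map onto the percentages quoted in the code comments:

/*
 * Illustrative only, not part of the patch: maps the raw threshold
 * constants used below to the percentages quoted in the comments.
 * Assumes NICE_0_LOAD == 1024 (no extra load resolution).
 */
#include <stdio.h>

#define NICE_0_LOAD 1024

int main(void)
{
	/* up-migration/offload gate: faster domain counts as full above ~94% */
	printf("domain full   : %.2f%%\n",
	       100.0 * (NICE_0_LOAD - 64) / NICE_0_LOAD);	/* 93.75 */

	/* offload gate: cpu counts as oversubscribed above ~194% */
	printf("oversubscribed: %.2f%%\n",
	       100.0 * (2 * NICE_0_LOAD - 64) / NICE_0_LOAD);	/* 193.75 */

	/* starvation limit 768/1024: task runs >= 75%, i.e. waits < 25% */
	printf("starvation    : %.2f%% running\n",
	       100.0 * 768 / NICE_0_LOAD);			/* 75.00 */

	/* slower domain must be below 50% to accept an offloaded task */
	printf("spare cycles  : %.2f%%\n",
	       100.0 * (NICE_0_LOAD / 2) / NICE_0_LOAD);	/* 50.00 */

	return 0;
}

The 64 subtracted from NICE_0_LOAD simply leaves a small margin (64/1024, about 6%) below full load.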
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cfe112..7ac47c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3249,6 +3249,80 @@ static inline void hmp_next_down_delay(struct sched_entity *se, int cpu)
 	se->avg.hmp_last_down_migration = cfs_rq_clock_task(cfs_rq);
 	se->avg.hmp_last_up_migration = 0;
 }
+
+static inline unsigned int hmp_domain_min_load(struct hmp_domain *hmpd,
+						int *min_cpu)
+{
+	int cpu;
+	int min_load = INT_MAX;
+	int min_cpu_temp = NR_CPUS;
+
+	for_each_cpu_mask(cpu, hmpd->cpus) {
+		if (cpu_rq(cpu)->cfs.tg_load_contrib < min_load) {
+			min_load = cpu_rq(cpu)->cfs.tg_load_contrib;
+			min_cpu_temp = cpu;
+		}
+	}
+
+	if (min_cpu)
+		*min_cpu = min_cpu_temp;
+
+	return min_load;
+}
+
+/*
+ * Calculate the task starvation
+ * This is the ratio of actually running time vs. runnable time.
+ * If the two are equal the task is getting the cpu time it needs or
+ * it is alone on the cpu and the cpu is fully utilized.
+ */
+static inline unsigned int hmp_task_starvation(struct sched_entity *se)
+{
+	u32 starvation;
+
+	starvation = se->avg.usage_avg_sum * scale_load_down(NICE_0_LOAD);
+	starvation /= (se->avg.runnable_avg_sum + 1);
+
+	return scale_load(starvation);
+}
+
+static inline unsigned int hmp_offload_down(int cpu, struct sched_entity *se)
+{
+	int min_usage;
+	int dest_cpu = NR_CPUS;
+
+	if (hmp_cpu_is_slowest(cpu))
+		return NR_CPUS;
+
+	/* Is the current domain fully loaded? */
+	/* load < ~94% */
+	min_usage = hmp_domain_min_load(hmp_cpu_domain(cpu), NULL);
+	if (min_usage < NICE_0_LOAD-64)
+		return NR_CPUS;
+
+	/* Is the cpu oversubscribed? */
+	/* load < ~194% */
+	if (cpu_rq(cpu)->cfs.tg_load_contrib < 2*NICE_0_LOAD-64)
+		return NR_CPUS;
+
+	/* Is the task alone on the cpu? */
+	if (cpu_rq(cpu)->cfs.nr_running < 2)
+		return NR_CPUS;
+
+	/* Is the task actually starving? */
+	if (hmp_task_starvation(se) > 768) /* <25% waiting */
+		return NR_CPUS;
+
+	/* Does the slower domain have spare cycles? */
+	min_usage = hmp_domain_min_load(hmp_slower_domain(cpu), &dest_cpu);
+	/* load > 50% */
+	if (min_usage > NICE_0_LOAD/2)
+		return NR_CPUS;
+
+	if (cpumask_test_cpu(dest_cpu, &hmp_slower_domain(cpu)->cpus))
+		return dest_cpu;
+	return NR_CPUS;
+}
 #endif /* CONFIG_SCHED_HMP */
 
 /*
@@ -5643,10 +5717,14 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
 			< hmp_next_up_threshold)
 		return 0;
 
-	if (se->avg.load_avg_ratio > hmp_up_threshold &&
-		cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
-					tsk_cpus_allowed(p))) {
-		return 1;
+	if (se->avg.load_avg_ratio > hmp_up_threshold) {
+		/* Target domain load < ~94% */
+		if (hmp_domain_min_load(hmp_faster_domain(cpu), NULL)
+			> NICE_0_LOAD-64)
+			return 0;
+		if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
+					tsk_cpus_allowed(p)))
+			return 1;
 	}
 	return 0;
 }
@@ -5868,6 +5946,21 @@ static void hmp_force_up_migration(int this_cpu)
 			hmp_next_up_delay(&p->se, target->push_cpu);
 		}
 	}
+	if (!force && !target->active_balance) {
+		/*
+		 * For now we just check the currently running task.
+		 * Selecting the lightest task for offloading will
+		 * require extensive book keeping.
+		 */
+		target->push_cpu = hmp_offload_down(cpu, curr);
+		if (target->push_cpu < NR_CPUS) {
+			target->active_balance = 1;
+			target->migrate_task = p;
+			force = 1;
+			trace_sched_hmp_migrate(p, target->push_cpu, 2);
+			hmp_next_down_delay(&p->se, target->push_cpu);
+		}
+	}
 	raw_spin_unlock_irqrestore(&target->lock, flags);
 	if (force)
 		stop_one_cpu_nowait(cpu_of(target),
On Fri, Dec 7, 2012 at 5:33 PM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> Could you include it in the MP branch for the 12.12 release? Testing with sysbench and coremark shows significant performance improvements for parallel workloads, as all cpus can now be used for cpu-intensive tasks.
Morten,
Can you share some performance numbers and/or kernelshark-type graphs with and without this patch? It would be very interesting to see the changes.
Monday is the deadline for getting this merged into the MP tree in time for the release, and it is already the end of the week, so I am not sure how much testing and review can be done before then. Your numbers might make a compelling argument.
Regards, Amit
Hi Amit,
I should have included the numbers in the cover letter. Here are the numbers for TC2.
sysbench (normalized execution time, lower is better)

threads    2     4     8
HMP        1.00  1.00  1.00
HMP+GB     1.00  0.67  0.58

coremark (normalized iterations per second, higher is better)

threads    2     4     8
HMP        1.00  1.00  1.00
HMP+GB     1.00  1.39  1.73
So there is a clear benefit to utilizing the A7s. It actually saves energy too, as the whole benchmark completes faster.
Regards, Morten
On 7 December 2012 18:43, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> So there is a clear benefit to utilizing the A7s. It actually saves energy too, as the whole benchmark completes faster.
Hi Morten,
I have applied your patch now and pushed v13. Please cross-check v13 to see if everything is correct.
On 07/12/12 14:54, Viresh Kumar wrote:
> I have applied your patch now and pushed v13. Please cross-check v13 to see if everything is correct.
It looks right to me.
Morten