capacity_spare_wake in the slow path influences the choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a suboptimal group can be chosen, hurting the performance of the task being woken up.
Several tests with results are included below to show improvements with this change.
1) Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
------------------------------------------------------------
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
2) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1]. However, his recent tests showed that my patch can replace his slow path changes [1], and that there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees.
barrier.c (OpenMP code) is used as a micro-benchmark. It runs a number of iterations and does a barrier sync at the end of each for loop.
Here barrier.c is running along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX'
barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html
The following are the iterations per second with this micro-benchmark (higher is better), on a 2-socket, 44-core, 88-thread Intel x86 machine:

+--------+------------------+---------------------------+
|Threads | Without patch    | With patch                |
|        |                  |                           |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+
3) ping+hackbench test on bare-metal server (Rohit ran this test)
-----------------------------------------------------------------
Here hackbench is running in threaded mode, along with ping running on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX'

This test is running on a 2-socket, 20-core, 40-thread Intel x86 machine. Number of loops is 10000 and runtime is in seconds (lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   | Without patch   | With patch               |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev | Mean           | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   | 0.828 (+2.77%) | 0.032   |
|2             | 1.083 | 0.203   | 1.087 (-0.37%) | 0.246   |
|4             | 1.601 | 0.051   | 1.611 (-0.62%) | 0.055   |
|8             | 2.837 | 0.060   | 2.827 (+0.35%) | 0.031   |
|16            | 5.139 | 0.133   | 5.107 (+0.63%) | 0.085   |
|25            | 7.569 | 0.142   | 7.503 (+0.88%) | 0.143   |
+--------------+-------+---------+----------------+---------+
[1] https://patchwork.kernel.org/patch/9991635/
Matt Fleming also ran cyclictest and several different hackbench tests on his test machines to sanity-check that the patch doesn't harm any of his use cases.
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b8e749..ba9609407cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
 }

 /*
--
2.15.0.448.gf294e3d99a-goog
On 9 November 2017 at 19:52, Joel Fernandes joelaf@google.com wrote:
capacity_spare_wake in the slow path influences the choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a suboptimal group can be chosen, hurting the performance of the task being woken up.
Several tests with results are included below to show improvements with this change.
- Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
"4x4 ARM64 Octa core" is confusing . At least for me, 4x4 means 16 cores :-)
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
- Rohit ran barrier.c test (details below) with following improvements:

This was Rohit's original use case for a patch he posted at [1]. However, his recent tests showed that my patch can replace his slow path changes [1], and that there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees.

barrier.c (OpenMP code) is used as a micro-benchmark. It runs a number of iterations and does a barrier sync at the end of each for loop.

Here barrier.c is running along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX'

barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html

The following are the iterations per second with this micro-benchmark (higher is better), on a 2-socket, 44-core, 88-thread Intel x86 machine:

+--------+------------------+---------------------------+
|Threads | Without patch    | With patch                |
|        |                  |                           |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+

- ping+hackbench test on bare-metal server (Rohit ran this test)

Here hackbench is running in threaded mode, along with ping running on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX'

This test is running on a 2-socket, 20-core, 40-thread Intel x86 machine. Number of loops is 10000 and runtime is in seconds (lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   | Without patch   | With patch               |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev | Mean           | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   | 0.828 (+2.77%) | 0.032   |
|2             | 1.083 | 0.203   | 1.087 (-0.37%) | 0.246   |
|4             | 1.601 | 0.051   | 1.611 (-0.62%) | 0.055   |
|8             | 2.837 | 0.060   | 2.827 (+0.35%) | 0.031   |
|16            | 5.139 | 0.133   | 5.107 (+0.63%) | 0.085   |
|25            | 7.569 | 0.142   | 7.503 (+0.88%) | 0.143   |
+--------------+-------+---------+----------------+---------+

[1] https://patchwork.kernel.org/patch/9991635/

Matt Fleming also ran cyclictest and several different hackbench tests on his test machines to sanity-check that the patch doesn't harm any of his use cases.

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Joel Fernandes <joelaf@google.com>

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b8e749..ba9609407cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);

Makes sense.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

 }

 /*
--
2.15.0.448.gf294e3d99a-goog
Hi Vincent,
Thanks a lot for your reply, and sorry for the late reply. Actually I just started paternity leave, so that's why the delay. My working hours are completely random at the moment :-)
On Fri, Nov 10, 2017 at 12:29 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
On 9 November 2017 at 19:52, Joel Fernandes joelaf@google.com wrote:
capacity_spare_wake in the slow path influences the choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a suboptimal group can be chosen, hurting the performance of the task being woken up.
Several tests with results are included below to show improvements with this change.
- Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
"4x4 ARM64 Octa core" is confusing . At least for me, 4x4 means 16 cores :-)
Sure I'll fix it, I meant 4 big and 4 LITTLE CPUs :)
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
Yes, I'm not fully sure why 16 tasks didn't show that much improvement. I can try to trace it when I get a chance. Generally for this test, the improvement gets smaller as the number of tasks grows. However, you're right to point out that the improvement with 32 tasks is larger than with 16 for this test.
[..]
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b8e749..ba9609407cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);

Makes sense.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thanks!
- Joel
On 16 November 2017 at 22:53, Joel Fernandes joelaf@google.com wrote:
Hi Vincent,
Thanks a lot for your reply, and sorry for the late reply. Actually I just started paternity leave so that's why the delay. My working hours
Congratulations !
are completely random at the moment :-)
On Fri, Nov 10, 2017 at 12:29 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
On 9 November 2017 at 19:52, Joel Fernandes joelaf@google.com wrote:
capacity_spare_wake in the slow path influences the choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a suboptimal group can be chosen, hurting the performance of the task being woken up.
Several tests with results are included below to show improvements with this change.
- Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
"4x4 ARM64 Octa core" is confusing . At least for me, 4x4 means 16 cores :-)
Sure I'll fix it, I meant 4 big and 4 LITTLE CPUs :)
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
Yes I'm not fully sure why 16 tasks didn't show that much improvement.
Yes. This is just to make sure that there is no unexpected side effect.
I can try to trace it when I get a chance. Generally for this test, the improvement gets smaller as the number of tasks grows. However, you're right to point out that the improvement with 32 tasks is larger than with 16 for this test.
[..]
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b8e749..ba9609407cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);

Makes sense.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thanks!
- Joel
Hi Vincent,
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
Yes I'm not fully sure why 16 tasks didn't show that much improvement.
Yes. This is just to make sure that there is no unexpected side effect.
Just got back from vacation. I tried to reproduce these results, but it looks like our product kernel has changed enough that I am not able to replicate them exactly, and I don't recall the tree I ran them on. I will redo these tests and share my data in the next rev. Worst case I can probably drop this test, since there are other hackbench tests in this patch that show improvements. But I'll give it a shot to make sure there are no side effects from this. Thanks.
- Joel
On Mon, Dec 11, 2017 at 4:43 PM, Joel Fernandes joelaf@google.com wrote:
Hi Vincent,
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
Yes I'm not fully sure why 16 tasks didn't show that much improvement.
Yes. This is just to make sure that there is no unexpected side effect.
It could have been sloppy testing - I could have hit thermal throttling or forgotten to stop Android runtime before running the test. Looking at my old data, the case for 16 tasks has higher completion times than 32 tasks which doesn't make sense. Sorry about that. I was careful this time, I recreated the product tree and applied patch - ran the same test as in this patch, the data prefixed with "with" is with patch and "without" is without patch.
The naming of the Test column is "<test>-<numFDs>-<numGroups>". Data is completion time of hackbench in seconds.
RUN 1:

Test            Mean               Median             Stddev
with-f4-1g      0.67645 (+3.7%)    0.68000 (+3.8%)    0.025755
with-f4-2g      1.0685 (-0.3%)     1.0570 (+1%)       0.044122
with-f4-4g      1.7558 (+0.7%)     1.7685 (+0.08%)    0.096015

without-f4-1g   0.70255            0.70750            0.025330
without-f4-2g   1.0653             1.0680             0.040300
without-f4-4g   1.7688             1.7670             0.046341

RUN 2:

Test            Mean               Median             Stddev
with-f4-1g      0.68100 (+1%)      0.67800 (+2%)      0.025543
with-f4-2g      1.0242 (+1.5%)     1.0260 (+1.5%)     0.042886
with-f4-4g      1.6100 (+3%)       1.6075 (+3.7%)     0.052677

without-f4-1g   0.68840            0.69150            0.030988
without-f4-2g   1.0400             1.0420             0.034288
without-f4-4g   1.6636             1.6670             0.056963
Let me know what you think, thanks.
- Joel
Hi Joel,
On 13 December 2017 at 21:00, Joel Fernandes joelaf@google.com wrote:
On Mon, Dec 11, 2017 at 4:43 PM, Joel Fernandes joelaf@google.com wrote:
Hi Vincent,
Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms.
Hackbench shows a large improvement with 8 tasks (about 30%) and a smaller one with 32 tasks (11.6%).
Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.

+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks |   Without Patch   |        With Patch         |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
| 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
| 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
Yes I'm not fully sure why 16 tasks didn't show that much improvement.
Yes. This is just to make sure that there is no unexpected side effect.
It could have been sloppy testing - I could have hit thermal throttling or forgotten to stop Android runtime before running the test. Looking at my old data, the case for 16 tasks has higher completion times than 32 tasks which doesn't make sense. Sorry about that. I was careful this time, I recreated the product tree and applied patch - ran the same test as in this patch, the data prefixed with "with" is with patch and "without" is without patch.
The naming of the Test column is "<test>-<numFDs>-<numGroups>". Data is completion time of hackbench in seconds.
RUN 1:

Test            Mean               Median             Stddev
with-f4-1g      0.67645 (+3.7%)    0.68000 (+3.8%)    0.025755
with-f4-2g      1.0685 (-0.3%)     1.0570 (+1%)       0.044122
with-f4-4g      1.7558 (+0.7%)     1.7685 (+0.08%)    0.096015

without-f4-1g   0.70255            0.70750            0.025330
without-f4-2g   1.0653             1.0680             0.040300
without-f4-4g   1.7688             1.7670             0.046341

RUN 2:

Test            Mean               Median             Stddev
with-f4-1g      0.68100 (+1%)      0.67800 (+2%)      0.025543
with-f4-2g      1.0242 (+1.5%)     1.0260 (+1.5%)     0.042886
with-f4-4g      1.6100 (+3%)       1.6075 (+3.7%)     0.052677

without-f4-1g   0.68840            0.69150            0.030988
without-f4-2g   1.0400             1.0420             0.034288
without-f4-4g   1.6636             1.6670             0.056963
Let me know what you think, thanks.
The improvement has decreased compared to the previous results, and there is instability between your runs. As an example, run2 without the patch does better than run1 with the patch for 2g and 4g. Could you run the tests on an SMP Linux kernel instead of big/LITTLE Android, in order to have a saner test environment and remove some possible disturbances?
Vincent
- Joel
Hi Vincent, Thanks for your reply.
On Thu, Dec 14, 2017 at 7:46 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
Hi Joel,
On 13 December 2017 at 21:00, Joel Fernandes joelaf@google.com wrote:
On Mon, Dec 11, 2017 at 4:43 PM, Joel Fernandes joelaf@google.com wrote:
Hi Vincent,

> ------------------------------------------------------------
> Here we have RT activity running on big CPU cluster induced with rt-app,
> and running hackbench in parallel. The RT tasks are bound to 4 CPUs on
> the big cluster (cpu 4,5,6,7) and have 100ms periodicity with
> runtime=20ms sleep=80ms.
>
> Hackbench shows a large improvement with 8 tasks (about 30%) and a
> smaller one with 32 tasks (11.6%).
> Note: data is completion time in seconds (lower is better).
> Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.
> +--------+-----+-------+-------------------+---------------------------+
> | groups | fds | tasks |   Without Patch   |        With Patch         |
> +--------+-----+-------+---------+---------+-----------------+---------+
> |        |     |       | Mean    | Stdev   | Mean            | Stdev   |
> |        |     |       +-------------------+-----------------+---------+
> | 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
> | 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
> | 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
> +--------+-----+-------+---------+---------+-----------------+---------+
Out of curiosity, do you know why you don't see any improvement for 16 tasks but only for 8 and 32 tasks ?
Yes I'm not fully sure why 16 tasks didn't show that much improvement.
Yes. This is just to make sure that there is no unexpected side effect.
It could have been sloppy testing - I could have hit thermal throttling or forgotten to stop Android runtime before running the test. Looking at my old data, the case for 16 tasks has higher completion times than 32 tasks which doesn't make sense. Sorry about that. I was careful this time, I recreated the product tree and applied patch - ran the same test as in this patch, the data prefixed with "with" is with patch and "without" is without patch.
The naming of the Test column is "<test>-<numFDs>-<numGroups>". Data is completion time of hackbench in seconds.
RUN 1:

Test            Mean               Median             Stddev
with-f4-1g      0.67645 (+3.7%)    0.68000 (+3.8%)    0.025755
with-f4-2g      1.0685 (-0.3%)     1.0570 (+1%)       0.044122
with-f4-4g      1.7558 (+0.7%)     1.7685 (+0.08%)    0.096015

without-f4-1g   0.70255            0.70750            0.025330
without-f4-2g   1.0653             1.0680             0.040300
without-f4-4g   1.7688             1.7670             0.046341

RUN 2:

Test            Mean               Median             Stddev
with-f4-1g      0.68100 (+1%)      0.67800 (+2%)      0.025543
with-f4-2g      1.0242 (+1.5%)     1.0260 (+1.5%)     0.042886
with-f4-4g      1.6100 (+3%)       1.6075 (+3.7%)     0.052677

without-f4-1g   0.68840            0.69150            0.030988
without-f4-2g   1.0400             1.0420             0.034288
without-f4-4g   1.6636             1.6670             0.056963
Let me know what you think, thanks.
The improvement has decreased compared to previous results and there
Yes but the previous result was invalid as I mentioned, I controlled the environment better this time. Previous result showed 4g completed quicker than 2g which wasn't very meaningful.
is instability between your runs. As an example, run2 without the patch does better than run1 with the patch for 2g and 4g.
That's true. The improvement percent isn't stable.
Could you run the tests on an SMP Linux kernel instead of big/LITTLE Android, in order to have a saner test environment and remove some possible disturbances?
Would it be Ok with you if I just dropped this synthetic test from the patch since there are other hackbench results (case 3) from Rohit which are on SMP?
Thanks,
- Joel
On 14 December 2017 at 18:08, Joel Fernandes joelaf@google.com wrote:
Hi Vincent, Thanks for your reply.
On Thu, Dec 14, 2017 at 7:46 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
Hi Joel,
On 13 December 2017 at 21:00, Joel Fernandes joelaf@google.com wrote:
On Mon, Dec 11, 2017 at 4:43 PM, Joel Fernandes joelaf@google.com wrote:
Hi Vincent,

>
>> ------------------------------------------------------------
>> Here we have RT activity running on big CPU cluster induced with rt-app,
>> and running hackbench in parallel. The RT tasks are bound to 4 CPUs on
>> the big cluster (cpu 4,5,6,7) and have 100ms periodicity with
>> runtime=20ms sleep=80ms.
>>
>> Hackbench shows a large improvement with 8 tasks (about 30%) and a
>> smaller one with 32 tasks (11.6%).
>> Note: data is completion time in seconds (lower is better).
>> Number of loops for 8 and 16 tasks is 50000, and for 32 tasks it's 20000.
>> +--------+-----+-------+-------------------+---------------------------+
>> | groups | fds | tasks |   Without Patch   |        With Patch         |
>> +--------+-----+-------+---------+---------+-----------------+---------+
>> |        |     |       | Mean    | Stdev   | Mean            | Stdev   |
>> |        |     |       +-------------------+-----------------+---------+
>> | 1      | 8   | 8     | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
>> | 2      | 8   | 16    | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
>> | 4      | 8   | 32    | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
>> +--------+-----+-------+---------+---------+-----------------+---------+
>
> Out of curiosity, do you know why you don't see any improvement for
> 16 tasks but only for 8 and 32 tasks ?
Yes I'm not fully sure why 16 tasks didn't show that much improvement.
Yes. This is just to make sure that there is no unexpected side effect.
It could have been sloppy testing - I could have hit thermal throttling or forgotten to stop Android runtime before running the test. Looking at my old data, the case for 16 tasks has higher completion times than 32 tasks which doesn't make sense. Sorry about that. I was careful this time, I recreated the product tree and applied patch - ran the same test as in this patch, the data prefixed with "with" is with patch and "without" is without patch.
The naming of the Test column is "<test>-<numFDs>-<numGroups>". Data is completion time of hackbench in seconds.
RUN 1:

Test            Mean               Median             Stddev
with-f4-1g      0.67645 (+3.7%)    0.68000 (+3.8%)    0.025755
with-f4-2g      1.0685 (-0.3%)     1.0570 (+1%)       0.044122
with-f4-4g      1.7558 (+0.7%)     1.7685 (+0.08%)    0.096015

without-f4-1g   0.70255            0.70750            0.025330
without-f4-2g   1.0653             1.0680             0.040300
without-f4-4g   1.7688             1.7670             0.046341

RUN 2:

Test            Mean               Median             Stddev
with-f4-1g      0.68100 (+1%)      0.67800 (+2%)      0.025543
with-f4-2g      1.0242 (+1.5%)     1.0260 (+1.5%)     0.042886
with-f4-4g      1.6100 (+3%)       1.6075 (+3.7%)     0.052677

without-f4-1g   0.68840            0.69150            0.030988
without-f4-2g   1.0400             1.0420             0.034288
without-f4-4g   1.6636             1.6670             0.056963
Let me know what you think, thanks.
The improvement has decreased compared to previous results and there
Yes but the previous result was invalid as I mentioned, I controlled the environment better this time. Previous result showed 4g completed quicker than 2g which wasn't very meaningful.
Yes. It was just to highlight that we don't see improvements for this test anymore with new results
is instability between your runs. As an example, run2 without the patch does better than run1 with the patch for 2g and 4g.
That's true. The improvement percent isn't stable.
Could you run the tests on an SMP Linux kernel instead of big/LITTLE Android, in order to have a saner test environment and remove some possible disturbances?
Would it be Ok with you if I just dropped this synthetic test from the patch since there are other hackbench results (case 3) from Rohit which are on SMP?
Yes, you can probably remove it, as it shows no improvement and the other tests do show improvement.
Thanks,
- Joel
On 11/09/2017 06:52 PM, Joel Fernandes wrote:
[...]
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b8e749..ba9609407cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+	return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
 }

 /*
Looks good to me. Maybe you could mention in the patch header that you switch capacity_orig_of() for capacity_of(), since it's only a tiny diff in the hunk.
Reviewed-by: Dietmar Eggemann dietmar.eggemann@arm.com