o This patch series includes performance optimizations and some fixes. One main purpose is to resolve performance issues for multi-threading; this is done by patches 0001, 0003, 0005 and 0006. It also includes one main fix for the tipping point, in patch 0007.
o All these patches have been tested on the Juno R2 board. Especially for the performance optimization patches, the testing results are consistent and repeatable on the Juno board. This gives us more confidence to upstream these patches into the Android common kernel and the mainline kernel.
The testing environment is based on the ARM LT git tree:
  https://git.linaro.org/landing-teams/working/arm/kernel-release.git
  branch: origin/lsk-4.4-armlt-experimental
Test case: Geekbench with workload-automation
Test setting: echo 0 > /proc/sys/kernel/sched_migration_cost_ns
echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain1/busy_factor
o Test result:
Optimization with Patch 0001:
                   baseline    Patch 0001    Opt.
  Geekbench ST:    953.2       966.2         1.36%
  Geekbench MT:    2175.8      2280.8        4.83%

Optimization with Patch 0003:
                   baseline    Patch 0001+0003    Opt.
  Geekbench ST:    953.2       969.2              1.68%
  Geekbench MT:    2175.8      2356.8             8.32%

Optimization with all patches:
                   baseline    All Patches    Opt.
  Geekbench ST:    953.2       968.6          1.62%
  Geekbench MT:    2175.8      2371.2         8.98%
For the performance improvement, the three main contributing patches are: 0001: ~4.83%, 0003: ~3.3%, 0005: ~0.7%.
One thing also needs noting: usually sched_migration_cost_ns also has a big impact on multi-threading performance, but we cannot see a prominent boost on the Juno board; the main reason is that the Juno board has only 2 big cores.
o Compared to the RFCv4 version [1], I have dropped all power optimization related patches. Those patches are important for power saving, but they contain a lot of hard-coded logic and are not general enough. So I'd like to split them into a separate patch set.
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000543.html
Leo Yan (7):
  sched/fair: kick nohz idle balance for misfit task
  sched/fair: replace capacity_of by capacity_orig_of
  sched/fair: fall back to traditional wakeup migration when system is busy
  sched/fair: fix build error for schedtune_task_margin
  sched/fair: force load balance when busiest group is overloaded
  Documentation: use sysfs for EAS performance tunning
  sched/fair: consider CPU overutilized only when it is not idle
 Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++
 kernel/sched/fair.c                      | 57 +++++++++++++++++++++++++++-----
 2 files changed, 72 insertions(+), 9 deletions(-)
--
1.9.1
When a task is running on one CPU and its utilization exceeds 80% of the CPU's capacity, the task is considered a misfit task and should be migrated to a higher capacity CPU. But the running task can take more than 200ms to migrate, which is caused by the long latency to trigger load balance.
The latency is decided by two factors. The first factor is the time interval for the scheduling domain: busy_factor * balance_interval. By default the cluster's scheduling domain has busy_factor = 32 and balance_interval = 8ms, so the final latency is 256ms. If we set busy_factor to 1 from the sysfs node for every scheduling domain, this reduces the time interval for load balance.
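The interval arithmetic above can be sanity-checked with a tiny model (plain arithmetic, not kernel code; the function name is made up for illustration):

```python
# Worst-case periodic load-balance latency for a busy CPU, as described
# above: the domain is only rebalanced every busy_factor * balance_interval.

def balance_latency_ms(busy_factor, balance_interval_ms):
    """Illustrative model of the busy-CPU rebalance interval."""
    return busy_factor * balance_interval_ms

print(balance_latency_ms(32, 8))  # defaults: 256 ms
print(balance_latency_ms(1, 8))   # busy_factor = 1 via sysfs: 8 ms
```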
Besides this, the other factor is triggering active load balance for the running task; this can be done by kicking off an idle balance to pull the running task to a big core. The function nohz_kick_needed() checks whether an idle CPU needs to be woken up for idle balance, but the previous code had no check for a misfit task on the rq, so it would never trigger idle balance. As a result we can see the running task sticking on a LITTLE core for a long time.
This patch adds a check for a misfit task in nohz_kick_needed(), so that if there is a misfit task it can be quickly pulled to a higher capacity CPU.
Tested this patch with Geekbench on the ARM Juno R2 board for the multi-thread case; the score improves from 2176 to 2281, so performance improves by ~4.8%.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf56241..dedb3e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8939,6 +8939,10 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	    (!energy_aware() || cpu_overutilized(cpu)))
 		return true;
 
+	/* Do idle load balance if there is a misfit task */
+	if (energy_aware() && rq->misfit_task)
+		return true;
+
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 	if (sd && !energy_aware()) {
--
1.9.1
On 10-Oct 16:35, Leo Yan wrote:
When a task is running on one CPU and its utilization exceeds 80% of the CPU's capacity, the task is considered a misfit task and should be migrated to a higher capacity CPU. But the running task can take more than 200ms to migrate, which is caused by the long latency to trigger load balance.

The latency is decided by two factors. The first factor is the time interval for the scheduling domain: busy_factor * balance_interval. By default the cluster's scheduling domain has busy_factor = 32 and balance_interval = 8ms, so the final latency is 256ms. If we set busy_factor to 1 from the sysfs node for every scheduling domain, this reduces the time interval for load balance.

Besides this, the other factor is triggering active load balance for the running task; this can be done by kicking off an idle balance to pull the running task to a big core. The function nohz_kick_needed() checks whether an idle CPU needs to be woken up for idle balance, but the previous code had no check for a misfit task on the rq, so it would never trigger idle balance. As a result we can see the running task sticking on a LITTLE core for a long time.
In nohz_kick_needed we already check for cpu_overutilized, which in the "general case" (i.e. no boosting, no capping) should match misfit_task. I mean, when misfit_task is set the CPU is also always marked as overutilized, isn't it?
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
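A quick numeric sketch of this distinction (assuming the EAS-style fit test util * margin <= capacity * 1024 with margin = 1280, i.e. an ~80% threshold; the capacity and utilization numbers are made up for illustration):

```python
CAPACITY_MARGIN = 1280  # assumed ~80% threshold used by both checks

def fits_capacity(util, cap):
    """Model of the capacity-fit test: util fits if util * 1280 <= cap * 1024."""
    return util * CAPACITY_MARGIN <= cap * 1024

little_cap = 447  # illustrative LITTLE-core capacity
raw_util = 100    # small task: raw PELT utilization
boosted = 400     # same task after a SchedTune boost margin is added

# The raw utilization fits, so the CPU is not marked overutilized...
assert fits_capacity(raw_util, little_cap)
# ...but the boosted utilization no longer fits: the task is a misfit.
assert not fits_capacity(boosted, little_cap)
```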
Do you think that's the case captured by the following extra check condition?
This patch adds a check for a misfit task in nohz_kick_needed(), so that if there is a misfit task it can be quickly pulled to a higher capacity CPU.

Tested this patch with Geekbench on the ARM Juno R2 board for the multi-thread case; the score improves from 2176 to 2281, so performance improves by ~4.8%.
Signed-off-by: Leo Yan leo.yan@linaro.org
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf56241..dedb3e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8939,6 +8939,10 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	    (!energy_aware() || cpu_overutilized(cpu)))
 		return true;
 
+	/* Do idle load balance if there is a misfit task */
+	if (energy_aware() && rq->misfit_task)
+		return true;
+
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 	if (sd && !energy_aware()) {
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Tue, Oct 11, 2016 at 01:01:46PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
[...]
In nohz_kick_needed we already check for cpu_overutilized which in the "general case" (i.e. no boosting, not cappings) should match with misfit_task. I mean, when misfit_task is set the CPU is also always marked as overutilized, isn't it?
The checking code you mention is as below:
9341         if (rq->nr_running >= 2 &&
9342             (!energy_aware() || cpu_overutilized(cpu)))
9343                 return true;
So it must meet the condition of having at least two runnable tasks; if there is only one running task on the rq, it's hard to trigger a nohz idle balance. This is what this patch tries to fix.
Maybe I can change the code as below:
	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu))
		return true;
This will give more chances to migrate tasks to the big cores.
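The difference between the existing check and this proposal can be sketched as a small decision model (hypothetical helper names and plain booleans, not kernel code):

```python
def kick_existing(energy_aware, nr_running, overutilized):
    # Existing gate: nothing happens with fewer than two runnable tasks.
    return nr_running >= 2 and (not energy_aware or overutilized)

def kick_split(energy_aware, nr_running, overutilized):
    # Proposed split: an overutilized EAS CPU kicks even with one task.
    if nr_running >= 2 and not energy_aware:
        return True
    return energy_aware and overutilized

# One overutilized CPU-bound task on a LITTLE core under EAS:
assert not kick_existing(True, 1, True)  # never fires -> task sticks
assert kick_split(True, 1, True)         # fires, so it can be pulled up
```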
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
Do you think that's the case captured by the following extra check condition?
[...]
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf56241..dedb3e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8939,6 +8939,10 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	    (!energy_aware() || cpu_overutilized(cpu)))
 		return true;
 
+	/* Do idle load balance if there is a misfit task */
+	if (energy_aware() && rq->misfit_task)
+		return true;
+
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 	if (sd && !energy_aware()) {
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 11-Oct 22:14, Leo Yan wrote:
Hi Patrick,
On Tue, Oct 11, 2016 at 01:01:46PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
[...]
In nohz_kick_needed we already check for cpu_overutilized which in the "general case" (i.e. no boosting, not cappings) should match with misfit_task. I mean, when misfit_task is set the CPU is also always marked as overutilized, isn't it?
The checking code you mention is as below:
9341         if (rq->nr_running >= 2 &&
9342             (!energy_aware() || cpu_overutilized(cpu)))
9343                 return true;
So it must meet the condition of having at least two runnable tasks; if there is only one running task on the rq, it's hard to trigger a nohz idle balance. This is what this patch tries to fix.
Maybe I can change code as below:
	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu))
		return true;
This will give more chances to migrate tasks to the big cores.
That seems better to me, however in that case we are missing the opportunity to move boosted tasks.
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
I'm referring to this last point of my previous comment.
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
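Modelling this variant next to the earlier nr_running-based split shows the extra case it covers (hypothetical helper names; rq state reduced to plain booleans):

```python
def kick_combined(energy_aware, nr_running, overutilized, misfit):
    # This suggestion: under EAS, either condition triggers the kick.
    if energy_aware:
        return overutilized or misfit
    return nr_running >= 2

def kick_split(energy_aware, nr_running, overutilized):
    # The earlier suggestion, without the misfit check.
    if nr_running >= 2 and not energy_aware:
        return True
    return energy_aware and overutilized

# A small boosted task: the CPU is not overutilized, but the task is a
# misfit -- only the combined condition moves it.
assert kick_combined(True, 1, False, True)
assert not kick_split(True, 1, False)
```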
Do you think that's the case captured by the following extra check condition?
[...]
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf56241..dedb3e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8939,6 +8939,10 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	    (!energy_aware() || cpu_overutilized(cpu)))
 		return true;
 
+	/* Do idle load balance if there is a misfit task */
+	if (energy_aware() && rq->misfit_task)
+		return true;
+
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
 	if (sd && !energy_aware()) {
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
-- #include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Tue, Oct 11, 2016 at 05:36:46PM +0100, Patrick Bellasi wrote:
[...]
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
I'm referring to this last point of my previous comment.
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
Sorry I missed this point in my previous reply :) You are right, I will work out a new patch with your suggestion.
Thanks, Leo Yan
On 10/11/2016 05:43 PM, Leo Yan wrote:
Hi Patrick,
On Tue, Oct 11, 2016 at 05:36:46PM +0100, Patrick Bellasi wrote:
[...]
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
I'm referring to this last point of my previous comment.
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
I ran into the same problem while testing upmigration latency with a single CPU-bound task. The above fix suggested by Patrick, combined with Leo's 'sched/fair: replace capacity_of by capacity_orig_of', fixed my problem and reduced upmigration latency from ~1 sec to ~250ms. ~250ms is still huge, but the CPU became overutilized only after a ~210ms ramp-up time in my setup, so that's a separate problem.
So please feel free to
Tested-by: Joonwoo Park joonwoop@codeaurora.org
Thanks! Joonwoo
Sorry I missed this point in my previous reply :) You are right, I will work out a new patch with your suggestion.
Thanks,
Leo Yan

_______________________________________________
eas-dev mailing list
eas-dev@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/eas-dev
Hi Joonwoo,
On Fri, Oct 21, 2016 at 02:06:14PM -0700, Joonwoo Park wrote:
On 10/11/2016 05:43 PM, Leo Yan wrote:
Hi Patrick,
On Tue, Oct 11, 2016 at 05:36:46PM +0100, Patrick Bellasi wrote:
[...]
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
I'm referring to this last point of my previous comment.
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
I ran into the same problem while testing upmigration latency with a single CPU-bound task. The above fix suggested by Patrick, combined with Leo's 'sched/fair: replace capacity_of by capacity_orig_of', fixed my problem and reduced upmigration latency from ~1 sec to ~250ms.
I found Patrick's fix introduces significant rescheduling IPIs; this is because after a big core becomes "overutilized" it kicks off nohz idle balance even when the big core has only one running task. Comparing Patrick's suggestion with my v1 patch, "Rescheduling interrupts" increase by > 50%. As a side effect, this also harms energy by waking up CPUs.
So I prefer to go back to the v1 patch; I need Patrick's review on whether this is okay or not.
~250ms is still huge, but the CPU became overutilized only after a ~210ms ramp-up time in my setup, so that's a separate problem.
The ramp-up time is longer than expected; at the highest OPP, util_avg takes 31ms to reach the 50% level and 74ms to reach 80%: http://people.linaro.org/~leo.yan/eas_profiling/pelt/pelt_up_down_y%5e32_0.5...
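These numbers line up with the PELT geometry: with a per-millisecond decay factor y chosen so that y^32 = 0.5, a task running flat out ramps as 1 - y^n after n periods. A quick check (a math model, not kernel code):

```python
import math

y = 0.5 ** (1 / 32)  # PELT decay per 1 ms period: y^32 = 0.5

def ms_to_reach(frac):
    """Periods (~ms) for an always-running task's util to reach frac of max."""
    return math.log(1 - frac, y)

print(round(ms_to_reach(0.5)))  # 32, close to the measured 31 ms
print(round(ms_to_reach(0.8)))  # 74, matching the measured 74 ms
```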
So which CPUFreq governor are you using? When I used the "ondemand" governor with a long sampling window (80ms), I saw similar behaviour.
So please feel free to
Tested-by: Joonwoo Park joonwoop@codeaurora.org
Thanks for testing, will add.
Thanks, Leo Yan
On 10/22/2016 07:45 AM, Leo Yan wrote:
Hi Joonwoo,
On Fri, Oct 21, 2016 at 02:06:14PM -0700, Joonwoo Park wrote:
On 10/11/2016 05:43 PM, Leo Yan wrote:
Hi Patrick,
On Tue, Oct 11, 2016 at 05:36:46PM +0100, Patrick Bellasi wrote:
[...]
Actually, task_fits_max checks for the task fitting the _maximum_ capacity available in the system, which is tracked at root SD level. Thus, it normally checks if a task fits the 1024 (minus margin) capacity.
AFAIKS, the main difference between cpu_overutilized and misfit_task is that this last (only) considers the "boosted" task utilization.
Thus, while a small boosted task does not mark a CPU as overutilized, the same task can still be marked as a misfitting one.
I'm referring to this last point of my previous comment.
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
I ran into the same problem while testing upmigration latency with a single CPU-bound task. The above fix suggested by Patrick, combined with Leo's 'sched/fair: replace capacity_of by capacity_orig_of', fixed my problem and reduced upmigration latency from ~1 sec to ~250ms.
I found Patrick's fix introduces significant rescheduling IPIs; this is because after a big core becomes "overutilized" it kicks off nohz idle balance even when the big core has only one running task. Comparing Patrick's suggestion with my v1 patch, "Rescheduling interrupts" increase by > 50%. As a side effect, this also harms energy by waking up CPUs.
So I prefer to go back to the v1 patch; I need Patrick's review on whether this is okay or not.
Okay. I haven't tried your original patch yet. I happened to make the fix below, which addressed my test case; then I found Patrick's suggestion and confirmed it also did the same job.
	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu))
		return true;
Looking at your original patch again, I'm a bit worried about upmigration latency, since misfit_task will only be set from the scheduler tick path if there is one CPU-bound task running without a new wakeup. I will see what Patrick thinks and do another test.
~250ms is still huge, but the CPU became overutilized only after a ~210ms ramp-up time in my setup, so that's a separate problem.
The ramp-up time is longer than expected; at the highest OPP, util_avg takes 31ms to reach the 50% level and 74ms to reach 80%: http://people.linaro.org/~leo.yan/eas_profiling/pelt/pelt_up_down_y%5e32_0.5...
This is with the upstream PELT decaying factor? Interestingly, I even have the PELT half-life change running and still see a longer ramp-up time. So mine should have ramped even faster than yours.
So which CPUFreq governor are you using? When I used the "ondemand" governor with a long sampling window (80ms), I saw similar behaviour.
I'm using sched-freq.
Thanks, Joonwoo
So please feel free to
Tested-by: Joonwoo Park joonwoop@codeaurora.org
Thanks for testing, will add.
Thanks, Leo Yan
Hi Joonwoo,
On Mon, Oct 24, 2016 at 02:34:51PM -0700, Joonwoo Park wrote:
[...]
Boosted utilization does not mark a CPU overutilized, thus we should use task_misfits as well to move these tasks.
What about:
	if (energy_aware())
		return (cpu_overutilized(cpu) || rq->misfit_task);
	else if (rq->nr_running >= 2)
		return true;
I ran into the same problem while testing upmigration latency with a single CPU-bound task. The above fix suggested by Patrick, combined with Leo's 'sched/fair: replace capacity_of by capacity_orig_of', fixed my problem and reduced upmigration latency from ~1 sec to ~250ms.
I found Patrick's fix introduces significant rescheduling IPIs; this is because after a big core becomes "overutilized" it kicks off nohz idle balance even when the big core has only one running task. Comparing Patrick's suggestion with my v1 patch, "Rescheduling interrupts" increase by > 50%. As a side effect, this also harms energy by waking up CPUs.
So I prefer to go back to the v1 patch; I need Patrick's review on whether this is okay or not.
Okay. I haven't tried your original patch yet. I happened to make the fix below, which addressed my test case; then I found Patrick's suggestion and confirmed it also did the same job.
	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu))
		return true;
Looking at your original patch again, I'm a bit worried about upmigration latency, since misfit_task will only be set from the scheduler tick path if there is one CPU-bound task running without a new wakeup. I will see what Patrick thinks and do another test.
Yeah, so how about the code below:
	int max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;

	if (rq->nr_running >= 2 && !energy_aware())
		return true;

	if (energy_aware() && cpu_overutilized(cpu)) {
		/*
		 * The CPU already has the highest capacity, so with a
		 * single task it's pointless to trigger load balance.
		 * If the CPU has >= 2 tasks, kick load balance to
		 * spread tasks as much as possible.
		 */
		if ((capacity_orig_of(cpu) == max_cap) && rq->nr_running >= 2)
			return true;

		/*
		 * For a low capacity CPU, always kick off load balance;
		 * if it has only one task, it may be migrated to a
		 * higher capacity CPU.
		 */
		if (capacity_orig_of(cpu) < max_cap)
			return true;
	}

	/*
	 * Still check the misfit flag; this flag is set when a runnable
	 * task becomes the running task, so it gives more chances to
	 * kick load balance once the task is running.
	 */
	if (energy_aware() && rq->misfit_task)
		return true;
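The intent of the snippet above can be modelled as a plain decision function (booleans instead of rq state; hypothetical names, not kernel code):

```python
def nohz_kick(energy_aware, nr_running, overutilized, has_max_cap, misfit):
    if nr_running >= 2 and not energy_aware:
        return True
    if energy_aware and overutilized:
        # Highest-capacity CPU: only kick when there are tasks to spread.
        if has_max_cap and nr_running >= 2:
            return True
        # Lower-capacity CPU: always kick; a single task may upmigrate.
        if not has_max_cap:
            return True
    return energy_aware and misfit

# The IPI-storm case from the thread: one task saturating a big core
# no longer kicks a pointless idle balance...
assert not nohz_kick(True, 1, True, True, False)
# ...while a single overutilized task on a LITTLE core still does.
assert nohz_kick(True, 1, True, False, False)
```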
~250ms is still huge, but the CPU became overutilized only after a ~210ms ramp-up time in my setup, so that's a separate problem.
The ramp-up time is longer than expected; at the highest OPP, util_avg takes 31ms to reach the 50% level and 74ms to reach 80%: http://people.linaro.org/~leo.yan/eas_profiling/pelt/pelt_up_down_y%5e32_0.5...
This is with the upstream PELT decaying factor? Interestingly, I even
Yes.
have the PELT half-life change running and still see a longer ramp-up time. So mine should have ramped even faster than yours.
This is weird; before, I also generated a similar patch and saw the PELT signal ramp up in a shorter time. I pasted the modification in case you are interested:
---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1bb7efd..4819e17 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -665,9 +665,9 @@ static unsigned long task_h_load(struct task_struct *p);
  * Note: The tables runnable_avg_yN_inv and runnable_avg_yN_sum are
  * dependent on this value.
  */
-#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
-#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+#define LOAD_AVG_PERIOD 16
+#define LOAD_AVG_MAX 24130 /* maximum possible load avg */
+#define LOAD_AVG_MAX_N 174 /* number of full periods to produce LOAD_AVG_MAX */
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -2449,12 +2449,9 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
 static const u32 runnable_avg_yN_inv[] = {
-	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
-	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
-	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
-	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
-	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
-	0x85aac367, 0x82cd8698,
+	0xffffffff, 0xf5257d14, 0xeac0c6e6, 0xe0ccdeeb, 0xd744fcc9, 0xce248c14,
+	0xc5672a10, 0xbd08a39e, 0xb504f333, 0xad583ee9, 0xa5fed6a9, 0x9ef5325f,
+	0x9837f050, 0x91c3d373, 0x8b95c1e3, 0x85aac367,
 };
 
 /*
@@ -2462,9 +2459,8 @@ static const u32 runnable_avg_yN_inv[] = {
  * over-estimates when re-combining.
  */
 static const u32 runnable_avg_yN_sum[] = {
-	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
-	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
-	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+	    0,  980, 1919, 2818, 3679, 4503, 5292, 6048, 6772, 7465, 8129,
+	 8764, 9373, 9956,10514,11048,11560,
 };
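An independent sanity check on the halved-period constants above (not from the thread; it assumes the standard PELT generation where y^PERIOD = 0.5, inv[n] = 2^32 * y^n, and the steady-state maximum is roughly 1024 / (1 - y)):

```python
y16 = 0.5 ** (1 / 16)  # proposed decay: half-life of 16 periods
y32 = 0.5 ** (1 / 32)  # upstream decay: half-life of 32 periods

# The shortened inverse table from the diff above.
new_inv = [0xffffffff, 0xf5257d14, 0xeac0c6e6, 0xe0ccdeeb,
           0xd744fcc9, 0xce248c14, 0xc5672a10, 0xbd08a39e,
           0xb504f333, 0xad583ee9, 0xa5fed6a9, 0x9ef5325f,
           0x9837f050, 0x91c3d373, 0x8b95c1e3, 0x85aac367]

# Each entry matches 2^32 * y^n up to rounding...
for n, v in enumerate(new_inv):
    assert abs(v - (1 << 32) * y16 ** n) < 2

# ...and both LOAD_AVG_MAX values track ~1024 / (1 - y):
assert abs(1024 / (1 - y32) - 47742) / 47742 < 0.005
assert abs(1024 / (1 - y16) - 24130) / 24130 < 0.005
```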
So which CPUFreq governor are you using? When I used the "ondemand" governor with a long sampling window (80ms), I saw similar behaviour.
I'm using sched-freq.
I'm not familiar with sched-freq related tuning, but you can use the "performance" governor to check whether PELT ramps up as expected; after that you can analyze whether sched-freq keeps the CPU at a low frequency for a long time and so suppresses PELT's ramp-up.
Thanks, Leo Yan
To migrate a task to a higher capacity CPU, the scheduler needs to distinguish whether a CPU's capacity is higher or lower. The function capacity_of() returns the CPU capacity reduced by the amount occupied by the RT and DL classes, so even if two CPUs have the same capacity this function can return two different values, making them look like they have different capacities.
This introduces unnecessary active load balance for task migration within the same cluster. So change to use capacity_orig_of() instead, which returns a consistent value: the CPU's original capacity.
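The effect can be sketched with a toy model (dictionary stand-ins for the rq/capacity state; illustrative numbers, not kernel code):

```python
def capacity_orig_of(cpu):
    return cpu["cap_orig"]

def capacity_of(cpu):
    # Original capacity minus the part occupied by RT/DL classes.
    return cpu["cap_orig"] - cpu["rt_dl_util"]

a = {"cap_orig": 447, "rt_dl_util": 50}  # same cluster, some RT load
b = {"cap_orig": 447, "rt_dl_util": 0}

# By original capacity the CPUs are identical...
assert capacity_orig_of(a) == capacity_orig_of(b)
# ...but capacity_of() makes a look "smaller" than b, which can trigger
# a spurious intra-cluster "migrate to higher capacity CPU" decision.
assert capacity_of(a) < capacity_of(b)
```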
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dedb3e0..f2ab238 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8028,12 +8028,11 @@ static int need_active_balance(struct lb_env *env)
 		return 1;
 	}
 
-	if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
-	    env->src_rq->cfs.h_nr_running == 1 &&
-	    cpu_overutilized(env->src_cpu) &&
-	    !cpu_overutilized(env->dst_cpu)) {
-		return 1;
-	}
+	if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
+	    env->src_rq->cfs.h_nr_running == 1 &&
+	    cpu_overutilized(env->src_cpu) &&
+	    !cpu_overutilized(env->dst_cpu))
+		return 1;
 
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
--
1.9.1
On 10/10/2016 01:35 AM, Leo Yan wrote:
To migrate a task to a higher capacity CPU, the scheduler needs to distinguish whether a CPU's capacity is higher or lower. The function capacity_of() returns the CPU capacity reduced by the amount occupied by the RT and DL classes, so even if two CPUs have the same capacity this function can return two different values, making them look like they have different capacities.

This introduces unnecessary active load balance for task migration within the same cluster. So change to use capacity_orig_of() instead, which returns a consistent value: the CPU's original capacity.
This fixed an issue I had where meaningless active migrations happened among the little CPUs while running a simple CPU-bound task. Thanks!
Signed-off-by: Leo Yan leo.yan@linaro.org
 kernel/sched/fair.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dedb3e0..f2ab238 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8028,12 +8028,11 @@ static int need_active_balance(struct lb_env *env)
 		return 1;
 	}
 
-	if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
-	    env->src_rq->cfs.h_nr_running == 1 &&
-	    cpu_overutilized(env->src_cpu) &&
-	    !cpu_overutilized(env->dst_cpu)) {
-		return 1;
-	}
+	if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
Initially I thought we should have both of them, like:
if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
    (capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
env->src_rq->cfs.h_nr_running == 1 &&
cpu_overutilized(env->src_cpu) &&
!cpu_overutilized(env->dst_cpu))
But I think your version is good enough, since it always makes sure the dst CPU has more spare capacity than the src after taking account of rt/dl task loads.
return 1;
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
BTW, I think we have a potential issue here when the max capacity delta between little and big is large. For example, little cap = 1024 and big cap = 8192. If I'm not mistaken, each little and big CPU will be marked overutilized = true when its spare capacity drops to ~204 and ~1638 respectively (20% margin). At present, we don't upmigrate a task with load = 820 (1024 - 204) from a little CPU to a big one even though the big CPU has enough spare capacity to take the task, since the big CPU is marked as overutilized too. We might want to run this on the big CPU? I haven't seen such an SoC so this is just speculation though.
Thanks, Joonwoo
Hi Joonwoo,
On Fri, Oct 21, 2016 at 04:37:33PM -0700, Joonwoo Park wrote:
On 10/10/2016 01:35 AM, Leo Yan wrote:
When migrating a task to a higher-capacity CPU, the scheduler needs to distinguish whether one CPU's capacity is higher or lower than another's. If we use capacity_of(), it returns the CPU capacity left after subtracting the utilization occupied by the RT and DL classes, so even two CPUs with the same capacity can return two different values and look as if they have different capacities.
This introduces unnecessary active load balancing for task migration within the same cluster. So change to use capacity_orig_of() instead, which returns a consistent value for the CPU's original capacity.
This fixes an issue I was seeing where meaningless active migrations happen among the little CPUs while running a simple CPU-bound task. Thanks!
Also thanks a lot for your testing :)
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dedb3e0..f2ab238 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8028,12 +8028,11 @@ static int need_active_balance(struct lb_env *env) return 1; }
- if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
env->src_rq->cfs.h_nr_running == 1 &&
cpu_overutilized(env->src_cpu) &&
!cpu_overutilized(env->dst_cpu)) {
return 1;
- }
- if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
Initially I thought we should have both of them, like:
if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
    (capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
    env->src_rq->cfs.h_nr_running == 1 &&
    cpu_overutilized(env->src_cpu) &&
    !cpu_overutilized(env->dst_cpu))
But I think your version is good enough, since it always makes sure the dst CPU has more spare capacity than the src after taking account of rt/dl task loads.
Yeah, cpu_overutilized() already takes account of rt/dl task loads.
return 1;
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
BTW, I think we have a potential issue here when the max capacity delta between little and big is large. For example, little cap = 1024 and big cap = 8192.
EAS will normalize CPU capacity to the range [0..1024], so here I think you mean little cap = 128 and big cap = 1024 for a single CPU, right?
If I'm not mistaken, each little and big CPU will be marked overutilized = true when its spare capacity drops to ~204 and ~1638 respectively (20% margin). At present, we don't upmigrate a task with load = 820 (1024 - 204) from a little CPU to a big one even though the big CPU has enough spare capacity to take the task, since the big CPU is marked as overutilized too. We might want to run this on the big CPU?
I haven't seen such an SoC so this is just speculation though.
Are you suggesting some code like below:
static unsigned long cpu_spare_capacity(int cpu)
{
	return max((capacity_of(cpu) - cpu_util(cpu)), 0);
}
if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
    env->src_rq->cfs.h_nr_running == 1 &&
    cpu_overutilized(env->src_cpu) &&
    cpu_util(env->src_cpu) < cpu_spare_capacity(env->dst_cpu))
	return 1;
So cpu_util(env->src_cpu) = 128 * 80% = 102 and cpu_spare_capacity(env->dst_cpu) = 1024 * 20% = 204; that means even if the big core is overutilized, it still has more spare capacity than the little core's whole utilization.
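For what it's worth, the arithmetic can be sanity-checked with a small sketch (the helper names are mine, not the kernel's; capacity_margin = 1280 is assumed, which matches the ~20% margin discussed above):

```python
# Hypothetical sanity check of the numbers above. With capacity_margin = 1280,
# a CPU counts as overutilized once util exceeds capacity * 1024 / 1280 (~80%).
CAPACITY_MARGIN = 1280

def cpu_overutilized(capacity, util):
    # Mirrors the cpu_overutilized() comparison quoted in this thread.
    return capacity * 1024 < util * CAPACITY_MARGIN

little_cap, big_cap = 128, 1024       # normalized caps for the 8:1 example

little_util = little_cap * 80 // 100  # ~102: little CPU's overutilized mark
big_spare = big_cap * 20 // 100       # ~204: big CPU's spare at its own mark

print(little_util, big_spare)   # -> 102 204
# Even an overutilized big CPU keeps more spare capacity (~204) than the
# whole utilization of the overutilized little CPU (~102):
print(little_util < big_spare)  # -> True
```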
Dietmar, what do you think about this? This code was originally introduced by your patch "sched: Enable idle balance to pull single task towards cpu with higher capacity", so I'd like to get your suggestion.
Thanks, Leo Yan
On 10/22/2016 07:28 AM, Leo Yan wrote:
Hi Joonwoo,
On Fri, Oct 21, 2016 at 04:37:33PM -0700, Joonwoo Park wrote:
On 10/10/2016 01:35 AM, Leo Yan wrote:
When migrating a task to a higher-capacity CPU, the scheduler needs to distinguish whether one CPU's capacity is higher or lower than another's. If we use capacity_of(), it returns the CPU capacity left after subtracting the utilization occupied by the RT and DL classes, so even two CPUs with the same capacity can return two different values and look as if they have different capacities.
This introduces unnecessary active load balancing for task migration within the same cluster. So change to use capacity_orig_of() instead, which returns a consistent value for the CPU's original capacity.
This fixes an issue I was seeing where meaningless active migrations happen among the little CPUs while running a simple CPU-bound task. Thanks!
Also thanks a lot for your testing :)
My pleasure :)
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dedb3e0..f2ab238 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8028,12 +8028,11 @@ static int need_active_balance(struct lb_env *env) return 1; }
- if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
env->src_rq->cfs.h_nr_running == 1 &&
cpu_overutilized(env->src_cpu) &&
!cpu_overutilized(env->dst_cpu)) {
return 1;
- }
- if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
Initially I thought we should have both of them, like:
if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
    (capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
    env->src_rq->cfs.h_nr_running == 1 &&
    cpu_overutilized(env->src_cpu) &&
    !cpu_overutilized(env->dst_cpu))
But I think your version is good enough, since it always makes sure the dst CPU has more spare capacity than the src after taking account of rt/dl task loads.
Yeah, cpu_overutilized() already takes account of rt/dl task loads.
return 1;
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
BTW, I think we have a potential issue here when the max capacity delta between little and big is large. For example, little cap = 1024 and big cap = 8192.
EAS will normalize CPU capacity to the range [0..1024], so here I think you mean little cap = 128 and big cap = 1024 for a single CPU, right?
Yes. Thanks for the correction.
If I'm not mistaken, each little and big CPU will be marked overutilized = true when its spare capacity drops to ~204 and ~1638 respectively (20% margin). At present, we don't upmigrate a task with load = 820 (1024 - 204) from a little CPU to a big one even though the big CPU has enough spare capacity to take the task, since the big CPU is marked as overutilized too. We might want to run this on the big CPU?
I haven't seen such an SoC so this is just speculation though.
Are you suggesting some code like below:
static unsigned long cpu_spare_capacity(int cpu)
{
	return max((capacity_of(cpu) - cpu_util(cpu)), 0);
}
if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
    env->src_rq->cfs.h_nr_running == 1 &&
    cpu_overutilized(env->src_cpu) &&
    cpu_util(env->src_cpu) < cpu_spare_capacity(env->dst_cpu))
	return 1;
So cpu_util(env->src_cpu) = 128 * 80% = 102 and cpu_spare_capacity(env->dst_cpu) = 1024 * 20% = 204; that means even if the big core is overutilized, it still has more spare capacity than the little core's whole utilization.
This is exactly what I meant.
Thanks, Joonwoo
Dietmar, what do you think about this? This code was originally introduced by your patch "sched: Enable idle balance to pull single task towards cpu with higher capacity", so I'd like to get your suggestion.
Thanks,
Leo Yan

_______________________________________________
eas-dev mailing list
eas-dev@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/eas-dev
When a task is woken up, one significant change in energy-aware scheduling is to set 'want_affine'; the purpose is to use the energy-aware path for CPU selection while the system is under the tipping point, or to select an idle sibling CPU once the system is over the tipping point. For idle sibling CPU selection, only idle CPUs in the first-level sched domain are considered.
As a result, if many big tasks are running, the scheduler has no chance to migrate some of these tasks across a higher-level sched domain. The tasks are therefore hard to migrate to CPUs in another cluster, so one cluster packs many tasks while another cluster sits idle; this finally harms performance in the multi-threading case.
This patch adds more checks for 'want_affine'. If all CPUs in the highest-capacity cluster have tasks running on them, then we need to consider falling back to the traditional wakeup migration path, which will select the most idle CPU in the system. This gives more chances to migrate a task to an idle CPU, finally decreasing scheduling latency and improving performance.
Tested this patch with Geekbench on the ARM Juno R2 board in the multi-thread case: the score improves from 2281 to 2357, a performance improvement of ~3.3%.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f2ab238..16eb48d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5328,6 +5328,34 @@ static bool cpu_overutilized(int cpu)
 	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
 }
+static bool need_want_affine(struct task_struct *p, int cpu)
+{
+	int capacity = capacity_orig_of(cpu);
+	int max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity.val;
+	unsigned long margin = schedtune_task_margin(p);
+	struct sched_domain *sd;
+	int affine = 0, i;
+
+	if (margin)
+		return 1;
+
+	if (capacity != max_capacity)
+		return 1;
+
+	sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return 1;
+
+	for_each_cpu(i, sched_domain_span(sd)) {
+		if (idle_cpu(i)) {
+			affine = 1;
+			break;
+		}
+	}
+
+	return affine;
+}
+
 #ifdef CONFIG_SCHED_TUNE
 static long
@@ -5891,7 +5919,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	if (sd_flag & SD_BALANCE_WAKE)
 		want_affine = (!wake_wide(p) && task_fits_max(p, cpu) &&
 			       cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) ||
-			       energy_aware();
+			       (energy_aware() && need_want_affine(p, cpu));
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
@@ -8030,9 +8058,14 @@ static int need_active_balance(struct lb_env *env)
 	if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
 	    env->src_rq->cfs.h_nr_running == 1 &&
-	    cpu_overutilized(env->src_cpu) &&
-	    !cpu_overutilized(env->dst_cpu))
-		return 1;
+	    cpu_overutilized(env->src_cpu)) {
+
+		if (idle_cpu(env->dst_cpu))
+			return 1;
+
+		if (!idle_cpu(env->dst_cpu) && !cpu_overutilized(env->dst_cpu))
+			return 1;
+	}
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
--
1.9.1
Fix minor errors in the function declaration so the kernel can build successfully.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16eb48d..924adec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4198,6 +4198,9 @@ static inline void hrtick_update(struct rq *rq)
 #endif
 #ifdef CONFIG_SMP
+
+static inline long schedtune_task_margin(struct task_struct *task);
+
 static bool cpu_overutilized(int cpu);
 static inline unsigned long boosted_cpu_util(int cpu);
 #else
@@ -5432,7 +5435,7 @@ schedtune_cpu_margin(unsigned long util, int cpu)
 	return 0;
 }
-static inline int
+static inline long
 schedtune_task_margin(struct task_struct *task)
 {
 	return 0;
--
1.9.1
When executing load balance, the scheduler calculates whether the local sched group and the busiest sched group are overloaded or not. If the local group has spare capacity and the busiest group is overloaded, the scheduler forces load balancing between these two groups, but only for a newly idle CPU. So in the idle balance case on a CPU that is not newly idle, the scheduler does not force load balancing.
Usually this does not introduce an issue on SMP platforms, because the code below later performs further checks on the average load of the sched groups:
	/*
	 * If the local group is busier than the selected busiest group
	 * don't try and pull any tasks.
	 */
	if (local->avg_load >= busiest->avg_load)
		goto out_balanced;
	/*
	 * Don't pull any tasks if this group is already above the domain
	 * average load.
	 */
	if (local->avg_load >= sds.avg_load)
		goto out_balanced;
But for asymmetric capacity architectures (like ARM big.LITTLE systems), this introduces a corner case: the local sched group is the LITTLE cluster with lower capacity and the busiest sched group is the big cluster with higher capacity. The busiest group is overloaded and the local group has spare capacity, but because the LITTLE cores have a smaller capacity value, the avg_load calculation gives the LITTLE group a higher average load than the big group. As a result, the scheduler skips load balancing between the two sched groups, missing the chance to migrate tasks from the big cluster to the LITTLE cluster when the big cluster is overloaded and the LITTLE cluster has idle CPUs.
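As a rough illustration with made-up numbers (avg_load computed as group_load * SCHED_CAPACITY_SCALE / group_capacity, the way find_busiest_group()'s statistics are built; the capacities and loads here are hypothetical):

```python
SCHED_CAPACITY_SCALE = 1024

def avg_load(group_load, group_capacity):
    # Load normalized by the group's total capacity, as in the lb statistics.
    return group_load * SCHED_CAPACITY_SCALE // group_capacity

little_capacity = 4 * 430   # local group: 4 LITTLE CPUs, two of them idle
big_capacity = 4 * 1024     # busiest group: 4 big CPUs, overloaded

local_avg = avg_load(2048, little_capacity)   # 2 busy tasks -> 1219
busiest_avg = avg_load(4500, big_capacity)    # 5 busy tasks -> 1125

# local->avg_load >= busiest->avg_load, so the balancer takes the
# "goto out_balanced" path and never pulls from the overloaded big group,
# even though the LITTLE group still has idle CPUs.
print(local_avg, busiest_avg, local_avg >= busiest_avg)  # -> 1219 1125 True
```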
So this patch checks whether this is an idle balance, the local sched group has spare capacity, and the busiest sched group is overloaded; if so, it forces load balancing between them.
Tested this patch with Geekbench on the ARM Juno R2 board in the multi-thread case: the score improves from 2357 to 2371, a performance improvement of ~0.7%.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 924adec..937eca2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7892,8 +7892,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
-	    busiest->group_no_capacity)
+	if ((env->idle == CPU_NEWLY_IDLE || env->idle == CPU_IDLE) &&
+	    group_has_capacity(env, local) && busiest->group_no_capacity)
 		goto force_balance;
 	/* Misfitting tasks should be dealt with regardless of the avg load */
--
1.9.1
Add two extra performance optimization methods by setting sysfs nodes:
- Method 1: set sched_migration_cost_ns to 0:
By default sched_migration_cost_ns = 50000; the scheduler calls can_migrate_task() to check whether tasks are cache hot or not, comparing against sched_migration_cost_ns to avoid migrating tasks too frequently.
This has the side effect of easily packing tasks onto the same CPU and introduces latency when spreading tasks across multiple cores, especially since energy-aware scheduling tends to pack tasks onto a single CPU. So after tasks have been packed onto one CPU with high utilization, we can easily spread them out once we set sched_migration_cost_ns to 0.
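For reference, the cache-hot test works roughly like this (a simplified sketch of task_hot(), which can_migrate_task() consults; the real kernel code compares rq_clock_task() minus the task's se.exec_start against sysctl_sched_migration_cost):

```python
# Simplified model of the cache-hot test: a task is considered cache hot
# (and thus skipped for migration) if it last ran within
# sched_migration_cost_ns nanoseconds.
def task_is_cache_hot(now_ns, exec_start_ns, migration_cost_ns):
    delta = now_ns - exec_start_ns
    return delta < migration_cost_ns

# With the cost from the commit message, a task that ran 20us ago is
# considered cache hot and cannot be migrated:
print(task_is_cache_hot(1_000_000, 980_000, 50_000))  # -> True
# With the cost set to 0, no task is ever considered cache hot:
print(task_is_cache_hot(1_000_000, 980_000, 0))       # -> False
```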
- Method 2: set busy_factor to 1:
This decreases the load balance interval, giving active load balancing more chances to migrate a running task from a little core to a big core.
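The effect can be sketched as follows (hypothetical numbers; a busy CPU's balance interval is stretched by busy_factor, as in rebalance_domains()):

```python
# Sketch of how busy_factor stretches the balance interval: a busy CPU
# multiplies the domain's minimum interval by busy_factor.
def balance_interval_ms(min_interval_ms, busy_factor, cpu_busy):
    return min_interval_ms * busy_factor if cpu_busy else min_interval_ms

# A domain with min_interval = 8ms and a typical busy_factor = 32:
# a busy CPU is only rebalanced every 256ms.
print(balance_interval_ms(8, 32, True))  # -> 256
# With busy_factor = 1 the interval stays at 8ms, so active load balance
# gets far more chances to pull a running task up to a big core.
print(balance_interval_ms(8, 1, True))   # -> 8
```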
Method 1 gives a prominent performance improvement on one big.LITTLE system (with 4x CA53 + 4x CA72 cores): from the Geekbench testing results, the score improves by ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board in the multi-thread case: the score improves from 2348 to 2368, a ~0.84% performance improvement.
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index dab2f90..c0e62fe 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -360,3 +360,27 @@ of the cpu from idle/busy power of the shared resources. The cpu can be
 tricked into different per-cpu idle states by disabling the other states.
 Based on various combinations of measurements with specific cpus busy and
 disabling idle-states it is possible to extrapolate the idle-state power.
+
+Performance tuning methods
+==========================
+
+The settings below may have a heavy impact on performance:
+
+echo 0 > /proc/sys/kernel/sched_migration_cost_ns
+
+Setting sched_migration_cost_ns to 0 helps spread tasks within the big
+cluster. Otherwise, when the scheduler executes load balance, it calls
+can_migrate_task() to check whether tasks are cache hot or not, comparing
+against sched_migration_cost_ns to avoid migrating tasks too frequently.
+This has the side effect of easily packing tasks onto the same CPU and
+introduces latency when spreading tasks across multiple cores, especially
+since energy-aware scheduling tends to pack tasks onto a single CPU.
+
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain1/busy_factor
+
+Setting busy_factor to 1 decreases the load balance interval. If we take
+min_interval = 8, the permitted load balance interval becomes
+busy_factor * min_interval = 8ms. This shortens task migration latency,
+especially when we want to migrate a running task from a little core to a
+big core via active load balance.
--
1.9.1
On 10-Oct 16:35, Leo Yan wrote:
Add two extra performance optimization methods by setting sysfs nodes:
Method 1: set sched_migration_cost_ns to 0:
By default sched_migration_cost_ns = 50000, scheduler calls
That's 50us...
can_migrate_task() to check whether tasks are cache hot or not, comparing against sched_migration_cost_ns to avoid migrating tasks too frequently.
This has the side effect of easily packing tasks onto the same CPU and introduces latency when spreading tasks across multiple cores, especially since energy-aware scheduling tends to pack tasks onto a single CPU. So after tasks have been packed onto one CPU with high utilization, we can easily spread them out once we set sched_migration_cost_ns to 0.
... dunno how exactly this metric is used by the scheduler but, according to its name and your explanation, it seems that in the use-case you are targeting, tasks need to be migrated more often than every 50us. Is that the case?
As a general comment, I can understand that a hardcoded 50us value might not be generic at all; however, is there any indication on how to properly dimension this value for a specific target?
Maybe a specific set of synthetic experiments could be used to figure out the best value to be used. In that case we should probably report in the documentation how to measure and tune this value experimentally, instead of just replacing one hardcoded value with another.
Method 2: set busy_factor to 1:
This decreases the load balance interval, giving active load balancing more chances to migrate a running task from a little core to a big core.
Same reasoning as before: how can we be sure that the value you are proposing (i.e. busy_factor=1) is really generic enough?
Method 1 gives a prominent performance improvement on one big.LITTLE system (with 4x CA53 + 4x CA72 cores): from the Geekbench testing results, the score improves by ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board in the multi-thread case: the score improves from 2348 to 2368, a ~0.84% performance improvement.
Am I correct in assuming that potentially different values could give us even better performance, but we tried and tested only the two values you are proposing?
Moreover, do we have any measure of the impact on energy consumption for the proposed value?
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
That seems to support the idea that what you are proposing are values "optimal" only for performance on a specific platform. Isn't it?
Signed-off-by: Leo Yan leo.yan@linaro.org
Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index dab2f90..c0e62fe 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -360,3 +360,27 @@ of the cpu from idle/busy power of the shared resources. The cpu can be
 tricked into different per-cpu idle states by disabling the other states.
 Based on various combinations of measurements with specific cpus busy and
 disabling idle-states it is possible to extrapolate the idle-state power.
+
+Performance tuning methods
+==========================
+
+The settings below may have a heavy impact on performance:
+
+echo 0 > /proc/sys/kernel/sched_migration_cost_ns
+
+Setting sched_migration_cost_ns to 0 helps spread tasks within the big
+cluster. Otherwise, when the scheduler executes load balance, it calls
+can_migrate_task() to check whether tasks are cache hot or not, comparing
+against sched_migration_cost_ns to avoid migrating tasks too frequently.
+This has the side effect of easily packing tasks onto the same CPU and
+introduces latency when spreading tasks across multiple cores, especially
+since energy-aware scheduling tends to pack tasks onto a single CPU.
+
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain1/busy_factor
+
+Setting busy_factor to 1 decreases the load balance interval. If we take
+min_interval = 8, the permitted load balance interval becomes
+busy_factor * min_interval = 8ms. This shortens task migration latency,
+especially when we want to migrate a running task from a little core to a
+big core via active load balance.
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 13 October 2016 at 15:05, Patrick Bellasi patrick.bellasi@arm.com wrote:
On 10-Oct 16:35, Leo Yan wrote:
Add two extra performance optimization methods by setting sysfs nodes:
Method 1: set sched_migration_cost_ns to 0:
By default sched_migration_cost_ns = 50000, scheduler calls
That's 50us...
In fact the default value is 500000 = 500us, not 50000 = 50us.
can_migrate_task() to check whether tasks are cache hot or not, comparing against sched_migration_cost_ns to avoid migrating tasks too frequently.
This has the side effect of easily packing tasks onto the same CPU and introduces latency when spreading tasks across multiple cores, especially since energy-aware scheduling tends to pack tasks onto a single CPU. So after tasks have been packed onto one CPU with high utilization, we can easily spread them out once we set sched_migration_cost_ns to 0.
... dunno how exactly this metric is used by the scheduler but, according to its name and your explanation, it seems that in the use-case you are targeting, tasks need to be migrated more often than every 50us. Is that the case?
The main advantage is that there are sysfs entries for these, so they can be tuned for each platform.
As a general comment, I can understand that a hardcoded 50us value might not be generic at all; however, is there any indication on how to properly dimension this value for a specific target?
Maybe a specific set of synthetic experiments could be used to figure out the best value to be used. In that case we should probably report in the documentation how to measure and tune this value experimentally, instead of just replacing one hardcoded value with another.
Method 2: set busy_factor to 1:
This decreases the load balance interval, giving active load balancing more chances to migrate a running task from a little core to a big core.
Same reasoning as before: how can we be sure that the value you are proposing (i.e. busy_factor=1) is really generic enough?
Method 1 gives a prominent performance improvement on one big.LITTLE system (with 4x CA53 + 4x CA72 cores): from the Geekbench testing results, the score improves by ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board in the multi-thread case: the score improves from 2348 to 2368, a ~0.84% performance improvement.
Am I correct in assuming that potentially different values could give us even better performance, but we tried and tested only the two values you are proposing?
For the 1st test, the root cause was that a task was hot on a CPU and couldn't be selected to migrate to another CPU because of its hotness, so decreasing sched_migration_cost_ns directly reduces the hotness period during which a task can't migrate to another CPU.
That being said, I'm not sure that this should be put in Documentation/scheduler/sched-energy.txt.
Moreover, do we have any measure of the impact on energy consumption for the proposed value?
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
That seems to support the idea that what you are proposing are values "optimal" only for performance on a specific platform. Isn't it?
Signed-off-by: Leo Yan leo.yan@linaro.org
Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index dab2f90..c0e62fe 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -360,3 +360,27 @@ of the cpu from idle/busy power of the shared resources. The cpu can be
 tricked into different per-cpu idle states by disabling the other states.
 Based on various combinations of measurements with specific cpus busy and
 disabling idle-states it is possible to extrapolate the idle-state power.
+
+Performance tuning methods
+==========================
+
+The settings below may have a heavy impact on performance:
+
+echo 0 > /proc/sys/kernel/sched_migration_cost_ns
+
+Setting sched_migration_cost_ns to 0 helps spread tasks within the big
+cluster. Otherwise, when the scheduler executes load balance, it calls
+can_migrate_task() to check whether tasks are cache hot or not, comparing
+against sched_migration_cost_ns to avoid migrating tasks too frequently.
+This has the side effect of easily packing tasks onto the same CPU and
+introduces latency when spreading tasks across multiple cores, especially
+since energy-aware scheduling tends to pack tasks onto a single CPU.
+
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain1/busy_factor
+
+Setting busy_factor to 1 decreases the load balance interval. If we take
+min_interval = 8, the permitted load balance interval becomes
+busy_factor * min_interval = 8ms. This shortens task migration latency,
+especially when we want to migrate a running task from a little core to a
+big core via active load balance.
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 13-Oct 15:18, Vincent Guittot wrote:
On 13 October 2016 at 15:05, Patrick Bellasi patrick.bellasi@arm.com wrote:
On 10-Oct 16:35, Leo Yan wrote:
Add two extra performance optimization methods by setting sysfs nodes:
Method 1: set sched_migration_cost_ns to 0:
By default sched_migration_cost_ns = 50000, scheduler calls
That's 50us...
In fact the default value is 500000 = 500us, not 50000 = 50us.
But still, whatever value you pick: 1) it does not consider the cache geometry of a specific target, and 2) how hot a task really is also depends on the specific task.
So what Leo is proposing (i.e. setting this value to 0) seems to me more like an optimization for a specific platform.
can_migrate_task() to check whether tasks are cache hot or not, comparing against sched_migration_cost_ns to avoid migrating tasks too frequently.
This has the side effect of easily packing tasks onto the same CPU and introduces latency when spreading tasks across multiple cores, especially since energy-aware scheduling tends to pack tasks onto a single CPU. So after tasks have been packed onto one CPU with high utilization, we can easily spread them out once we set sched_migration_cost_ns to 0.
... dunno how exactly this metric is used by the scheduler but, according to its name and your explanation, it seems that in the use-case you are targeting, tasks need to be migrated more often than every 50us. Is that the case?
The main advantage is that there are sysfs entries for these, so they can be tuned for each platform.
Right, so what we should rather report in the documentation, IMHO, is a recipe for finding a suitable value for a specific platform.
Which is a more generic solution, although still a suboptimal one since we are not considering the specific nature of tasks.
As a general comment, I can understand that a hardcoded 50us value might not be generic at all; however, is there any indication on how to properly dimension this value for a specific target?
Maybe a specific set of synthetic experiments could be used to figure out the best value to be used. In that case we should probably report in the documentation how to measure and tune this value experimentally, instead of just replacing one hardcoded value with another.
Method 2: set busy_factor to 1:
This decreases the load balance interval, giving active load balancing more chances to migrate a running task from a little core to a big core.
Same reasoning as before: how can we be sure that the value you are proposing (i.e. busy_factor=1) is really generic enough?
Method 1 gives a prominent performance improvement on one big.LITTLE system (with 4x CA53 + 4x CA72 cores): from the Geekbench testing results, the score improves by ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board in the multi-thread case: the score improves from 2348 to 2368, a ~0.84% performance improvement.
Am I correct in assuming that potentially different values could give us even better performance, but we tried and tested only the two values you are proposing?
For the 1st test, the root cause was that a task was hot on a CPU and couldn't be selected to migrate to another CPU because of its hotness, so decreasing sched_migration_cost_ns directly reduces the hotness period during which a task can't migrate to another CPU.
Ok, I don't know exactly how this value impacts load balancing but, still, what Leo is proposing is to reduce the value from 50us to 0us... even if the original value should be 500us: do Geekbench tasks need to migrate more often than every 500us?
If I'm not wrong 500us is quite likely lower than sched_min_granularity_ns (2.25 ms on my Nexus 5X).
Sorry for my lack of knowledge of that code, but I would really like to know the real reasons for the speedup we get by completely disregarding the migration costs.
Maybe Geekbench is a heavily CPU-bound workload with a small working set?
If that's the case, what's the impact of using 0 for sched_min_granularity_ns on tasks which have a bigger working set?
That being said, I'm not sure that this should be put in Documentation/scheduler/sched-energy.txt.
I agree with Vincent on that.
Moreover, do we have any measure of the impact on energy consumption for the proposed value?
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
That seems to support the idea that the values you are proposing are "optimal" only for performance on a specific platform. Isn't that so?
Signed-off-by: Leo Yan leo.yan@linaro.org
Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index dab2f90..c0e62fe 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -360,3 +360,27 @@ of the cpu from idle/busy power of the shared resources. The cpu can be tricked
 into different per-cpu idle states by disabling the other states. Based on
 various combinations of measurements with specific cpus busy and disabling
 idle-states it is possible to extrapolate the idle-state power.
+
+Performance tuning method
+=========================
+
+The settings below may heavily impact performance tuning:
+
+echo 0 > /proc/sys/kernel/sched_migration_cost_ns
+
+Setting sched_migration_cost_ns to 0 is helpful to spread tasks within
+the big cluster. Otherwise, when the scheduler executes load balance, it
+calls can_migrate_task() to check if tasks are cache hot or not, comparing
+against sched_migration_cost_ns to avoid migrating tasks too frequently.
+This introduces a side effect of easily packing tasks onto one CPU and adds
+latency to spreading tasks across multiple cores, especially since energy
+aware scheduling tends to pack tasks onto a single CPU.
+
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain1/busy_factor
+
+Setting busy_factor to 1 decreases the load balance interval time. If we
+take min_interval = 8, that means we permit a load balance interval of
+busy_factor * min_interval = 8ms. This shortens task migration latency,
+especially when we want to migrate a running task from a little core to a
+big core to trigger active load balance.
1.9.1
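For convenience, the writes proposed in the patch can be generated by a small script. This is a sketch assuming the Juno R2 layout used in the tests above (cpu0..cpu5, domain0/domain1; paths may differ per platform); it only prints the commands, so it is safe to run anywhere, and its output can be piped through `sh` on the target board:

```shell
#!/bin/sh
# Emit the tuning commands; pipe this script's output through `sh` on the
# target board to actually apply them.
CMDS="echo 0 > /proc/sys/kernel/sched_migration_cost_ns"

# One busy_factor write per CPU per domain; Juno R2 exposes cpu0..cpu5,
# each with domain0 (within-cluster) and domain1 (cross-cluster).
for cpu in 0 1 2 3 4 5; do
    for dom in 0 1; do
        CMDS="$CMDS
echo 1 > /proc/sys/kernel/sched_domain/cpu$cpu/domain$dom/busy_factor"
    done
done

printf '%s\n' "$CMDS"
```

Note on the arithmetic: with min_interval = 8 ms, busy_factor = 1 gives a busy-CPU balance interval of 1 * 8 = 8 ms, instead of the much longer interval produced by the default busy_factor (32 in kernels of this era, i.e. 256 ms).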
-- #include <best/regards.h>
Patrick Bellasi
On Thu, Oct 13, 2016 at 02:35:57PM +0100, Patrick Bellasi wrote:
[...]
Method 1 can improve performance prominently on one big.LITTLE system (which has CA53x4 + CA72x4 cores); from the Geekbench testing results, the score shows a performance improvement of ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board for the multi-thread case; the score improved from 2348 to 2368, a performance improvement of ~0.84%.
Am I correct in assuming that potentially different values could give us even better performance, but we tried and tested only the two values you are proposing?
For the 1st test, the root cause was that tasks were cache hot on a CPU and couldn't be selected to migrate to another CPU because of their hotness; decreasing sched_migration_cost_ns directly reduces the hotness period during which a task can't migrate to another CPU.
Ok, I don't know exactly how this value impacts load balancing but, still, what Leo is proposing is to reduce the value from 50us to 0us... even though the original value should be 500us: do Geekbench tasks need to migrate more often than every 500us?
I think the biggest benefit of setting it to 0 is that we don't miss any opportunity to migrate tasks when there is an imbalance in the first place; it gives more chances for load balancing between CPUs.
This is helpful for performance, especially when we want to spread tasks within the big cluster.
If I'm not wrong 500us is quite likely lower than sched_min_granularity_ns (2.25 ms on my Nexus 5X).
Sorry for my lack of knowledge of that code, but I would really like to know the real reasons for the speedup we get by completely disregarding the migration cost.
Maybe Geekbench is a heavily CPU-bound workload with a small working set?
Geekbench generates one thread per CPU, and if all threads can run simultaneously on all CPUs it can achieve a higher score. So we usually want to spread out all its tasks as much as possible.
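The workload shape described here (one CPU-bound worker per CPU, scoring best when all run in parallel) can be sketched as follows. This is a hypothetical stand-in for illustration, not Geekbench's actual code:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def spin(n):
    """CPU-bound busy loop standing in for one benchmark worker."""
    total = 0
    for i in range(n):
        total += i
    return total

ncpu = os.cpu_count() or 1
# One worker per CPU: the benchmark scores best when every worker runs on
# its own core in parallel, which is why packing them onto one CPU (or
# delaying their spread via the migration-cost check) hurts the MT score.
with ThreadPoolExecutor(max_workers=ncpu) as pool:
    results = list(pool.map(spin, [100_000] * ncpu))

print(len(results) == ncpu)  # True
```

(The real benchmark uses native threads that can run truly in parallel; the thread pool here just illustrates the one-worker-per-CPU fan-out.)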
If that's the case, what's the impact of using 0 for sched_min_granularity_ns on tasks which have a bigger working set?
Let me explain the phenomenon for this case.
For EAS, big tasks are migrated to the big cores, and when the system is under the tipping point, the task wakeup path is more likely to pack tasks onto one or two CPUs. So if we don't set sched_min_granularity_ns to 0, load balance may miss the opportunity to spread tasks to idle CPUs, so multiple threads run on the same CPU and introduce scheduling latency.
After setting sched_min_granularity_ns to 0, it gives more chances for task migration on imbalance. As a result, tasks can be spread out more quickly when load balance happens between CPUs.
That being said, I'm not sure that this should be put in Documentation/scheduler/sched-energy.txt
I agree with Vincent on that.
So what's your suggestion for tracking this? Should we write a dedicated doc file? My purpose is to track these settings easily and avoid losing them; I also want to help others avoid duplicating the work when they hit the same issue on their platform.
Or should we still use a wiki page to track these sysfs settings?
Moreover, do we have any measure of the impact on energy consumption for the proposed value?
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
That seems to support the idea that the values you are proposing are "optimal" only for performance on a specific platform. Isn't that so?
On Thu, Oct 13, 2016 at 03:18:16PM +0200, Vincent Guittot wrote:
On 13 October 2016 at 15:05, Patrick Bellasi patrick.bellasi@arm.com wrote:
On 10-Oct 16:35, Leo Yan wrote:
Add two extra performance optimization methods by setting sysfs nodes:
Method 1: set sched_migration_cost_ns to 0:
By default sched_migration_cost_ns = 50000, scheduler calls
That's 50us...
In fact the default value is 500000 = 500us, not 50000 = 50us.
Sorry, should be 500us.
can_migrate_task() to check if tasks are cache hot or not and it compares sched_migration_cost_ns to avoid migrating tasks frequently.
This introduces side effects: it easily packs tasks onto the same CPU and adds latency to spreading tasks across multiple cores, especially since energy aware scheduling easily packs tasks onto a single CPU. So after tasks are packed onto one CPU with high utilization, we can spread them out easily once we set sched_migration_cost_ns to 0.
... dunno how exactly this metric is used by the scheduler but, according to its name and your explanation, it seems that in the use-case you are targeting, tasks need to be migrated more often than every 50us. Is that the case?
The main advantage is that there is a sysfs entry for it, so it can be tuned for each platform.
If we set this value to 0, the biggest benefit I can see is that, when there is an idle CPU and two runnable tasks on another CPU, it gives more chance to migrate one of the runnable tasks onto the idle CPU immediately.
As a general comment, I can understand that a hardcoded 50us value may not be generic at all; however, is there any indication of how to properly dimension this value for a specific target?
Maybe a specific set of synthetic experiments can be used to figure out the best value to use. In that case we should probably report in the documentation how to measure and tune this value experimentally, instead of just changing one hardcoded value for another.
Method 2: set busy_factor to 1:
This decreases the load balance interval time, so it gives more chance for active load balance to migrate a running task from a little core to a big core.
Same reasoning as before: how can we be sure that the value you are proposing (i.e. busy_factor=1) is really generic enough?
Method 1 can improve performance prominently on one big.LITTLE system (which has CA53x4 + CA72x4 cores); from the Geekbench testing results, the score shows a performance improvement of ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board for the multi-thread case; the score improved from 2348 to 2368, a performance improvement of ~0.84%.
Am I correct in assuming that potentially different values could give us even better performance, but we tried and tested only the two values you are proposing?
Yes, I only tried these two values. As Patrick suggested, the methodology is more important than the hard-coded value.
For the 1st test, the root cause was that tasks were cache hot on a CPU and couldn't be selected to migrate to another CPU because of their hotness; decreasing sched_migration_cost_ns directly reduces the hotness period during which a task can't migrate to another CPU.
That being said, I'm not sure that this should be put in Documentation/scheduler/sched-energy.txt
I think for EAS these two parameters are more important than for traditional SMP load balance. Because EAS has more chances to pack tasks onto a single CPU or into one cluster, we need to utilize the existing mechanism to spread out these tasks; sched_migration_cost_ns and busy_factor are the two things we can rely on.
Moreover, do we have any measure of the impact on energy consumption for the proposed value?
From one member's platform, we have not observed any impact on energy consumption. I will try a video playback case on Juno to generate more power data.
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
That seems to support the idea that the values you are proposing are "optimal" only for performance on a specific platform. Isn't that so?
Yes.
On Thu, Oct 13, 2016 at 6:43 AM, Leo Yan leo.yan@linaro.org wrote:
On Thu, Oct 13, 2016 at 03:18:16PM +0200, Vincent Guittot wrote:
On 13 October 2016 at 15:05, Patrick Bellasi patrick.bellasi@arm.com
wrote:
On 10-Oct 16:35, Leo Yan wrote:
Add extra two performance optimization methods by setting sysfs nodes:
Method 1: set sched_migration_cost_ns to 0:
By default sched_migration_cost_ns = 50000, scheduler calls
That's 50us...
In fact default value is 500000 = 500us not 50000 = 50us
Sorry, should be 500us.
can_migrate_task() to check if tasks are cache hot or not and it compares sched_migration_cost_ns to avoid migrating tasks frequently.
This introduces side effects: it easily packs tasks onto the same CPU and adds latency to spreading tasks across multiple cores, especially since energy aware scheduling easily packs tasks onto a single CPU. So after tasks are packed onto one CPU with high utilization, we can spread them out easily once we set sched_migration_cost_ns to 0.
... dunno how exactly this metric is used by the scheduler but, according to its name and your explanation, it seems that in the use-case you are targeting, tasks need to be migrated more often than every 50us. Is that the case?
The main advantage is that there is a sysfs entry for it, so it can be tuned for each platform.
If we set this value to 0, the biggest benefit I can see is that, when there is an idle CPU and two runnable tasks on another CPU, it gives more chance to migrate one of the runnable tasks onto the idle CPU immediately.
The currently available mechanism to get less latency and spread tasks is to set the prefer_idle flag (which spreads tasks in the corresponding cgroup as long as there are idle cores). Is this set in this experiment? Isn't load balancing supposed to be more about throughput, and hence "slower" to kick in moving tasks as needed?
eas-dev mailing list eas-dev@lists.linaro.org https://lists.linaro.org/mailman/listinfo/eas-dev
On 13-Oct 08:50, Andres Oportus wrote:
The currently available mechanism to get less latency and spread tasks is to set the prefer_idle flag (which spreads tasks in the corresponding cgroup as long as there are idle cores).
That's what AOSP's kernels use for the wakeup path.
Is this set in this experiment? Isn't load balancing supposed to be more about throughput, and hence "slower" to kick in moving tasks as needed?
What Leo is addressing here is idle load balance: when a CPU is going to become idle, we would like to pull tasks from CPUs which have many, "as soon as possible".
It seems that, based on his experiments, the "migration cost" impacts the movement of some tasks by introducing latencies. By tuning the migration cost value (actually, setting it to 0), _some_ benchmarks have been measured to get a performance uplift.
What we need to understand (in the first instance) is how "generic" (i.e. platform and workload independent) the proposed solution is.
-- #include <best/regards.h>
Patrick Bellasi
On Thu, Oct 13, 2016 at 9:04 AM, Patrick Bellasi patrick.bellasi@arm.com wrote:
The currently available mechanism to have less latency and spread tasks is to set the prefer_idle flag (spreads tasks in the corresponding cgroup as long as there are idle cores).
That's what AOSP's kernels use for the wakeup path.
Wouldn't placement in the wakeup path possibly put a task on an idle CPU to begin with, and get the same or higher performance improvement compared to moving the task as part of load balancing?
Is this set in this experiment? Isn't load balancing supposed to be more about throughput, and hence "slower" to kick in moving tasks as needed?
What Leo is addressing here is idle load balance: when a CPU is going to become idle, we would like to pull tasks from CPUs which have many, "as soon as possible".
It seems that, based on his experiments, the "migration cost" impacts the movement of some tasks by introducing latencies. By tuning the migration cost value (actually, setting it to 0), _some_ benchmarks have been measured to get a performance uplift.
What we need to understand (in the first instance) is how "generic" (i.e. platform and workload independent) the proposed solution is.
I agree. I would think that migration cost tuning could improve load balancing behavior, but setting it to 0, effectively saying there is no cost to task movement as part of load balancing, seems incorrect. I'm wondering if this is a scenario that could be improved/tuned, say, in the wakeup path rather than by assuming no migration cost.
As a general comment, I can understand that an hardcoded 50us value could be not generic at all, however: is there any indication on
how
to properly dimension this value for a specific target?
Maybe a specific set of synthetics experiments can be used to
figure
out what is the best value to be used. In that case we should probably report in the documentation how to measure and tune experimentally this value instead of just
changing an
hardcoded value with another one.
Method 2: set busy_factor to 1:
This decreases load balance inteval time, so it will give more
chance
for active load balance for migration running task from little
core
to
big core.
Same reasoning as before, how can be sure that the value you are proposing (ie. busy_factor=1) it is really generic enough?
Method 1 can improve prominent performance on one big.LITTLE
system
(which has CA53x4 + CA72x4 cores), from the Geekbench testing
result
the
score can improve performance ~5%.
Tested method 1 with Geekbench on the ARM Juno R2 board for
multi-thread
case, the score can be improved from 2348 to 2368, so can improve performance ~0.84%.
Am I correct on assuming that potentially different values can
give us
even better performance but we tried and tested only the two values you are proposing?
Yes, I only tried these two values. Just like Patrick suggested, the methodology is more important rather than hard-coded value.
For the 1st test, the root cause was that tasks was hot on a CPU and can't be selected to migrate on other CPU because of its hotness so decreasing the sched_migration_cost_ns directly reduce the hotness period during which a task can't migrate on another CPU
That being said, i'm not sure that this should be put in the Documentation/scheduler/sched-energy.tx
I think for EAS, these two paramters are more important than traditional SMP load balance. Because EAS will have more chance to pack tasks onto single CPU or into one cluster, so we need utilize the existed machenism to spread out these tasks, so sched_migration_cost_ns and busy_factors are two things we can rely on.
Moreover, do we have any measure of the impact on energy consumption for the proposed value?
From one member's platform, we have not observed an impact on energy consumption. I will try to use a video playback case on Juno to generate more power data.
Tested method 2 on Juno as well, but it gives only a very minor performance boost.
That seems to support the idea that the values you are proposing are "optimal" only for performance on a specific platform. Isn't it?
Yes.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index dab2f90..c0e62fe 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -360,3 +360,27 @@ of the cpu from idle/busy power of the shared resources. The cpu can be tricked
 into different per-cpu idle states by disabling the other states. Based on
 various combinations of measurements with specific cpus busy and disabling
 idle-states it is possible to extrapolate the idle-state power.
+
+Performance tuning method
+=========================
+
+The settings below may heavily impact performance tuning:
+
+echo 0 > /proc/sys/kernel/sched_migration_cost_ns
+
+After setting sched_migration_cost_ns to 0, it is helpful to spread tasks within
+the big cluster. Otherwise, when the scheduler executes load balance, it calls
+can_migrate_task() to check whether tasks are cache hot, comparing against
+sched_migration_cost_ns to avoid migrating tasks too frequently. This introduces
+the side effect of easily packing tasks on the same CPU and adds latency to
+spreading tasks across multiple cores, especially since energy aware scheduling
+tends to pack tasks on a single CPU.
+
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpuX/domain1/busy_factor
+
+After setting busy_factor to 1, the load balance interval time decreases. So if
+we take min_interval = 8, that means we permit a load balance interval of
+busy_factor * min_interval = 8ms. This shortens task migration latency,
+especially if we want to migrate a running task from a little core to a big
+core to trigger active load balance.
--
1.9.1
--
#include <best/regards.h>

Patrick Bellasi

eas-dev mailing list
eas-dev@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/eas-dev
On 13-Oct 09:15, Andres Oportus wrote:
On Thu, Oct 13, 2016 at 9:04 AM, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
On 13-Oct 08:50, Andres Oportus wrote:
On Thu, Oct 13, 2016 at 6:43 AM, Leo Yan <leo.yan@linaro.org> wrote:
On Thu, Oct 13, 2016 at 03:18:16PM +0200, Vincent Guittot wrote:
On 13 October 2016 at 15:05, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
On 10-Oct 16:35, Leo Yan wrote:
> Add extra two performance optimization methods by setting sysfs nodes:
>
> - Method 1: set sched_migration_cost_ns to 0:
>
> By default sched_migration_cost_ns = 50000, scheduler calls

That's 50us...
In fact the default value is 500000 = 500us, not 50000 = 50us.
Sorry, should be 500us.
> can_migrate_task() to check if tasks are cache hot or not and it
> compares sched_migration_cost_ns to avoid migrate tasks frequently.
>
> This introduces side effects to easily pack tasks on the same one CPU
> and introduce latency to spread tasks within multi-cores, especially
> if we think energy aware scheduling is easily to pack tasks on single
> CPU. So after task packing on one CPU with high utilization, we can
> easily spread out tasks after we set sched_migration_cost_ns to 0.
... dunno how exactly this metric is used by the scheduler but, according to its name and your explanation, it seems that in the use-case you are targeting, tasks need to be migrated more often than every 50us. Is that the case?
The main advantage is that there is a sysfs entry for that, so it can be tuned for each platform.
If we set this value to 0, the biggest benefit I can see is that if there is an idle CPU and two runnable tasks on another CPU, it gives more chance to migrate one of the runnable tasks onto the idle CPU immediately.
The currently available mechanism to get less latency and spread tasks is to set the prefer_idle flag (it spreads tasks in the corresponding cgroup as long as there are idle cores). That's what AOSP's kernels use for the wakeup path.
Wouldn't placement in the wakeup path possibly place a task on an idle CPU to begin with, and get the same or higher performance improvement compared to moving the task as part of load balancing? Is this set in this experiment? Isn't load balancing supposed to be more about throughput, and hence "slower" to kick in moving tasks as needed?
What Leo is addressing here is idle load balance: when a CPU is going to become idle, we would like to pull tasks from CPUs which have many, "as soon as possible".
It seems that, based on his experiments, the "migration cost" is impacting the movement of some tasks by introducing latencies. By tuning the migration cost value (actually setting it to 0), _some_ benchmarks have been measured to get an uplift in performance.
What we need to understand (in the first instance) is how "generic" (i.e. platform and workload independent) the proposed solution is.
I agree. I would think that migration cost tuning could improve load balancing behavior, but setting it to 0, effectively saying that there is no cost to task movement as part of load balancing, seems incorrect. I'm wondering if this is a scenario that could be improved/tuned, say, in the wakeup path rather than by assuming no migration cost.
Independently of how smart the wakeup path is, there can still be cases in which you end up with a CPU going idle while another one has more than one RUNNABLE task in its RQs.
Moreover, consider that in a loaded system we bail out from the EAS mode and use the "normal" scheduler.
In both cases the load balancer (both idle balance and active balance) are still valuable opportunities to migrate tasks among CPUs.
On Thu, Oct 13, 2016 at 9:22 AM, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> In both cases the load balancer (both idle balance and active balance)
> are still valuable opportunities to migrate tasks among CPUs.

Agreed.
Energy aware scheduling sets the tipping point when any CPU in the system is overutilized. There are several occasions to set the root domain's overutilized flag to indicate the system is over the tipping point, like the scheduler tick, load balance, and task enqueue; on the other hand, the scheduler only utilizes load balance's function update_sg_lb_stats() to iterate all CPUs to make sure no CPU is overutilized, and only then clears this flag once the system is under the tipping point.
An idle CPU keeps a stale utilization value, and this value will not be updated until the CPU is woken up. In the worst case, the CPU may stay in idle states for a very long time (even at the seconds level); if the CPU had quite a high utilization value before entering the idle state, the scheduler will always think the CPU is "overutilized" and will not switch to the under-tipping-point state. As a result, a very small task stays on a big core for a long time because the system cannot go back to the energy aware path.
This patch checks the CPU idle state in update_sg_lb_stats(): if a CPU is in an idle state, it is simply considered not overutilized. This avoids setting the tipping point because of idle CPUs.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 937eca2..43eae09 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7409,7 +7409,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;

-		if (cpu_overutilized(i)) {
+		if (cpu_overutilized(i) && !idle_cpu(i)) {
 			*overutilized = true;
 			if (!sgs->group_misfit_task && rq->misfit_task)
 				sgs->group_misfit_task = capacity_of(i);
--
1.9.1
On 10-Oct 16:35, Leo Yan wrote:
Energy aware scheduling sets tipping point when any CPU in the system is overutilized. So there have several occasions to set root domain's overutilized flag to indicate system is over tipping point, like scheduler tick, load balance, enqueue task, on the other hand the scheduler only utilize load balance's function update_sg_lb_stats() to iterate all CPUs to make sure all CPUs are not overutilized and then clear this flag after system is under tipping point,
For idle CPU, it will keep stale utilization value and this value will not be updated until the CPU is waken up. In some worse case, the CPU may stay in idle states for very long time (even may in second level), so before the CPU enter idle state it has quite high utilization value this will let scheduler always think the CPU is "overutilized" and will not switch to state for under tipping point. As result, a very small task stays on big core for long time due system cannot go back to energy aware path.
What happens instead if a busy CPU has just entered idle but is likely to exit quite soon?
This patch is to check CPU idle state in function update_sg_lb_stats(), so if CPU is in idle state then will simply consider the CPU is not overutilized. So avoid to set tipping point by idle CPUs.
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg() and then verify whether the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
-- #include <best/regards.h>
Patrick Bellasi
Hi Leo, Patrick,
Do we have c-state info at this time? We could perhaps only 'do something' (ignore overutilized, or update load averages, whatever is best) for idle CPUs which have requested the deepest idle state, on the grounds that shallow idle states indicate we expect to leave idle very soon. We would also need a good definition of shallow idle states of course; on some platforms I guess the exit latency of CPU-down might be reported low enough that we enter it a lot, which makes it a bit spurious to even bother checking.
Leo, do you have a feel for whether this is a very rare event or something we can reproduce with a test case easily?
Best Regards,
--Chris
________________________________
From: Patrick Bellasi <patrick.bellasi@arm.com>
Sent: 11 October 2016 12:35
To: leo.yan@linaro.org
Cc: eas-dev@lists.linaro.org; Dietmar Eggemann; Morten Rasmussen; Robin Randhawa; Juri Lelli; Chris Redpath; Vincent Guittot; Steve Muckle
Subject: Re: [PATCH v1 7/7] sched/fair: consider CPU overutilized only when it is not idle
Hi Chris,
On Tue, Oct 11, 2016 at 12:39:05PM +0000, Chris Redpath wrote:
Hi Leo, Patrick,
Do we have c-state info at this time? We could perhaps only 'do something' (ignore overutilised or update load averages, whatever is best) for idle CPUs which have requested the deepest idle state - on the grounds that shallow idle states indicate we expect to be leaving idle very soon. We would also need a good definition of shallow idle states of course - on some platforms I guess the exit latency of CPU down might be reported to be low enough that we enter a lot, which makes it a bit spurious to even bother checking.
Yes. We can get c-state related info from the functions idle_get_state_idx() and idle_get_state(). But I think you are suggesting to 'do something' for CPUs in the deepest idle state and leave the shallow idle state CPUs alone; I think this leaves some corner cases where shallow idle state CPUs may stay idle for a long time.
Leo, do you have a feel for if this is a very rare event or something we can repro with a test case easily?
I generated one rt-app case which can reproduce it quickly on Hikey; please note I found this issue much easier to reproduce on Android than on generic Linux.
Please see the slides [1]: http://www.slideshare.net/linaroorg/las16tr04-using-tracing-to-tune-and-opti...
In the slides the second example is to reproduce this issue with LISA rt-app scripts, the notebook file to generate rt-app workload in the folder: https://fileserver.linaro.org/owncloud/index.php/s/5gpVpzN0FdxMmGl?path=%2Fs...
Thanks, Leo Yan
On Tue, Oct 11, 2016 at 12:35:26PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
Energy aware scheduling sets tipping point when any CPU in the system is overutilized. So there have several occasions to set root domain's overutilized flag to indicate system is over tipping point, like scheduler tick, load balance, enqueue task, on the other hand the scheduler only utilize load balance's function update_sg_lb_stats() to iterate all CPUs to make sure all CPUs are not overutilized and then clear this flag after system is under tipping point,
For idle CPU, it will keep stale utilization value and this value will not be updated until the CPU is waken up. In some worse case, the CPU may stay in idle states for very long time (even may in second level), so before the CPU enter idle state it has quite high utilization value this will let scheduler always think the CPU is "overutilized" and will not switch to state for under tipping point. As result, a very small task stays on big core for long time due system cannot go back to energy aware path.
What happen instead if a busy CPU has just entered idle but it's likely to exit quite soon?
If CPUs can update their utilization timely, the small task can usually migrate back to a LITTLE core quickly.
This patch is to check CPU idle state in function update_sg_lb_stats(), so if CPU is in idle state then will simply consider the CPU is not overutilized. So avoid to set tipping point by idle CPUs.
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg and than verify if the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
Yes. Essentially this issue is caused by the util_avg signal not being updated timely, so it's better to find a method to update an idle CPU's utilization properly.
IIRC, Morten reminded me before in one email that if we want to update an idle CPU's load_avg, we should acquire the rq's lock. Another thing to consider is that an idle CPU's stale utilization does not only impact this case; it also introduces misunderstanding in the wakeup path for CPU selection. So I sent another patch before to update idle CPUs' utilization with the function update_blocked_averages() [1].
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000567.html
But I think the patch [1] is hard to get accepted because it introduces a race condition for rq locks. So this is why I went back to the simplest method, to resolve part of the issue in this patch.
What's your suggestion for this?
On 11-Oct 21:37, Leo Yan wrote:
On Tue, Oct 11, 2016 at 12:35:26PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
Energy aware scheduling sets the tipping point when any CPU in the system is overutilized. There are several occasions on which the root domain's overutilized flag gets set to indicate the system is over the tipping point, such as the scheduler tick, load balance, and task enqueue; on the other hand, the scheduler only uses the load balancer's update_sg_lb_stats() to iterate over all CPUs, make sure none of them is overutilized, and then clear this flag once the system is under the tipping point.
An idle CPU keeps a stale utilization value which will not be updated until the CPU is woken up. In the worst case, the CPU may stay in idle states for a very long time (even on the order of seconds), so if the CPU had quite a high utilization value before entering idle, the scheduler will keep considering the CPU "overutilized" and will never switch to the under-tipping-point state. As a result, a very small task stays on a big core for a long time because the system cannot go back to the energy aware path.
What happens instead if a busy CPU has just entered idle but is likely to exit quite soon?
If CPUs can update their utilization in a timely manner, the small task can usually migrate back to a LITTLE core quickly.
I was referring mainly to the case, for example, of an 80% task with a short period (e.g. 16ms) which has just gone to sleep. Such a task has a PELT signal which varies in [800:860] once stable.
In this case the CPU can appear to be IDLE, but as soon as the task wakes up we will not have decayed its utilization below the tipping point.
This patch checks the CPU idle state in update_sg_lb_stats(): if a CPU is in the idle state, it is simply considered not overutilized. This avoids having idle CPUs set the tipping point.
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg and then verify if the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
Yes. Essentially this issue is caused by the util_avg signal not being updated in a timely manner, so it's better to find a method to update an idle CPU's utilization properly.
IIRC, Morten reminded me in an earlier email that if we want to update an idle CPU's load_avg, we should acquire the rq's lock.
True, but that's an idle CPU, thus it should not be a main issue. Unless we already have a lock for another CPU, but I don't think that's our case (see after).
Another thing to consider is that an idle CPU's stale utilization does not only impact this case; it also misleads CPU selection in the wakeup path, so I previously sent another patch to update idle CPUs' utilization with update_blocked_averages() [1].
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000567.html
Exactly, does it make sense to use the same update_blocked_averages() for an idle CPU at this point?
But I think patch [1] is hard to get accepted because it introduces a race condition on the rq locks. That is why I went back to the simplest method, which resolves part of the issue in this patch.
AFAICS, at least on a 3.18 kernel, these are the calls which end up in update_sg_lb_stats:

idle_balance          /* release this_cpu lock */
rebalance_domains     /* does not hold any RQ lock */
  load_balance
    find_busiest_group
      update_sg_lb_stats
To me it seems that we always enter the update_sg_lb_stats without holding any RQ lock. Am I missing something?
Otherwise, what are the races we expect if we try to get the RQ lock for an idle CPU from within update_sg_lb_stats?
What's your suggestion for this?
Maybe Dietmar's observations regarding [1] were related to the fact that in that patch we were trying to update the blocked loads from the wakeup path?
Whereas here we are in the balance path, thus quite likely the overheads introduced to update idle CPUs utilization can be considered acceptable, especially considering that otherwise we miss optimization opportunities.
Cheers Patrick
On Tue, Oct 11, 2016 at 04:58:28PM +0100, Patrick Bellasi wrote:
On 11-Oct 21:37, Leo Yan wrote:
On Tue, Oct 11, 2016 at 12:35:26PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
Energy aware scheduling sets the tipping point when any CPU in the system is overutilized. There are several occasions on which the root domain's overutilized flag gets set to indicate the system is over the tipping point, such as the scheduler tick, load balance, and task enqueue; on the other hand, the scheduler only uses the load balancer's update_sg_lb_stats() to iterate over all CPUs, make sure none of them is overutilized, and then clear this flag once the system is under the tipping point.
An idle CPU keeps a stale utilization value which will not be updated until the CPU is woken up. In the worst case, the CPU may stay in idle states for a very long time (even on the order of seconds), so if the CPU had quite a high utilization value before entering idle, the scheduler will keep considering the CPU "overutilized" and will never switch to the under-tipping-point state. As a result, a very small task stays on a big core for a long time because the system cannot go back to the energy aware path.
What happens instead if a busy CPU has just entered idle but is likely to exit quite soon?
If CPUs can update their utilization in a timely manner, the small task can usually migrate back to a LITTLE core quickly.
I was referring mainly to the case, for example, of an 80% task with a short period (e.g. 16ms) which has just gone to sleep. Such a task has a PELT signal which varies in [800:860] once stable.
In this case the CPU can appear to be IDLE, but as soon as the task wakes up we will not have decayed its utilization below the tipping point.
For this case, if load balancing previously cleared the overutilized flag because of idle CPUs, I think we can still rely on the 80% task being enqueued again; at that point it will set the 'overutilized' flag again.
This patch checks the CPU idle state in update_sg_lb_stats(): if a CPU is in the idle state, it is simply considered not overutilized. This avoids having idle CPUs set the tipping point.
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg and then verify if the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
Yes. Essentially this issue is caused by the util_avg signal not being updated in a timely manner, so it's better to find a method to update an idle CPU's utilization properly.
IIRC, Morten reminded me in an earlier email that if we want to update an idle CPU's load_avg, we should acquire the rq's lock.
True, but that's an idle CPU, thus it should not be a main issue. Unless we already have a lock for another CPU, but I don't think that's our case (see after).
Another thing to consider is that an idle CPU's stale utilization does not only impact this case; it also misleads CPU selection in the wakeup path, so I previously sent another patch to update idle CPUs' utilization with update_blocked_averages() [1].
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000567.html
Exactly, does it make sense to use the same update_blocked_averages() for an idle CPU at this point?
Using update_blocked_averages() is safe I think, and your suggestion to update idle CPUs' utilization obviously has less overhead than my patch [1].
This method deserves a try. BTW, I'm just curious: do you know how to measure the scheduler's overhead? I have no idea how much extra overhead is introduced by this patch, so I want to compare performance before and after applying it.
But I think patch [1] is hard to get accepted because it introduces a race condition on the rq locks. That is why I went back to the simplest method, which resolves part of the issue in this patch.
AFAICS, at least on a 3.18 kernel, these are the calls which end up in update_sg_lb_stats:

idle_balance          /* release this_cpu lock */
rebalance_domains     /* does not hold any RQ lock */
  load_balance
    find_busiest_group
      update_sg_lb_stats
To me it seems that we always enter the update_sg_lb_stats without holding any RQ lock. Am I missing something?
Otherwise, what are the races we expect if we try to get the RQ lock for an idle CPU from within update_sg_lb_stats?
For example, two CPUs (e.g. CPU0 and CPU1) may race to update CPU2's util_avg and load_avg:

CPU0                                        CPU1

idle_balance                                ttwu_queue(CPU2's rq)
/* release this_cpu lock */                 /* acquire this_cpu lock */
rebalance_domains                           ttwu_do_activate
/* does not hold any RQ lock */               ttwu_activate
load_balance                                    activate_task
  find_busiest_group                              enqueue_task
    update_sg_lb_stats                              enqueue_task_fair
      update_blocked_averages(CPU2's rq)              enqueue_entity
        `-------------------------------------------> update_load_avg
                       race condition
                                            /* release this_cpu lock */
What's your suggestion for this?
Maybe Dietmar's observations regarding [1] were related to the fact that in that patch we were trying to update the blocked loads from the wakeup path?
Dietmar was referring to select_energy_cpu_brute() and capacity_spare_wake(); these two functions remove the waking task's util from its previous CPU, but they have nothing to do with an idle CPU's blocked load.
Whereas here we are in the balance path, thus quite likely the overheads introduced to update idle CPUs utilization can be considered acceptable, especially considering that otherwise we miss optimization opportunities.
Agreed. I will generate a new patch. Thanks a lot for the very detailed review and suggestions :)
Thanks, Leo Yan
On 12-Oct 10:33, Leo Yan wrote:
On Tue, Oct 11, 2016 at 04:58:28PM +0100, Patrick Bellasi wrote:
On 11-Oct 21:37, Leo Yan wrote:
On Tue, Oct 11, 2016 at 12:35:26PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
Energy aware scheduling sets the tipping point when any CPU in the system is overutilized. There are several occasions on which the root domain's overutilized flag gets set to indicate the system is over the tipping point, such as the scheduler tick, load balance, and task enqueue; on the other hand, the scheduler only uses the load balancer's update_sg_lb_stats() to iterate over all CPUs, make sure none of them is overutilized, and then clear this flag once the system is under the tipping point.
An idle CPU keeps a stale utilization value which will not be updated until the CPU is woken up. In the worst case, the CPU may stay in idle states for a very long time (even on the order of seconds), so if the CPU had quite a high utilization value before entering idle, the scheduler will keep considering the CPU "overutilized" and will never switch to the under-tipping-point state. As a result, a very small task stays on a big core for a long time because the system cannot go back to the energy aware path.
What happens instead if a busy CPU has just entered idle but is likely to exit quite soon?
If CPUs can update their utilization in a timely manner, the small task can usually migrate back to a LITTLE core quickly.
I was referring mainly to the case, for example, of an 80% task with a short period (e.g. 16ms) which has just gone to sleep. Such a task has a PELT signal which varies in [800:860] once stable.
In this case the CPU can appear to be IDLE, but as soon as the task wakes up we will not have decayed its utilization below the tipping point.
For this case, if load balancing previously cleared the overutilized flag because of idle CPUs, I think we can still rely on the 80% task being enqueued again; at that point it will set the 'overutilized' flag again.
Mmm... don't we risk scheduling the 80% task using the energy-aware path instead of select_idle_siblings?
The main issue to me is that we are going back to energy aware while the system, in this specific scenario, is overutilized. Dunno what the implications are, but that's a critical point to be considered... what about scheduler stability? Don't we risk increasing the frequency of entering/exiting EAS mode?
This patch checks the CPU idle state in update_sg_lb_stats(): if a CPU is in the idle state, it is simply considered not overutilized. This avoids having idle CPUs set the tipping point.
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg and then verify if the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
Yes. Essentially this issue is caused by the util_avg signal not being updated in a timely manner, so it's better to find a method to update an idle CPU's utilization properly.
IIRC, Morten reminded me in an earlier email that if we want to update an idle CPU's load_avg, we should acquire the rq's lock.
True, but that's an idle CPU, thus it should not be a main issue. Unless we already have a lock for another CPU, but I don't think that's our case (see after).
Another thing to consider is that an idle CPU's stale utilization does not only impact this case; it also misleads CPU selection in the wakeup path, so I previously sent another patch to update idle CPUs' utilization with update_blocked_averages() [1].
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000567.html
Exactly, does it make sense to use the same update_blocked_averages() for an idle CPU at this point?
Using update_blocked_averages() is safe I think, and your suggestion to update idle CPUs' utilization obviously has less overhead than my patch [1].
This method deserves a try. BTW, I'm just curious: do you know how to measure the scheduler's overhead? I have no idea how much extra overhead is introduced by this patch, so I want to compare performance before and after applying it.
Probably you can collect some interesting stats using our new LISA function-profiling support. This should allow you to profile the time spent in update_sg_lb_stats (or any other function).
Please note that you should probably change the definition of this function from:

    static inline void update_sg_lb_stats

to be:

    noinline void update_sg_lb_stats
The usage of "noinline" will ensure that the function is available to ftrace for profiling.
Here is an example notebook to get started: https://github.com/ARM-software/lisa/blob/master/ipynb/examples/trace_analys...
You can easily update the notebook to get a comparison among two runs. Here is a notebook I've used to compare two different versions of energy_diff: https://gist.github.com/derkling/fd9d36e004da977cd3be0001c8abaa96
But I think patch [1] is hard to get accepted because it introduces a race condition on the rq locks. That is why I went back to the simplest method, which resolves part of the issue in this patch.
AFAICS, at least on a 3.18 kernel, these are the calls which end up in update_sg_lb_stats:

idle_balance          /* release this_cpu lock */
rebalance_domains     /* does not hold any RQ lock */
  load_balance
    find_busiest_group
      update_sg_lb_stats
To me it seems that we always enter the update_sg_lb_stats without holding any RQ lock. Am I missing something?
Otherwise, what are the races we expect if we try to get the RQ lock for an idle CPU from within update_sg_lb_stats?
For example, two CPUs (e.g. CPU0 and CPU1) may race to update CPU2's util_avg and load_avg:

CPU0                                        CPU1

idle_balance                                ttwu_queue(CPU2's rq)
/* release this_cpu lock */                 /* acquire this_cpu lock */
rebalance_domains                           ttwu_do_activate
/* does not hold any RQ lock */               ttwu_activate
load_balance                                    activate_task
  find_busiest_group                              enqueue_task
    update_sg_lb_stats                              enqueue_task_fair
      update_blocked_averages(CPU2's rq)              enqueue_entity
        `-------------------------------------------> update_load_avg
                       race condition
                                            /* release this_cpu lock */

Ok, that's a race but it does not lead to a deadlock. Either CPU0 or CPU1 will update the CPU2 stats, and then the other CPU will get the lock and just bail out because the stats have already been updated.
Is that an issue? Maybe I'm missing something...
What's your suggestion for this?
Maybe Dietmar's observations regarding [1] were related to the fact that in that patch we were trying to update the blocked loads from the wakeup path?
Dietmar was referring to select_energy_cpu_brute() and capacity_spare_wake(); these two functions remove the waking task's util from its previous CPU, but they have nothing to do with an idle CPU's blocked load.
Ok, but mainly these functions are part of the critical path... that's the main concern I guess.
Whereas here we are in the balance path, thus quite likely the overheads introduced to update idle CPUs utilization can be considered acceptable, especially considering that otherwise we miss optimization opportunities.
Agreed. I will generate a new patch. Thanks a lot for the very detailed review and suggestions :)
Thanks, Leo Yan
Cheers Patrick
On Wed, Oct 12, 2016 at 12:35:00PM +0100, Patrick Bellasi wrote:
[...]
On 12-Oct 10:33, Leo Yan wrote:
On Tue, Oct 11, 2016 at 04:58:28PM +0100, Patrick Bellasi wrote:
On 11-Oct 21:37, Leo Yan wrote:
On Tue, Oct 11, 2016 at 12:35:26PM +0100, Patrick Bellasi wrote:
On 10-Oct 16:35, Leo Yan wrote:
Energy aware scheduling sets the tipping point when any CPU in the system is overutilized. There are several occasions on which the root domain's overutilized flag gets set to indicate the system is over the tipping point, such as the scheduler tick, load balance, and task enqueue; on the other hand, the scheduler only uses the load balancer's update_sg_lb_stats() to iterate over all CPUs, make sure none of them is overutilized, and then clear this flag once the system is under the tipping point.
An idle CPU keeps a stale utilization value which will not be updated until the CPU is woken up. In the worst case, the CPU may stay in idle states for a very long time (even on the order of seconds), so if the CPU had quite a high utilization value before entering idle, the scheduler will keep considering the CPU "overutilized" and will never switch to the under-tipping-point state. As a result, a very small task stays on a big core for a long time because the system cannot go back to the energy aware path.
What happens instead if a busy CPU has just entered idle but is likely to exit quite soon?
If CPUs can update their utilization in a timely manner, the small task can usually migrate back to a LITTLE core quickly.
I was referring mainly to the case, for example, of an 80% task with a short period (e.g. 16ms) which has just gone to sleep. Such a task has a PELT signal which varies in [800:860] once stable.
In this case the CPU can appear to be IDLE, but as soon as the task wakes up we will not have decayed its utilization below the tipping point.
For this case, if load balancing previously cleared the overutilized flag because of idle CPUs, I think we can still rely on the 80% task being enqueued again; at that point it will set the 'overutilized' flag again.
Mmm... don't we risk scheduling the 80% task using the energy-aware path instead of select_idle_siblings?
The main issue to me is that we are going back to energy aware while the system, in this specific scenario, is overutilized. Dunno what the implications are, but that's a critical point to be considered... what about scheduler stability? Don't we risk increasing the frequency of entering/exiting EAS mode?
This is a good point. I will try your proposal to update idle CPUs' utilization in update_sg_lb_stats(), so that idle CPUs' signals can also decay step by step.
This patch checks the CPU idle state in update_sg_lb_stats(): if a CPU is in the idle state, it is simply considered not overutilized. This avoids having idle CPUs set the tipping point.
Maybe it's possible, just for idle CPUs marked as overutilized, to trigger at this point an update_cfs_rq_load_avg and then verify if the utilization signal has decayed enough for the CPU to be considered not overutilized anymore?
Yes. Essentially this issue is caused by the util_avg signal not being updated in a timely manner, so it's better to find a method to update an idle CPU's utilization properly.
IIRC, Morten reminded me in an earlier email that if we want to update an idle CPU's load_avg, we should acquire the rq's lock.
True, but that's an idle CPU, thus it should not be a main issue. Unless we already have a lock for another CPU, but I don't think that's our case (see after).
Another thing to consider is that an idle CPU's stale utilization does not only impact this case; it also misleads CPU selection in the wakeup path, so I previously sent another patch to update idle CPUs' utilization with update_blocked_averages() [1].
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000567.html
Exactly, does it make sense to use the same update_blocked_averages() for an idle CPU at this point?
Using update_blocked_averages() is safe I think, and your suggestion to update idle CPUs' utilization obviously has less overhead than my patch [1].
This method deserves a try. BTW, I'm just curious: do you know how to measure the scheduler's overhead? I have no idea how much extra overhead is introduced by this patch, so I want to compare performance before and after applying it.
Probably you can collect some interesting stats using our new LISA function-profiling support. This should allow you to profile the time spent in update_sg_lb_stats (or any other function).
Please note that you should probably change the definition of this function from:

    static inline void update_sg_lb_stats

to be:

    noinline void update_sg_lb_stats
The usage of "noinline" will ensure that the function is available to ftrace for profiling.
Here is an example notebook to get started: https://github.com/ARM-software/lisa/blob/master/ipynb/examples/trace_analys...
You can easily update the notebook to get a comparison among two runs. Here is a notebook I've used to compare two different versions of energy_diff: https://gist.github.com/derkling/fd9d36e004da977cd3be0001c8abaa96
Thanks for sharing. Will do this.
But I think patch [1] is hard to get accepted because it introduces a race condition on the rq locks. That is why I went back to the simplest method, which resolves part of the issue in this patch.
AFAICS, at least on a 3.18 kernel, these are the calls which end up in update_sg_lb_stats:

idle_balance          /* release this_cpu lock */
rebalance_domains     /* does not hold any RQ lock */
  load_balance
    find_busiest_group
      update_sg_lb_stats
To me it seems that we always enter the update_sg_lb_stats without holding any RQ lock. Am I missing something?
Otherwise, what are the races we expect if we try to get the RQ lock for an idle CPU from within update_sg_lb_stats?
For example, two CPUs (e.g. CPU0 and CPU1) may race to update CPU2's util_avg and load_avg:

CPU0                                        CPU1

idle_balance                                ttwu_queue(CPU2's rq)
/* release this_cpu lock */                 /* acquire this_cpu lock */
rebalance_domains                           ttwu_do_activate
/* does not hold any RQ lock */               ttwu_activate
load_balance                                    activate_task
  find_busiest_group                              enqueue_task
    update_sg_lb_stats                              enqueue_task_fair
      update_blocked_averages(CPU2's rq)            enqueue_entity
        `-------------------------------------------> update_load_avg
                       race condition
                                            /* release this_cpu lock */

Ok, that's a race but it does not lead to a deadlock. Either CPU0 or CPU1 will update the CPU2 stats, and then the other CPU will get the lock and just bail out because the stats have already been updated.
Is that an issue? Maybe I'm missing something...
You are right, this will not introduce a deadlock; but the system will see more contention when acquiring/releasing rq locks, so it may introduce a potential performance regression.
[...]
Thanks, Leo Yan