This patch series is essentially based on Morten's patch "sched/fair: Compute task/cpu utilization at wake-up more correctly"; the goal is a more accurate estimation of CPU utilization so that a more suitable CPU can be chosen at wake-up.
Before this series there are two main issues with CPU utilization:

- Without Morten's patch, the task's previous CPU still carries the task's stale utilization; so after the task is woken up, if we add the previous CPU's utilization and the task's utilization together, part of the task's utilization is actually counted twice. As a result, the previous CPU has less chance to be chosen for the task.
The patch "sched/fair: use cpu_util_wake() for energy awared path" builds on Morten's patch to calibrate the previous CPU's utilization value when the task has run on it.
- Another well-known issue is that an idle CPU's utilization keeps a stale value after the CPU enters an idle state; the value will not change until the CPU is woken up again. This misleads the selection of the target CPU.
In the kernel, the function update_blocked_averages() can be called directly to update an idle CPU's utilization value. But this function acquires the CPU's rq lock, which introduces contention between CPUs. This is the main concern, since it may cause a performance issue; so the update is only done when the CPU is idle and its utilization value has not yet decayed to 0, as in the sketch below.
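A minimal sketch of that guard (it mirrors the hunk in patch 3; idle_cpu(), cpu_util() and update_blocked_averages() already exist in kernel/sched/fair.c):

	/* Only poke idle CPUs whose blocked utilization has not yet fully
	 * decayed, so the rq lock taken inside update_blocked_averages()
	 * is contended as rarely as possible. */
	for_each_online_cpu(i) {
		if (idle_cpu(i) && cpu_util(i))
			update_blocked_averages(i);
	}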
Leo Yan (3):
  sched/fair: use cpu_util_wake() for energy awared path
  sched/fair: add trace point for sched_new_util
  sched/fair: update idle CPUs utilization when wake task

Morten Rasmussen (1):
  sched/fair: Compute task/cpu utilization at wake-up more correctly

 include/trace/events/sched.h | 25 ++++++++++++++
 kernel/sched/fair.c          | 80 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 1 deletion(-)
--
1.9.1
From: Morten Rasmussen <morten.rasmussen@arm.com>
At task wake-up load-tracking isn't updated until the task is enqueued. The task's own view of its utilization contribution may therefore not be aligned with its contribution to the cfs_rq load-tracking which may have been updated in the meantime. Basically, the task's own utilization hasn't yet accounted for the sleep decay, while the cfs_rq may have (partially). Estimating the cfs_rq utilization in case the task is migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg is therefore incorrect as the two load-tracking signals aren't time synchronized (different last update).
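To make the misalignment concrete, here is a hypothetical illustration (the numbers assume the usual PELT half-life of 32ms; they are not taken from the patch):

	/*
	 * p went to sleep with p->se.avg.util_avg == 400, last updated at t0.
	 * The cfs_rq was last updated at t0 + 100ms, by which time p's blocked
	 * contribution inside cfs_rq->avg.util_avg had decayed to roughly 45
	 * (100ms is about three 32ms half-lives). Computing
	 * task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg subtracts the
	 * stale 400 and underestimates the cfs_rq utilization by ~355.
	 */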
To solve this problem, this patch introduces task_util_wake() which computes the decayed task utilization based on the previous cpu's last load-tracking update. It is done without having to take the rq lock, similar to how it is done in remove_entity_load_avg().
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 17dcd8e..0385723 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5414,6 +5414,75 @@ static bool need_filter_task(struct task_struct *p)
 	return 0;
 }
 
+/*
+ * task_util_wake: Returns an updated estimate of the utilization contribution
+ * of a waking task. At wake-up the task blocked utilization contribution
+ * (cfs_rq->avg) may have decayed while the utilization tracking of the task
+ * (se->avg) hasn't yet.
+ * Note that this estimate isn't perfectly accurate as the 1ms boundaries used
+ * for updating util_avg in __update_load_avg() are not considered here. This
+ * results in an error of up to 1ms utilization decay/accumulation which leads
+ * to an absolute util_avg error margin of 1024*1024/LOAD_AVG_MAX ~= 22
+ * (for LOAD_AVG_MAX = 47742).
+ */
+static inline int task_util_wake(struct task_struct *p)
+{
+	struct cfs_rq *prev_cfs_rq = &task_rq(p)->cfs;
+	struct sched_avg *psa = &p->se.avg;
+	u64 cfs_rq_last_update, p_last_update, delta;
+	u32 util_decayed;
+
+	p_last_update = psa->last_update_time;
+
+	/*
+	 * Task on rq (exec()) should be load-tracking aligned already.
+	 * New tasks have no history and should use the init value.
+	 */
+	if (p->se.on_rq || !p_last_update)
+		return task_util(p);
+
+	cfs_rq_last_update = cfs_rq_last_update_time(prev_cfs_rq);
+	delta = cfs_rq_last_update - p_last_update;
+
+	if ((s64)delta <= 0)
+		return task_util(p);
+
+	delta >>= 20;
+
+	if (!delta)
+		return task_util(p);
+
+	util_decayed = decay_load((u64)psa->util_sum, delta);
+	util_decayed /= LOAD_AVG_MAX;
+
+	/*
+	 * psa->util_avg can be slightly out of date as it is only updated
+	 * when a 1ms boundary is crossed.
+	 * See 'decayed' in __update_load_avg()
+	 */
+	util_decayed = min_t(unsigned long, util_decayed, task_util(p));
+
+	return util_decayed;
+}
+
+/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+	unsigned long util, capacity;
+
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+		return cpu_util(cpu);
+
+	capacity = capacity_orig_of(cpu);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_wake(p), 0);
+
+	return (util >= capacity) ? capacity : util;
+}
+
 #ifdef CONFIG_SCHED_TUNE
 
 static long
--
1.9.1
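As a back-of-the-envelope check of the estimate above, here is a minimal userspace sketch (an idealized model, not kernel code: it treats PELT as pure geometric decay with the 32-period half-life and ignores the 1ms-boundary error discussed in the comment):

	#include <stdio.h>
	#include <math.h>

	#define LOAD_AVG_MAX	47742

	/* Decay a task's util_sum over the ~1ms periods it slept, then
	 * convert to the util_avg scale by dividing by LOAD_AVG_MAX,
	 * mimicking decay_load() and the divide in task_util_wake(). */
	static unsigned int decayed_task_util(unsigned long long util_sum,
					      unsigned long long sleep_ns)
	{
		unsigned long long periods = sleep_ns >> 20; /* like 'delta >>= 20' */

		/* y^32 == 1/2: every 32 periods of sleep halve the signal */
		return (unsigned int)(util_sum * pow(0.5, periods / 32.0)
				      / LOAD_AVG_MAX);
	}

	int main(void)
	{
		/* fully utilized task: util_avg == 1024 */
		unsigned long long util_sum = 1024ULL * LOAD_AVG_MAX;

		printf("slept  0ms -> util %u\n", decayed_task_util(util_sum, 0));
		printf("slept 32ms -> util %u\n", decayed_task_util(util_sum, 32ULL << 20));
		printf("slept 64ms -> util %u\n", decayed_task_util(util_sum, 64ULL << 20));
		return 0;
	}

Built with -lm, it prints 1024, 512 and 256: the waking task's estimate halves for every 32ms of sleep, which is what task_util_wake() approximates without taking the rq lock.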
Function energy_aware_wake_cpu() selects a CPU for a task in two steps: first it selects a sched domain which can meet the performance requirement, and then it chooses a CPU within it. When choosing the CPU it adds the CPU's utilization and the task's utilization together and selects a candidate CPU with the lowest resulting OPP. The task's previous CPU usually still carries the task's historic utilization, so the previous CPU appears more utilized than it really is; as a result the previous CPU is taken as a CPU which would require a higher OPP, so the task is migrated to another CPU. But the migration is actually unnecessary, and we lose the cache-hot benefit of the previous CPU.
Based on Morten's patch, which calibrates the previous CPU's utilization for a woken-up task, we can get a consistent utilization value for that CPU. This finally helps keep the task on its previous CPU.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0385723..9d8a6fd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5985,13 +5985,14 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
 		int cap_idx;
+		int cpu_wake_util = cpu_util_wake(i, p);
 
 		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
 		 * so prev_cpu will receive a negative bias due to the double
 		 * accounting. However, the blocked utilization may be zero.
 		 */
-		new_util = cpu_util(i) + task_util_boosted;
+		new_util = cpu_wake_util + task_util_boosted;
 
 		/*
 		 * Ensure minimum capacity to grant the required boost.
--
1.9.1
Hi Leo,
On 09/08/2016 04:17 PM, Leo Yan wrote:
> Function energy_aware_wake_cpu() selects a CPU for a task in two steps: first it selects a sched domain which can meet the performance requirement, and then it chooses a CPU within it. When choosing the CPU it adds the CPU's utilization and the task's utilization together and selects a candidate CPU with the lowest resulting OPP. The task's previous CPU usually still carries the task's historic utilization, so the previous CPU appears more utilized than it really is; as a result the previous CPU is taken as a CPU which would require a higher OPP, so the task is migrated to another CPU. But the migration is actually unnecessary, and we lose the cache-hot benefit of the previous CPU.
>
> Based on Morten's patch, which calibrates the previous CPU's utilization for a woken-up task, we can get a consistent utilization value for that CPU. This finally helps keep the task on its previous CPU.
The patch is already integrated into EAS RFCv6.
current snapshot:
git://linux-arm.org/linux-power.git eas/next/integration_20160902_1524
Look for select_energy_cpu_brute() and capacity_spare_wake().
I don't know right now why you want to integrate this one into something which is based on EASv5.2? Is it for the product code line?

We can talk about this further during tomorrow's meeting.
--
Dietmar
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0385723..9d8a6fd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5985,13 +5985,14 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
>  	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
>  		int cap_idx;
> +		int cpu_wake_util = cpu_util_wake(i, p);
>
>  		/*
>  		 * p's blocked utilization is still accounted for on prev_cpu
>  		 * so prev_cpu will receive a negative bias due to the double
>  		 * accounting. However, the blocked utilization may be zero.
>  		 */
> -		new_util = cpu_util(i) + task_util_boosted;
> +		new_util = cpu_wake_util + task_util_boosted;
>
>  		/*
>  		 * Ensure minimum capacity to grant the required boost.
Add a trace point, sched_new_util, which reports the estimated CPU utilization plus task utilization used for CPU selection.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/trace/events/sched.h | 25 +++++++++++++++++++++++++
 kernel/sched/fair.c          |  2 ++
 2 files changed, 27 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index c50310a..a410e2b2 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -707,6 +707,31 @@ TRACE_EVENT(sched_load_avg_cpu,
 );
 
 /*
+ * Tracepoint for CPU estimation util when wake up one task
+ */
+TRACE_EVENT(sched_new_util,
+
+	TP_PROTO(int cpu, int util_wake, int new_util),
+
+	TP_ARGS(cpu, util_wake, new_util),
+
+	TP_STRUCT__entry(
+		__field( int,	cpu		)
+		__field( int,	util_wake	)
+		__field( int,	new_util	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu		= cpu;
+		__entry->util_wake	= util_wake;
+		__entry->new_util	= new_util;
+	),
+
+	TP_printk("cpu=%d util_wake=%u new_util=%u",
+		  __entry->cpu, __entry->util_wake, __entry->new_util)
+);
+
+/*
  * Tracepoint for sched_tune_config settings
  */
 TRACE_EVENT(sched_tune_config,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d8a6fd..e0b50ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5994,6 +5994,8 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 		 */
 		new_util = cpu_wake_util + task_util_boosted;
 
+		trace_sched_new_util(i, cpu_wake_util, new_util);
+
 		/*
 		 * Ensure minimum capacity to grant the required boost.
 		 * The target CPU can be already at a capacity level higher
--
1.9.1
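For anyone who wants to try the trace point, a usage sketch via the standard ftrace interface (the sample output line is hypothetical, following the TP_printk format above):

	# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_new_util/enable
	# cat /sys/kernel/debug/tracing/trace_pipe
	<...>-1234  [002] d..4  105.123456: sched_new_util: cpu=2 util_wake=345 new_util=601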
When waking up a task, CPU selection relies heavily on CPU utilization. But in some cases a CPU's utilization is high just before it enters an idle state, and the value will not be updated while the CPU is idle. This misleads CPU selection.
This patch updates idle CPUs' utilization values until they have decayed to 0. Because a CPU has no tasks on its run queue after it enters an idle state, iterating the cfs_rq hierarchy should not introduce much load.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e0b50ca..8e8767b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5955,6 +5955,12 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 	sg = sd->groups;
 	sg_target = sg;
 
+	/* update idle CPUs utilization */
+	for_each_online_cpu(i) {
+		if (idle_cpu(i) && cpu_util(i))
+			update_blocked_averages(i);
+	}
+
 	if (sysctl_sched_is_big_little) {
 
 		/*
--
1.9.1
On 09/08/2016 04:17 PM, Leo Yan wrote:
> When waking up a task, CPU selection relies heavily on CPU utilization. But in some cases a CPU's utilization is high just before it enters an idle state, and the value will not be updated while the CPU is idle. This misleads CPU selection.
>
> This patch updates idle CPUs' utilization values until they have decayed to 0. Because a CPU has no tasks on its run queue after it enters an idle state, iterating the cfs_rq hierarchy should not introduce much load.
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e0b50ca..8e8767b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5955,6 +5955,12 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
>  	sg = sd->groups;
>  	sg_target = sg;
>
> +	/* update idle CPUs utilization */
> +	for_each_online_cpu(i) {
> +		if (idle_cpu(i) && cpu_util(i))
> +			update_blocked_averages(i);
> +	}
> +
>  	if (sysctl_sched_is_big_little) {
>
>  		/*
The update_blocked_averages() call in the wakeup path is probably a little bit too harsh. Weren't you working on something which does the updates in load-balance context? In that case we should focus on exploring this possibility, because I guess it has a better chance of getting mainlined.