Here are some patches with generally minor changes, posted together as a series. Patches 1/5 and 2/5 are related to skipping cpufreq updates for the dequeue of the last task before the CPU enters idle; they are mostly a rebase of [1]. Patches 3/5 and 4/5 fix some minor things I noticed after the remote cpufreq update work, and patch 5/5 is a small clean-up of find_idlest_group_cpu. Let me know your thoughts, and thanks. I've based these patches on peterz's queue.git master branch.
[1] https://patchwork.kernel.org/patch/9936555/
Joel Fernandes (5):
  Revert "sched/fair: Drop always true parameter of update_cfs_rq_load_avg()"
  sched/fair: Skip frequency update if CPU about to idle
  cpufreq: schedutil: Use idle_calls counter of the remote CPU
  sched/fair: Correct obsolete comment about cpufreq_update_util
  sched/fair: remove impossible condition from find_idlest_group_cpu
 include/linux/tick.h             |  1 +
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/fair.c              | 44 ++++++++++++++++++++++++++++------------
 kernel/sched/sched.h             |  1 +
 kernel/time/tick-sched.c         | 13 ++++++++++++
 5 files changed, 47 insertions(+), 14 deletions(-)
-- 2.15.0.rc2.357.g7e34df9404-goog
This reverts commit 3a123bbbb10d54dbdde6ccbbd519c74c91ba2f52.
The dropped parameter is needed again to control whether cpufreq is notified about a frequency update during an update to the utilization.
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/sched/fair.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b8e749..f97693fe8b6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 	/*
 	 * For !fair tasks do:
 	 *
-	update_cfs_rq_load_avg(now, cfs_rq);
+	update_cfs_rq_load_avg(now, cfs_rq, false);
 	attach_entity_load_avg(cfs_rq, se);
 	switched_from_fair(rq, p);
 	 *
@@ -3596,6 +3596,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_task()
  * @cfs_rq: cfs_rq to update
+ * @update_freq: should we call cfs_rq_util_change() or will the call do so
  *
  * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
  * avg. The immediate corollary is that all (fair) tasks must be attached, see
@@ -3609,7 +3610,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
  * call update_tg_load_avg() when this function returns true.
  */
 static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
@@ -3646,7 +3647,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif

-	if (decayed)
+	if (update_freq && decayed)
 		cfs_rq_util_change(cfs_rq);

 	return decayed;
@@ -3740,7 +3741,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cpu, cfs_rq, se);

-	decayed = update_cfs_rq_load_avg(now, cfs_rq);
+	decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
 	decayed |= propagate_entity_load_avg(se);

 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
@@ -3830,7 +3831,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf);

 #else /* CONFIG_SMP */

 static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	return 0;
 }
@@ -7318,7 +7319,7 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;

-		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);

 		/* Propagate pending load changes to the parent, if any: */
@@ -7391,7 +7392,7 @@ static inline void update_blocked_averages(int cpu)

 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
 	rq_unlock_irqrestore(rq, &rf);
 }
-- 2.15.0.rc2.357.g7e34df9404-goog
Updating the CPU frequency on the last dequeue of a CPU is useless: the utilization since the CPU came out of idle can only have grown by the time of the last dequeue, so we would be requesting a higher frequency just before entering idle, which is neither meaningful nor useful. It causes unwanted wakeups of the schedutil governor kthread on slow-switch systems, resulting in a large number of wakeups that could have been avoided. In an Android music-playback workload, where the music app's thread wakes up and sleeps periodically, it's seen that the frequency increases slightly on the dequeue and is reduced when the task wakes up again. This oscillation continues between 300 MHz and 350 MHz, even though the task runs at 300 MHz the whole time it is on the CPU. This is pointless. On top of that, these wakeups are unnecessary; in fact, most of the time when the sugov thread wakes up, all the CPUs are idle, so it can hurt power by disturbing the cluster while it is idling.
This patch prevents the frequency update on the last dequeue. With this, the number of schedutil governor thread wakeups is reduced by more than half (1389 -> 527).
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/sched/fair.c  | 25 ++++++++++++++++++++++---
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f97693fe8b6e..4c06e52935d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3725,6 +3725,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define UPDATE_TG	0x1
 #define SKIP_AGE_LOAD	0x2
 #define DO_ATTACH	0x4
+#define SKIP_CPUFREQ	0x8

 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -3741,7 +3742,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cpu, cfs_rq, se);

-	decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
+	decayed = update_cfs_rq_load_avg(now, cfs_rq, !(flags & SKIP_CPUFREQ));
 	decayed |= propagate_entity_load_avg(se);

 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
@@ -3839,6 +3840,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
+#define SKIP_CPUFREQ	0x0

 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -4060,6 +4062,8 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
+	int update_flags;
+
 	/*
 	 * Update run-time statistics of the 'current'.
 	 */
@@ -4073,7 +4077,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	update_flags = UPDATE_TG;
+
+	if (flags & DEQUEUE_IDLE)
+		update_flags |= SKIP_CPUFREQ;
+
+	update_load_avg(cfs_rq, se, update_flags);
 	dequeue_runnable_load_avg(cfs_rq, se);

 	update_stats_dequeue(cfs_rq, se, flags);
@@ -5220,6 +5229,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;

+	if (task_sleep && rq->nr_running == 1)
+		flags |= DEQUEUE_IDLE;
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
@@ -5250,13 +5262,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	}

 	for_each_sched_entity(se) {
+		int update_flags;
+
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;

 		if (cfs_rq_throttled(cfs_rq))
 			break;

-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		update_flags = UPDATE_TG;
+
+		if (flags & DEQUEUE_IDLE)
+			update_flags |= SKIP_CPUFREQ;
+
+		update_load_avg(cfs_rq, se, update_flags);
 		update_cfs_group(se);
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8aa24b41f652..68f5cd102744 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1394,6 +1394,7 @@ extern const u32 sched_prio_to_wmult[40];
 #define DEQUEUE_SAVE		0x02 /* matches ENQUEUE_RESTORE */
 #define DEQUEUE_MOVE		0x04 /* matches ENQUEUE_MOVE */
 #define DEQUEUE_NOCLOCK		0x08 /* matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_IDLE		0x10

 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
--
2.15.0.rc2.357.g7e34df9404-goog
On 28-10-17, 02:59, Joel Fernandes wrote:
Updating the CPU frequency on the last dequeue of a CPU is useless: the utilization since the CPU came out of idle can only have grown by the time of the last dequeue, so we would be requesting a higher frequency just before entering idle, which is neither meaningful nor useful. It causes unwanted wakeups of the schedutil governor kthread on slow-switch systems, resulting in a large number of wakeups that could have been avoided. In an Android music-playback workload, where the music app's thread wakes up and sleeps periodically, it's seen that the frequency increases slightly on the dequeue and is reduced when the task wakes up again. This oscillation continues between 300 MHz and 350 MHz, even though the task runs at 300 MHz the whole time it is on the CPU. This is pointless. On top of that, these wakeups are unnecessary; in fact, most of the time when the sugov thread wakes up, all the CPUs are idle, so it can hurt power by disturbing the cluster while it is idling.
This patch prevents the frequency update on the last dequeue. With this, the number of schedutil governor thread wakeups is reduced by more than half (1389 -> 527).
Cc: Rafael J. Wysocki rjw@rjwysocki.net Cc: Viresh Kumar viresh.kumar@linaro.org Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Joel Fernandes joelaf@google.com
kernel/sched/fair.c | 25 ++++++++++++++++++++++--- kernel/sched/sched.h | 1 + 2 files changed, 23 insertions(+), 3 deletions(-)
So you are doing this only for CFS, isn't that required for RT/DL as well?
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
-- viresh
Hi Viresh,
On Mon, Oct 30, 2017 at 5:07 AM, Viresh Kumar viresh.kumar@linaro.org wrote:
On 28-10-17, 02:59, Joel Fernandes wrote:
Updating the CPU frequency on the last dequeue of a CPU is useless: the utilization since the CPU came out of idle can only have grown by the time of the last dequeue, so we would be requesting a higher frequency just before entering idle, which is neither meaningful nor useful. It causes unwanted wakeups of the schedutil governor kthread on slow-switch systems, resulting in a large number of wakeups that could have been avoided. In an Android music-playback workload, where the music app's thread wakes up and sleeps periodically, it's seen that the frequency increases slightly on the dequeue and is reduced when the task wakes up again. This oscillation continues between 300 MHz and 350 MHz, even though the task runs at 300 MHz the whole time it is on the CPU. This is pointless. On top of that, these wakeups are unnecessary; in fact, most of the time when the sugov thread wakes up, all the CPUs are idle, so it can hurt power by disturbing the cluster while it is idling.
This patch prevents the frequency update on the last dequeue. With this, the number of schedutil governor thread wakeups is reduced by more than half (1389 -> 527).
Cc: Rafael J. Wysocki rjw@rjwysocki.net Cc: Viresh Kumar viresh.kumar@linaro.org Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Joel Fernandes joelaf@google.com
kernel/sched/fair.c | 25 ++++++++++++++++++++++--- kernel/sched/sched.h | 1 + 2 files changed, 23 insertions(+), 3 deletions(-)
So you are doing this only for CFS, isn't that required for RT/DL as well?
Yes, I agree we should. I remember this patch from Patrick doing something similar for RT: https://patchwork.kernel.org/patch/9825461/
That patch prevents the cpufreq update from the dequeue_task_rt path, since we would no longer trigger it from update_curr_rt all the time. Is this an acceptable solution for RT?
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.
Also, just looking at whether the CPU is currently idle in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur while the CPU is not idle.
Any thoughts about this?
thanks,
- Joel
On 10/30/2017 12:02 PM, Joel Fernandes wrote:
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.
Also, just looking at whether the CPU is currently idle in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur while the CPU is not idle.
Any thoughts about this?
Also if it really is the case that this bit of policy is universally desirable, I'd think it is better to do this in the scheduler and avoid a needless trip through a fn pointer out to schedutil for performance reasons.
On Wed, Nov 1, 2017 at 12:35 PM, Steve Muckle smuckle@google.com wrote:
On 10/30/2017 12:02 PM, Joel Fernandes wrote:
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.
Also, just looking at whether the CPU is currently idle in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur while the CPU is not idle.
Any thoughts about this?
Also if it really is the case that this bit of policy is universally desirable, I'd think it is better to do this in the scheduler and avoid a needless trip through a fn pointer out to schedutil for performance reasons.
I agree.
Peter, what do you think? Are you Ok with the approach of this patch (preventing of the cpufreq update call during dequeue)?
thanks,
- Joel
Hi Joel,
On 4 November 2017 at 06:44, Joel Fernandes joelaf@google.com wrote:
On Wed, Nov 1, 2017 at 12:35 PM, Steve Muckle smuckle@google.com wrote:
On 10/30/2017 12:02 PM, Joel Fernandes wrote:
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.
Also, just looking at whether the CPU is currently idle in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur while the CPU is not idle.
Any thoughts about this?
Also if it really is the case that this bit of policy is universally desirable, I'd think it is better to do this in the scheduler and avoid a needless trip through a fn pointer out to schedutil for performance reasons.
I agree.
Peter, what do you think? Are you Ok with the approach of this patch (preventing of the cpufreq update call during dequeue)?
I'm not really convinced that we should do change OPP at dequeue. Although I agree that it makes perfect sense to prevent increasing the OPP just before going idle for mp3 playback, I'm not sure that this is always the case, especially for performance-oriented use cases, because we delay the OPP increase and as a result hurt the responsiveness of the CPU. In fact this decision really depends on how long we are going to sleep. If the CPU will wake up in a few ms, it's worth increasing the frequency already when utilization is above the threshold, because it will not decrease enough to go back to a lower OPP. At the opposite end, if the CPU will not wake up shortly, skipping the OPP change can make sense.
Regarding the reduction of the number of OPP changes, will the util_est feature provide the same amount of reduction by directly providing the estimated max utilization?
Just to say that IMHO skipping the OPP change at dequeue (or not) is a policy decision, not a generic one.
thanks,
- Joel
On Mon, Nov 06, 2017 at 09:00:45AM +0100, Vincent Guittot wrote:
Hi Joel,
On 4 November 2017 at 06:44, Joel Fernandes joelaf@google.com wrote:
On Wed, Nov 1, 2017 at 12:35 PM, Steve Muckle smuckle@google.com wrote:
On 10/30/2017 12:02 PM, Joel Fernandes wrote:
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.
Also, just looking at whether the CPU is currently idle in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur while the CPU is not idle.
Any thoughts about this?
Also if it really is the case that this bit of policy is universally desirable, I'd think it is better to do this in the scheduler and avoid a needless trip through a fn pointer out to schedutil for performance reasons.
I agree.
Peter, what do you think? Are you Ok with the approach of this patch (preventing of the cpufreq update call during dequeue)?
I'm not really convinced that we should do change OPP at dequeue. Although I agree that it makes perfect sense to prevent increasing the OPP just before going idle for mp3 playback, I'm not sure that this is always the case, especially for performance-oriented use cases, because we delay the OPP increase and as a result hurt the responsiveness of the CPU. In fact this decision really depends on how long we are going to sleep. If the CPU will wake up in a few ms, it's worth increasing the frequency already when utilization is above the threshold, because it will not decrease enough to go back to a lower OPP. At the opposite end, if the CPU will not wake up shortly, skipping the OPP change can make sense.
Regarding the reduction of the number of OPP changes, will the util_est feature provide the same amount of reduction by directly providing the estimated max utilization?
Just to say that IMHO skipping the OPP change at dequeue (or not) is a policy decision, not a generic one.
Indeed. Otherwise we may end up handling corner cases in the scheduler when it comes to OPP updates. I also agree that it is up to the policy to decide whether to update or not.
-- Uladzislau Rezki
Hi Vincent,
On Mon, Nov 6, 2017 at 12:00 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
Hi Joel,
On 4 November 2017 at 06:44, Joel Fernandes joelaf@google.com wrote:
On Wed, Nov 1, 2017 at 12:35 PM, Steve Muckle smuckle@google.com wrote:
On 10/30/2017 12:02 PM, Joel Fernandes wrote:
Also, this more looks like a policy decision. Will it be better to put that directly into schedutil? Like this:
if (cpu_idle()) "Don't change the freq";
Will something like that work?
I thought about this and I think it won't work very well. In the dequeue path we're still running the task being dequeued, so the CPU is not yet idle. What is needed here IMO is a notion that the CPU is possibly about to idle, and we can predict that from the dequeue path of the CFS class.
Also, just looking at whether the CPU is currently idle in the governor doesn't help to differentiate between, say, the dequeue path and the tick path, both of which can occur while the CPU is not idle.
Any thoughts about this?
Also if it really is the case that this bit of policy is universally desirable, I'd think it is better to do this in the scheduler and avoid a needless trip through a fn pointer out to schedutil for performance reasons.
I agree.
Peter, what do you think? Are you Ok with the approach of this patch (preventing of the cpufreq update call during dequeue)?
I'm not really convinced that we should do change OPP at dequeue.
You mean "should not do", right?
Although I agree that it makes perfect sense to prevent increasing the OPP just before going idle for mp3 playback, I'm not sure that this is always the case, especially for performance-oriented use cases, because we delay the OPP increase and as a result hurt the responsiveness of the CPU.
Actually, I think another way to look at it is that this is no worse performance-wise than if the task had been running the whole time (hadn't slept): the next opportunity to increase the frequency would then be the next tick anyway. So I am not sure it practically makes any difference to the performance of the task. Even if the sleep is short, we would update the frequency on the next enqueue/tick, so I think we're still fine from a time-granularity standpoint.
In fact this decision really depends on how long we are going to sleep. If the CPU will wake up in a few ms, it's worth increasing the frequency already when utilization is above the threshold, because it will not decrease enough to go back to a lower OPP. At the opposite end, if the CPU will not wake up shortly, skipping the OPP change can make sense.
Regarding the reduction of the number of OPP changes, will the util_est feature provide the same amount of reduction by directly providing the estimated max utilization?
Yes, I think util_est can help reduce the oscillations which cause this issue, but other than the fact that util_est is not currently mainlined, I think it will still have the same problem if util_est's estimate itself oscillates. So I see util_est as more of a mitigation than a solution. Do you agree?
Just to say that IMHO skipping the OPP change at dequeue (or not) is a policy decision, not a generic one.
Yes, I agree, but this also means we have to expose scheduler internals to the governor to detect this case. Are you suggesting that, if this were to be implemented, we pass a flag to the governor and let it make the decision there based on policy? As I was discussing in the earlier thread with Viresh, simply checking whether the CPU is idle in the governor isn't good enough, and since this issue is about the scheduler making a cpufreq update call when it didn't need to, avoiding such a call makes sense to me.
thanks,
- Joel
Since the recent remote cpufreq callback work, it's possible that a cpufreq update is triggered from a remote CPU. For single policies, however, the current code uses the local CPU when trying to determine whether the remote sg_cpu entered idle or is busy. This is incorrect. To remedy it, compare against the nohz tick idle_calls counter of the remote CPU.
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 include/linux/tick.h             |  1 +
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/time/tick-sched.c         | 13 +++++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index fe01e68bf520..31bf2ae3e9a7 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -118,6 +118,7 @@ extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern unsigned long tick_nohz_get_idle_calls(void);
+extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9209d83ecdcf..1383aa9efb4f 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -244,7 +244,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, unsigned long *util,
 #ifdef CONFIG_NO_HZ_COMMON
 static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
 {
-	unsigned long idle_calls = tick_nohz_get_idle_calls();
+	unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
 	bool ret = idle_calls == sg_cpu->saved_idle_calls;

 	sg_cpu->saved_idle_calls = idle_calls;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7b258c59d78a..91d5e7be805b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1011,6 +1011,19 @@ ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }

+/**
+ * tick_nohz_get_idle_calls_cpu - return the current idle calls counter value
+ * for a particular CPU.
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls_cpu(int cpu)
+{
+	struct tick_sched *ts = tick_get_tick_sched(cpu);
+
+	return ts->idle_calls;
+}
+
 /**
  * tick_nohz_get_idle_calls - return the current idle calls counter value
  *
--
2.15.0.rc2.357.g7e34df9404-goog
On 28-10-17, 02:59, Joel Fernandes wrote:
Since the recent remote cpufreq callback work, it's possible that a cpufreq update is triggered from a remote CPU. For single policies, however, the current code uses the local CPU when trying to determine whether the remote sg_cpu entered idle or is busy. This is incorrect. To remedy it, compare against the nohz tick idle_calls counter of the remote CPU.
Cc: Rafael J. Wysocki rjw@rjwysocki.net Cc: Viresh Kumar viresh.kumar@linaro.org Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Frederic Weisbecker fweisbec@gmail.com Cc: Thomas Gleixner tglx@linutronix.de
Fixes: 674e75411fc2 ("sched: cpufreq: Allow remote cpufreq callbacks")
Signed-off-by: Joel Fernandes joelaf@google.com
include/linux/tick.h | 1 + kernel/sched/cpufreq_schedutil.c | 2 +- kernel/time/tick-sched.c | 13 +++++++++++++ 3 files changed, 15 insertions(+), 1 deletion(-)
Acked-by: Viresh Kumar viresh.kumar@linaro.org
-- viresh
Since the remote cpufreq callback work, the cpufreq_update_util call can happen from remote CPUs. The comment about local CPUs is thus obsolete. Update it accordingly.
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4c06e52935d3..5c49fdb4c508 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3018,9 +3018,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be - * a real problem -- added to that it only calls on the local - * CPU, so if we enqueue remotely we'll miss an update, but - * the next tick/schedule should update. + * a real problem. * * It will not get called when we go idle, because the idle * thread is a different class (!fair), nor will the utilization -- 2.15.0.rc2.357.g7e34df9404-goog
You have prefixed most of the Cc'd names with "Cc: " somehow :)
On 28-10-17, 02:59, Joel Fernandes wrote:
Since the remote cpufreq callback work, the cpufreq_update_util call can happen from remote CPUs. The comment about local CPUs is thus obsolete. Update it accordingly.
We normally keep the column width at 72 in commit logs instead of 80, as with 'git log' the text is indented by a tab and we would otherwise cross 80 columns.
Cc: Viresh Kumar viresh.kumar@linaro.org Cc: Rafael J. Wysocki rjw@rjwysocki.net Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Joel Fernandes joelaf@google.com
kernel/sched/fair.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c06e52935d3..5c49fdb4c508 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3018,9 +3018,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	/*
 	 * There are a few boundary cases this might miss but it should
 	 * get called often enough that that should (hopefully) not be
-	 * a real problem -- added to that it only calls on the local
-	 * CPU, so if we enqueue remotely we'll miss an update, but
-	 * the next tick/schedule should update.
+	 * a real problem.
 	 *
 	 * It will not get called when we go idle, because the idle
 	 * thread is a different class (!fair), nor will the utilization
Reviewed-by: Viresh Kumar viresh.kumar@linaro.org
-- viresh
Hi Viresh,
On Mon, Oct 30, 2017 at 2:22 AM, Viresh Kumar viresh.kumar@linaro.org wrote:
You have prefixed most of the Cc'd names with "Cc: " somehow :)
Yeah :( What happened is I decided to play with using a text file as input to --cc-cmd for git send-email. Turns out I was too careless and forgot to remove the "Cc: " prefix when I created the CC list. Anyway, lesson learnt :)
On 28-10-17, 02:59, Joel Fernandes wrote:
Since the remote cpufreq callback work, the cpufreq_update_util call can happen from remote CPUs. The comment about local CPUs is thus obsolete. Update it accordingly.
We normally keep the column width as 72 in commit logs instead of 80, as with 'git log' this is indented by a tab and then we would cross 80 columns.
Ok, I'll keep that in mind and use 72 characters for future patches.
Cc: Viresh Kumar viresh.kumar@linaro.org Cc: Rafael J. Wysocki rjw@rjwysocki.net Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Joel Fernandes joelaf@google.com
kernel/sched/fair.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c06e52935d3..5c49fdb4c508 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3018,9 +3018,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	/*
 	 * There are a few boundary cases this might miss but it should
 	 * get called often enough that that should (hopefully) not be
-	 * a real problem -- added to that it only calls on the local
-	 * CPU, so if we enqueue remotely we'll miss an update, but
-	 * the next tick/schedule should update.
+	 * a real problem.
 	 *
 	 * It will not get called when we go idle, because the idle
 	 * thread is a different class (!fair), nor will the utilization
Reviewed-by: Viresh Kumar viresh.kumar@linaro.org
Thank you!
- Joel
find_idlest_group_cpu goes through the CPUs of a group previously selected by find_idlest_group. find_idlest_group returns NULL if the local group is the selected one, and find_idlest_group_cpu doesn't execute if the group to which 'cpu' belongs is chosen. So we're always guaranteed to call find_idlest_group_cpu with a group to which 'cpu' is non-local. This makes one of the conditions in find_idlest_group_cpu an impossible one, which we can get rid of.
Cc: Ingo Molnar mingo@redhat.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Brendan Jackman brendan.jackman@arm.com
Cc: Dietmar Eggemann dietmar.eggemann@arm.com
Signed-off-by: Joel Fernandes joelaf@google.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c49fdb4c508..740602ce799f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5922,7 +5922,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
-			if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (load < min_load) {
 				min_load = load;
 				least_loaded_cpu = i;
 			}
--
2.15.0.rc2.357.g7e34df9404-goog
On Sat, Oct 28 2017 at 09:59, Joel Fernandes wrote:
find_idlest_group_cpu goes through the CPUs of a group previously selected by find_idlest_group. find_idlest_group returns NULL if the local group is the selected one, and find_idlest_group_cpu doesn't execute if the group to which 'cpu' belongs is chosen. So we're always guaranteed to call find_idlest_group_cpu with a group to which 'cpu' is non-local. This makes one of the conditions in find_idlest_group_cpu an impossible one, which we can get rid of.
Cc: Ingo Molnar mingo@redhat.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Brendan Jackman brendan.jackman@arm.com
Cc: Dietmar Eggemann dietmar.eggemann@arm.com
Signed-off-by: Joel Fernandes joelaf@google.com
FWIW:
Reviewed-by: Brendan Jackman brendan.jackman@arm.com
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c49fdb4c508..740602ce799f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5922,7 +5922,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
-			if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (load < min_load) {
 				min_load = load;
 				least_loaded_cpu = i;
 			}
On 28 October 2017 at 11:59, Joel Fernandes joelaf@google.com wrote:
find_idlest_group_cpu goes through the CPUs of a group previously selected by find_idlest_group. find_idlest_group returns NULL if the local group is the selected one, and find_idlest_group_cpu doesn't execute if the group to which 'cpu' belongs is chosen. So we're always guaranteed to call find_idlest_group_cpu with a group to which 'cpu' is non-local. This makes one of
Is this still true in case of overlapping topology?
the conditions in find_idlest_group_cpu an impossible one, which we can get rid of.
Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Brendan Jackman brendan.jackman@arm.com Cc: Dietmar dietmar.eggemann@arm.com Signed-off-by: Joel Fernandes joelaf@google.com
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c49fdb4c508..740602ce799f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5922,7 +5922,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
-			if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (load < min_load) {
 				min_load = load;
 				least_loaded_cpu = i;
 			}
-- 2.15.0.rc2.357.g7e34df9404-goog
Hi Vincent,
On Mon, Oct 30, 2017 at 9:00 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
On 28 October 2017 at 11:59, Joel Fernandes joelaf@google.com wrote:
find_idlest_group_cpu goes through the CPUs of a group previously selected by find_idlest_group. find_idlest_group returns NULL if the local group is the selected one, and find_idlest_group_cpu doesn't execute if the group to which 'cpu' belongs is chosen. So we're always guaranteed to call find_idlest_group_cpu with a group to which 'cpu' is non-local. This makes one of
Is this still true in case of overlapping topology?
Yes, I believe so. As per the code, find_idlest_group will only return a group to which this_cpu doesn't belong. So in case an overlapping group is returned by find_idlest_group (instead of NULL), none of the groups (among the set of overlapping groups) is local to this_cpu. In case NULL is returned, find_idlest_group_cpu doesn't execute at all.
Do you agree?
thanks,
- Joel
[..]
On 30 October 2017 at 17:19, Joel Fernandes joelaf@google.com wrote:
Hi Vincent,
On Mon, Oct 30, 2017 at 9:00 AM, Vincent Guittot vincent.guittot@linaro.org wrote:
On 28 October 2017 at 11:59, Joel Fernandes joelaf@google.com wrote:
find_idlest_group_cpu goes through the CPUs of a group previously selected by find_idlest_group. find_idlest_group returns NULL if the local group is the selected one, and find_idlest_group_cpu doesn't execute if the group to which 'cpu' belongs is chosen. So we're always guaranteed to call find_idlest_group_cpu with a group to which 'cpu' is non-local. This makes one of
Is this still true in case of overlapping topology?
Yes, I believe so. As per the code, find_idlest_group will only return a group to which this_cpu doesn't belong. So in case an overlapping group is returned by find_idlest_group (instead of NULL), none of the groups (among the set of overlapping groups) is local to this_cpu. In case NULL is returned, find_idlest_group_cpu doesn't execute at all.
Do you agree?
Yes, you're right. We skip all groups to which this_cpu belongs.
thanks,
- Joel
[..]