The current update of the rq's load can be erroneous when RT tasks are involved
The update of the load of a rq that becomes idle is done only if its avg_idle is less than sysctl_sched_migration_cost. If RT tasks and short idle durations alternate, the runnable_avg will not be updated correctly and the time will be accounted as idle time when a CFS task wakes up.
A new idle_enter function is called when the next task is the idle task, so the elapsed time will be accounted as run time in the rq's load, whatever the average idle time is. The function update_rq_runnable_avg is removed from idle_balance.
When an RT task is scheduled on an idle CPU, the update of the rq's load is not done when the rq exits the idle state, because CFS's functions are not called. Then idle_balance, which is called just before entering the idle task, updates the rq's load on the assumption that the elapsed time since the last update was only running time.
As a consequence, the rq's load of a CPU that only runs a periodic RT task is close to LOAD_AVG_MAX, whatever the actual running duration of the RT task is.
A new idle_exit function is called when the prev task is the idle task, so the elapsed time will be accounted as idle time in the rq's load.
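The size of the error can be checked with a toy model (illustration only, not the kernel code: a bare per-millisecond PELT-like step with y^32 = 1/2, and an assumed pattern of 1ms of RT work every 10ms):

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* PELT decay: y^32 = 1/2 */
	double as_running = 0.0, split = 0.0;
	int ms;

	for (ms = 0; ms < 1000; ms++) {
		int running = (ms % 10) == 0;	/* assumed: 1ms RT work every 10ms */

		/* current behaviour: the whole elapsed time counts as run time */
		as_running = as_running * y + 1024;
		/* split accounting: idle milliseconds contribute nothing */
		split = split * y + (running ? 1024 : 0);
	}
	printf("always-running: %.0f  split: %.0f  (LOAD_AVG_MAX = 47742)\n",
	       as_running, split);
	return 0;
}

When every elapsed millisecond is treated as run time, the sum settles near LOAD_AVG_MAX (47742); with idle time accounted as idle, it settles around a tenth of that for this pattern.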
Changes since V3:
- Remove dependency with CONFIG_FAIR_GROUP_SCHED
- Add a new idle_enter function and create a post_schedule callback for idle class
- Remove the update_runnable_avg from idle_balance
Changes since V2:
- remove useless definition for UP platform
- rebased on top of Steven Rostedt's patches:
  https://lkml.org/lkml/2013/2/12/558
Changes since V1:
- move code out of schedule function and create a pre_schedule callback for idle class instead
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c      | 23 +++++++++++++++++++++--
 kernel/sched/idle_task.c | 10 ++++++++++
 kernel/sched/sched.h     | 12 ++++++++++++
 3 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fcdbff..1851ca8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
+
+/*
+ * Update the rq's load with the elapsed running time before entering
+ * idle. if the last scheduled task is not a CFS task, idle_enter will
+ * be the only way to update the runnable statistic.
+ */
+void idle_enter(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 1);
+}
+
+/*
+ * Update the rq's load with the elapsed idle time before a task is
+ * scheduled. if the newly scheduled task is not a CFS task, idle_exit will
+ * be the only way to update the runnable statistic.
+ */
+void idle_exit(struct rq *this_rq)
+{
+	update_rq_runnable_avg(this_rq, 0);
+}
+
 #else
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
@@ -5219,8 +5240,6 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
-	update_rq_runnable_avg(this_rq, 1);
-
 	/*
 	 * Drop the rq->lock, but keep preempt disabled.
 	 */
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 66b5220..0775261 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -14,8 +14,17 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
 
+static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
+{
+	/* Update rq's load with elapsed idle time */
+	idle_exit(rq);
+}
+
 static void post_schedule_idle(struct rq *rq)
 {
+	/* Update rq's load with elapsed running time */
+	idle_enter(rq);
+
 	idle_balance(smp_processor_id(), rq);
 }
 #endif /* CONFIG_SMP */
@@ -95,6 +104,7 @@ const struct sched_class idle_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
+	.pre_schedule		= pre_schedule_idle,
 	.post_schedule		= post_schedule_idle,
 #endif
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ff4b029 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -878,6 +878,18 @@ extern const struct sched_class idle_sched_class;
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
 
+/*
+ * Only depends on SMP, FAIR_GROUP_SCHED may be removed when runnable_avg
+ * becomes useful in lb
+ */
+#if defined(CONFIG_FAIR_GROUP_SCHED)
+extern void idle_enter(struct rq *this_rq);
+extern void idle_exit(struct rq *this_rq);
+#else
+static inline void idle_enter(struct rq *this_rq) {}
+static inline void idle_exit(struct rq *this_rq) {}
+#endif
+
 #else	/* CONFIG_SMP */
 
 static inline void idle_balance(int cpu, struct rq *rq)
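To see when the two callbacks fire relative to a context switch, a stand-alone mock can help (userspace C, illustration only: the schedule_to() skeleton is invented and merely mirrors where pre_schedule/post_schedule run; it is not the kernel's __schedule):

#include <stdio.h>
#include <string.h>

struct rq { const char *curr; };

/* the two hooks from the patch, bodies reduced to prints */
static void idle_exit(struct rq *rq)
{
	printf("  idle_exit:  elapsed time accounted as idle time\n");
}

static void idle_enter(struct rq *rq)
{
	printf("  idle_enter: elapsed time accounted as run time\n");
}

/*
 * Mocked schedule path: pre_schedule belongs to the class of the task
 * being switched out, post_schedule to the class of the task switched
 * in; only the idle class installs these two callbacks.
 */
static void schedule_to(struct rq *rq, const char *next)
{
	printf("switch %s -> %s\n", rq->curr, next);
	if (!strcmp(rq->curr, "idle"))
		idle_exit(rq);		/* pre_schedule_idle() */
	rq->curr = next;
	if (!strcmp(rq->curr, "idle"))
		idle_enter(rq);		/* post_schedule_idle(), before idle_balance() */
}

int main(void)
{
	struct rq rq = { "idle" };

	schedule_to(&rq, "rt_task");	/* CPU leaves idle */
	schedule_to(&rq, "idle");	/* RT task done, CPU goes idle again */
	return 0;
}

The first switch prints idle_exit (the idle stretch is closed before the RT task runs) and the second prints idle_enter (the run stretch is closed before going idle), which is the accounting split the commit message describes.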
On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c      | 23 +++++++++++++++++++++--
>  kernel/sched/idle_task.c | 10 ++++++++++
>  kernel/sched/sched.h     | 12 ++++++++++++
>  3 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fcdbff..1851ca8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
>  		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>  	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
>  }
> +
> +/*
> + * Update the rq's load with the elapsed running time before entering
> + * idle. if the last scheduled task is not a CFS task, idle_enter will
> + * be the only way to update the runnable statistic.
> + */
> +void idle_enter(struct rq *this_rq)
> +{
> +	update_rq_runnable_avg(this_rq, 1);
> +}
> +
> +/*
> + * Update the rq's load with the elapsed idle time before a task is
> + * scheduled. if the newly scheduled task is not a CFS task, idle_exit will
> + * be the only way to update the runnable statistic.
> + */
> +void idle_exit(struct rq *this_rq)
> +{
> +	update_rq_runnable_avg(this_rq, 0);
> +}
These seem like fairly unfortunate names to expose to the global namespace, why not expose update_rq_runnable_avg() instead?
On 9 April 2013 10:50, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c      | 23 +++++++++++++++++++++--
>>  kernel/sched/idle_task.c | 10 ++++++++++
>>  kernel/sched/sched.h     | 12 ++++++++++++
>>  3 files changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fcdbff..1851ca8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1562,6 +1562,27 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
>>  		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>>  	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
>>  }
>> +
>> +/*
>> + * Update the rq's load with the elapsed running time before entering
>> + * idle. if the last scheduled task is not a CFS task, idle_enter will
>> + * be the only way to update the runnable statistic.
>> + */
>> +void idle_enter(struct rq *this_rq)
>> +{
>> +	update_rq_runnable_avg(this_rq, 1);
>> +}
>> +
>> +/*
>> + * Update the rq's load with the elapsed idle time before a task is
>> + * scheduled. if the newly scheduled task is not a CFS task, idle_exit will
>> + * be the only way to update the runnable statistic.
>> + */
>> +void idle_exit(struct rq *this_rq)
>> +{
>> +	update_rq_runnable_avg(this_rq, 0);
>> +}
>
> These seem like fairly unfortunate names to expose to the global
> namespace, why not expose update_rq_runnable_avg() instead?
Just to gather in one place all CFS actions that should be done when exiting idle, even if we only have update_rq_runnable_avg right now. I have distinguished that from idle_balance because this sequence can't generate an extra context switch like idle_balance can, and they would ultimately not be called at the same time.
On Tue, 2013-04-09 at 11:06 +0200, Vincent Guittot wrote:
>>> +void idle_enter(struct rq *this_rq)
>>> +{
>>> +	update_rq_runnable_avg(this_rq, 1);
>>> +}
>>> +
>>> +void idle_exit(struct rq *this_rq)
>>> +{
>>> +	update_rq_runnable_avg(this_rq, 0);
>>> +}
>>
>> These seem like fairly unfortunate names to expose to the global
>> namespace, why not expose update_rq_runnable_avg() instead?
>
> Just to gather in one place all CFS actions that should be done when
> exiting idle, even if we only have update_rq_runnable_avg right now.
> I have distinguished that from idle_balance because this sequence
> can't generate an extra context switch like idle_balance can, and
> they would ultimately not be called at the same time.
OK, but could we then please give them more scheduler specific names? It just seems to me that idle_enter/idle_exit() are very obvious function names for unrelated things.
How about calling it idle_{enter,exit}_fair; so that once other classes grow hooks we can do something like:
static void pre_schedule_idle(struct rq *rq, struct task_struct *p)
{
	struct sched_class *class;

	for_each_class(class) {
		if (class->idle_enter)
			class->idle_enter(rq);
	}
}
or whatnot..
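That optional-hook walk can be seen in a stand-alone form (userspace C, illustration only: the class table, the idle_enter member, and this for_each_class() are all mocked, not the kernel's definitions):

#include <stdio.h>

struct rq;	/* opaque here; the mocked hook ignores it */

/* hypothetical per-class ops table with an optional hook */
struct sched_class {
	const char *name;
	void (*idle_enter)(struct rq *rq);
};

static void fair_idle_enter(struct rq *rq)
{
	printf("fair: update runnable_avg before entering idle\n");
}

/* classes in priority order; only fair implements the hook */
static const struct sched_class classes[] = {
	{ "stop", NULL },
	{ "rt",   NULL },
	{ "fair", fair_idle_enter },
	{ "idle", NULL },
};

#define for_each_class(class) \
	for (class = classes; \
	     class != classes + sizeof(classes) / sizeof(classes[0]); class++)

int main(void)
{
	const struct sched_class *class;

	for_each_class(class) {
		if (class->idle_enter)	/* skip classes without the hook */
			class->idle_enter(NULL);
	}
	return 0;
}

Only the classes that implement the hook get called, in priority order, which is what would let other classes grow hooks later without touching the caller.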
On 10 April 2013 09:26, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-04-09 at 11:06 +0200, Vincent Guittot wrote:
>>>> +void idle_enter(struct rq *this_rq)
>>>> +{
>>>> +	update_rq_runnable_avg(this_rq, 1);
>>>> +}
>>>> +
>>>> +void idle_exit(struct rq *this_rq)
>>>> +{
>>>> +	update_rq_runnable_avg(this_rq, 0);
>>>> +}
>>>
>>> These seem like fairly unfortunate names to expose to the global
>>> namespace, why not expose update_rq_runnable_avg() instead?
>>
>> Just to gather in one place all CFS actions that should be done when
>> exiting idle, even if we only have update_rq_runnable_avg right now.
>> I have distinguished that from idle_balance because this sequence
>> can't generate an extra context switch like idle_balance can, and
>> they would ultimately not be called at the same time.
>
> OK, but could we then please give them more scheduler specific names?
> It just seems to me that idle_enter/idle_exit() are very obvious
> function names for unrelated things.
>
> How about calling it idle_{enter,exit}_fair; so that once other
> classes grow hooks we can do something like:
My primary goal was to align with the idle_balance name, but idle_{enter,exit}_fair is better.

In the same way, should we change idle_balance to idle_balance_fair?

And since we don't have Steve's irq constraint anymore, we could move idle_balance to the beginning of pick_next_task_fair? We would not get a spurious context switch and we would remove the fair function calls from the __schedule function.
Vincent
> static void pre_schedule_idle(struct rq *rq, struct task_struct *p)
> {
> 	struct sched_class *class;
>
> 	for_each_class(class) {
> 		if (class->idle_enter)
> 			class->idle_enter(rq);
> 	}
> }
>
> or whatnot..
On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
> Changes since V2:
> - remove useless definition for UP platform
> - rebased on top of Steven Rostedt's patches:
>   https://lkml.org/lkml/2013/2/12/558
So what's the status of those patches? I still worry about the extra context switch overhead for the high-frequency idle scenario.
On 9 April 2013 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>> Changes since V2:
>> - remove useless definition for UP platform
>> - rebased on top of Steven Rostedt's patches:
>>   https://lkml.org/lkml/2013/2/12/558
>
> So what's the status of those patches? I still worry about the extra
> context switch overhead for the high-frequency idle scenario.

I don't know. I have seen a "pulled" answer from Ingo but can't find the commits in the tip tree.

Steve, have you got more info about the status of your patches?
Vincent
On Tue, 2013-04-09 at 14:18 +0200, Vincent Guittot wrote:
> On 9 April 2013 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>>> Changes since V2:
>>> - remove useless definition for UP platform
>>> - rebased on top of Steven Rostedt's patches:
>>>   https://lkml.org/lkml/2013/2/12/558
>>
>> So what's the status of those patches? I still worry about the extra
>> context switch overhead for the high-frequency idle scenario.
>
> I don't know. I have seen a "pulled" answer from Ingo but can't find
> the commits in the tip tree.
>
> Steve, have you got more info about the status of your patches?
Yeah, I asked Ingo to revert it due to Peter's concerns. I was able to get the latencies I needed without that patch set. That made it not so urgent.
Can you rebase your patches doing something similar? That is, still use the pre/post_schedule_idle() calls, but don't base it off of my patch set.
Thanks,
-- Steve
On 9 April 2013 15:16, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2013-04-09 at 14:18 +0200, Vincent Guittot wrote:
>> On 9 April 2013 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, 2013-04-04 at 16:15 +0200, Vincent Guittot wrote:
>>>> Changes since V2:
>>>> - remove useless definition for UP platform
>>>> - rebased on top of Steven Rostedt's patches:
>>>>   https://lkml.org/lkml/2013/2/12/558
>>>
>>> So what's the status of those patches? I still worry about the extra
>>> context switch overhead for the high-frequency idle scenario.
>>
>> I don't know. I have seen a "pulled" answer from Ingo but can't find
>> the commits in the tip tree.
>>
>> Steve, have you got more info about the status of your patches?
>
> Yeah, I asked Ingo to revert it due to Peter's concerns. I was able to
> get the latencies I needed without that patch set. That made it not so
> urgent.
>
> Can you rebase your patches doing something similar? That is, still use
> the pre/post_schedule_idle() calls, but don't base it off of my patch
> set.
Yes. I'm going to rebase my patches and add the declaration of post_schedule_idle in my patch instead of using your patch.

Thanks,
Vincent