From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch, just to discuss preventing the decay of a CPU's/task's util_avg signal in case it is preempted by RT or DL. It is likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test in which an RT task preempts a 100% CFS task for 300ms: https://pastebin.com/raw/rXNmRUZY . In traces I have seen that util_avg decays quickly, even before the RT task sleeps.
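For illustration, here is a rough user-space C approximation of that scenario (the pastebin config above is the actual test; CPU 0 and FIFO priority 50 are arbitrary choices here, and error handling is omitted). With mainline PELT's 32ms half-life, ~300ms of pure decay scales util_avg by about (1/2)^(300/32), so a fully built-up signal of ~1024 drops to low single digits:

/* Rough approximation of the rt-app test; needs root, build with gcc -O2 -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

static void *cfs_spinner(void *arg)
{
	volatile unsigned long dummy = 0;

	(void)arg;
	pin_to_cpu0();
	for (;;)			/* 100% CFS task */
		dummy++;
	return NULL;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 50 };
	struct timespec start, now;
	pthread_t cfs;

	pthread_create(&cfs, NULL, cfs_spinner, NULL);
	sleep(1);			/* let the CFS task's util_avg build up */

	pin_to_cpu0();
	sched_setscheduler(0, SCHED_FIFO, &sp);

	/* Busy-loop as RT for 300ms, preempting the CFS spinner. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000L +
		 (now.tv_nsec - start.tv_nsec) < 300000000L);

	return 0;
}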
The idea is to mark the CFS class from put_prev_task_fair() if the task was still running when the put_prev_task() happened, and to unmark it from pick_next_task_fair(). This approach keeps the solution within CFS, without modifying the other classes. What do you think?
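To make the intended semantics concrete before the raw diff, here is a small stand-alone toy model (ordinary user-space C, not kernel code; compile with -lm). It mirrors only the mainline PELT constants, a 1024us period and a 32-period half-life, and shows that freezing holds the signal while last_update_time keeps moving forward, so no catch-up decay happens on unfreeze:

#include <math.h>
#include <stdio.h>

/* Toy stand-in for struct sched_avg, just enough to show the freeze idea. */
struct toy_avg {
	double util;		/* models util_avg */
	long last_update;	/* models last_update_time, in periods */
	int freeze_updates;	/* the one bit this patch needs */
};

/* PELT decays by y per 1024us period, with y^32 == 0.5. */
static const double y = 0.97857206;	/* 0.5^(1/32) */

static void toy_update(struct toy_avg *sa, long now)
{
	long periods = now - sa->last_update;

	sa->last_update = now;		/* time moves forward regardless */
	if (sa->freeze_updates)
		return;			/* skip decay, like accumulate_sum() below */
	sa->util *= pow(y, periods);
}

int main(void)
{
	struct toy_avg sa = { .util = 1024.0 };

	/* put_prev while still runnable (preempted by RT): freeze. */
	sa.freeze_updates = 1;
	toy_update(&sa, 293);		/* ~300ms at 1024us per period */
	printf("frozen:   util = %.1f\n", sa.util);	/* stays ~1024.0 */

	/* The same 300ms without the patch: the signal all but vanishes. */
	sa = (struct toy_avg){ .util = 1024.0 };
	toy_update(&sa, 293);
	printf("unfrozen: util = %.1f\n", sa.util);	/* decays to ~1.8 */
	return 0;
}

When pick_next_task_fair() clears the bit, updates simply resume from the frozen value.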
Note:
- Just for demo/trial purposes, I used a full int in sched_avg. I am open
  to any ideas on a flag we can set; all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies
  on the task sleeping, and in fact could be affected by this issue itself
  in case the task's util-est EWMA hasn't built up.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP
  selection and estimation of capacity_of() etc. This patch solves a
  different issue.
CC: "eas-dev" eas-dev@lists.linaro.org CC: vincent.guittot@linaro.org CC: patrick.bellasi@arm.com CC: dietmar.eggemann@arm.com CC: chris.redpath@arm.com CC: juri.lelli@redhat.com CC: tkjos@google.com Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org --- include/linux/sched.h | 1 + kernel/sched/fair.c | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;
 
 struct sched_statistics {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
+
 	scale_freq = arch_scale_freq_capacity(cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
@@ -6963,6 +6970,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	p = task_of(se);
 
 done: __maybe_unused;
+
 #ifdef CONFIG_SMP
 	/*
 	 * Move the next running task to the front of
@@ -6975,6 +6983,8 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	rq->cfs.avg.freeze_updates = 0;
+	p->se.avg.freeze_updates = 0;
 	return p;
 
 idle:
@@ -6991,6 +7001,7 @@ done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;
 
+	rq->cfs.avg.freeze_updates = 0;
 	return NULL;
 }
 
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
+	} else {
+		rq->cfs.avg.freeze_updates = 0;
+		prev->se.avg.freeze_updates = 0;
+	}
 }
 
 /*
-- 
2.18.0.rc1.242.g61856ae69a-goog