On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch, just to discuss preventing decay of a CPU's/task's util_avg signal in case it is preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with an RT task preempting a 100% CFS task for 300ms: https://pastebin.com/raw/rXNmRUZY I have seen in traces that util_avg decays quickly, even before the RT task sleeps.
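(For scale, here is a back-of-the-envelope userspace model of that decay, not kernel code; it assumes the stock PELT curve with y^32 = 0.5, treats one 1024us period as roughly 1ms, and starts from a fully built-up util_avg of 1024:)

/* pelt_decay.c: rough estimate of util_avg decay while preempted.
 * Standalone model, not kernel code. Build: gcc -o pelt_decay pelt_decay.c -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* per-period decay, y^32 == 0.5 */
	double util = 1024.0;			/* 100% task before preemption */

	for (int ms = 0; ms <= 300; ms += 60)
		printf("preempted %3d ms: util_avg ~ %7.1f\n",
		       ms, util * pow(y, ms));
	return 0;
}

By 300ms the signal has decayed from 1024 to roughly 1.5, i.e. it has all but vanished, which matches the traces described above.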
The idea is to mark the CFS task from put_prev_task_fair() if the task was running when put_prev_task() happened, and to unmark it from pick_next_task_fair(). This approach also keeps the solution within CFS, without modifying other classes. What do you think?
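(The unmark side is not shown in the hunks quoted below; based on the description above it would presumably be something like the following in pick_next_task_fair(). A sketch inferred from the prose, not the actual hunk:)

	/* sketch: task is picked to run again, resume normal PELT updates */
	rq->cfs.avg.freeze_updates = 0;
	p->se.avg.freeze_updates = 0;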
Note:
- Just for demo/trial purposes, I used a full int in sched_avg. I am open to any ideas on a flag we can set instead; all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact it could be affected by this issue itself in case the task's util_est EWMA hasn't built up yet.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection, estimation of capacity_of(), etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental thing shouldn't be changed, especially if you do this 'time warping' only for a subset of the involved entities (in this case tasks and root cfs_rq's).
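(Spelled out as a per-period recurrence, a simplified floating-point model of the geometric series rather than the kernel's fixed-point arithmetic:)

/* One simplified PELT period (~1ms); y = 2^(-1/32), so y^32 == 0.5. */
static double pelt_step(double util, int running)
{
	const double y = 0.978572;

	/* decay always applies; a contribution accrues only while running */
	return util * y + (running ? (1.0 - y) * 1024.0 : 0.0);
}

The quoted patch effectively introduces a third state in which util' == util, which is the departure from the definition being objected to here.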
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;
 
 struct sched_statistics {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
+
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
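(For context: the early return sits at the top of accumulate_sum(), which maintains all three sums together. A hand-simplified sketch of its structure, with the fixed-point details and __accumulate_pelt_segments() flattened into a plain per-period decay; not the verbatim mainline code:)

#include <stdint.h>

struct sched_avg_sketch {
	uint64_t	load_sum;
	uint64_t	runnable_load_sum;
	uint32_t	util_sum;
	int		freeze_updates;
};

/* one full period (~1ms) of PELT accounting for one entity */
static void accumulate_one_period(struct sched_avg_sketch *sa,
				  unsigned long load, unsigned long runnable,
				  int running)
{
	const uint32_t contrib = 1024;	/* full-period contribution */

	if (sa->freeze_updates)		/* the patch's early return... */
		return;			/* ...stalls everything below */

	/* decay: all three sums shrink by y = 2^(-1/32) ~= 1002/1024 */
	sa->load_sum = sa->load_sum * 1002 / 1024;
	sa->runnable_load_sum = sa->runnable_load_sum * 1002 / 1024;
	sa->util_sum = (uint32_t)((uint64_t)sa->util_sum * 1002 / 1024);

	/* accrue: load/runnable are weighted, util is capacity-scaled */
	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_load_sum += runnable * contrib;
	if (running)
		sa->util_sum += contrib << 10;
}

So freezing here freezes load_avg and runnable_load_avg along with util_avg, even though the stated goal is only to protect utilization.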
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
Only for tasks and root cfs_rq's, not for the cfs_rq's and se's representing task groups! And there will be synchronization issues when we add/subtract se's to/from cfs_rq's.
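(To make the scope concrete: the quoted hunk marks only prev->se, the task, and rq->cfs, the root cfs_rq. A version covering the whole hierarchy would presumably have to walk the group levels as well, something like the following hypothetical sketch using the kernel's for_each_sched_entity() idiom, which with CONFIG_FAIR_GROUP_SCHED walks se->parent up to the root:)

	if (!prev->state) {
		struct sched_entity *se = &prev->se;

		/*
		 * Freeze the task's se, every group se above it, and
		 * the cfs_rq each one is queued on.
		 */
		for_each_sched_entity(se) {
			se->avg.freeze_updates = 1;
			cfs_rq_of(se)->avg.freeze_updates = 1;
		}
	}

Even then, the attach/detach paths would operate on frozen sums, which is the synchronization problem noted above.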
[...]