Hi Joel,
On 14 June 2018 at 11:26, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
On 13-Jun 21:26, Joel Fernandes wrote:
Hi Dietmar,
On Mon, Jun 11, 2018 at 02:44:47PM +0200, Dietmar Eggemann wrote:
On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch, just to discuss preventing the decay of a CPU's/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with an RT task preempting a 100% CFS task for 300ms: https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly, even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair() if the task was running when put_prev_task() happened, and to unmark it from pick_next_task_fair(). This approach also keeps the solution within CFS, without modification to other classes. What do you think?
Note:
- Just for demo/trial, I used a full int in sched_avg. I am open to any ideas on a flag we can set; all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact it could be affected by this issue itself in case the task's util_est EWMA hasn't built up yet.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of() etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental property shouldn't be changed, especially if you do this 'time warping' only for a subset of the involved entities (in this case tasks and root cfs_rq's).
Ok.
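(For readers following along: the 'accrue+decay' rule above boils down to roughly the following. This is a minimal userspace sketch of the PELT idea, not the kernel's actual ___update_load_avg() code; frequency/capacity scaling and sub-period handling are omitted, and the constants are approximations.)

#include <stdio.h>

/*
 * Sketch of the PELT update rule: once per ~1ms period the signal
 * decays by y (where y^32 == 0.5, i.e. y ~= 0.97857), and a full
 * period's contribution accrues only if the entity was running.
 */
#define PELT_Y	0.97857	/* 2^(-1/32) */

static double pelt_step(double util, int running)
{
	util *= PELT_Y;			/* decay: always */
	if (running)
		util += 1.0 - PELT_Y;	/* accrue: only when running */
	return util;
}

int main(void)
{
	double util = 1.0;		/* a 100% CFS task... */
	int i;

	/* ...preempted by RT for ~300ms: decay-only the whole time */
	for (i = 0; i < 300; i++)
		util = pelt_step(util, 0);

	/* prints ~0.002: the signal has all but vanished */
	printf("util after 300ms of preemption: %.3f\n", util);
	return 0;
}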
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;
 
 struct sched_statistics {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
+
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
The way I was thinking about it was: none of the definitions of sched_avg should hold while RT preempts CFS. Basically the whole sched_avg gets frozen (all of its fields) during this preemption; essentially we pause the load tracking of CFS in such a case. Do you have a scenario in mind where freezing the load_avg may not be desirable?
I _think_ this proposal has (at least) the same issue as my previous one: it inflates a task's utilization by freezing (i.e. ignoring) preemption time, thus shrinking the task's observed periods as a consequence.
Vincent came up with a really good explanation here: <20180606103809.GG31675@e110439-lin> https://lkml.org/lkml/2018/6/6/256
Can you please check more carefully whether the above reasoning / explanation matches your proposal too?
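(To make the inflation argument concrete with made-up numbers: take a task that runs 3ms out of every 10ms, i.e. ~30% utilization. If an RT task preempts it for 5ms of each period and that time is frozen out of the signal, PELT effectively observes 3ms of running inside a 5ms window, so util_avg converges toward 3/5 = 60% rather than 30%: ignoring preemption time shrinks the apparent period and inflates the task's utilization.)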
I also think that this patch has the same problem as Patrick's proposal, as explained in https://lkml.org/lkml/2018/6/6/256. You also skip the full PELT update, which includes load_avg/runnable_load_avg and not only util_avg, but those signals are supposed to keep growing during preemption, since the task is still runnable. And you break the synchronization of the PELT signals along the cgroup path, because intermediate cfs_rq's and se's will continue to update their signals.
Vincent
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a
+	 * higher class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
This happens only for tasks and root cfs_rq's, not for the cfs_rq's and se's representing task groups. That gives you synchronization issues when se's are added to / removed from cfs_rq's.
Yes, I was aware of this, and I agree more work is needed to make this patch a real thing, but I just wanted to discuss the basic idea with this RFC first.
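(One illustrative, untested direction for the task-group case, assuming the freeze_updates flag from the patch above, would be to set it at every level of the hierarchy instead of only on the task's se and the root cfs_rq. Sketch only, hypothetical helper name:)

/*
 * Hypothetical sketch, not from the patch: freeze every se/cfs_rq on
 * the task's cgroup path so intermediate levels stop updating too.
 * pick_next_task_fair() would need the symmetric unfreeze walk, and
 * this still doesn't handle entities migrating onto a frozen cfs_rq.
 */
static void freeze_cfs_hierarchy(struct task_struct *prev)
{
	struct sched_entity *se = &prev->se;

	for_each_sched_entity(se) {
		se->avg.freeze_updates = 1;
		cfs_rq_of(se)->avg.freeze_updates = 1;
	}
}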
In my proposal I was not touching the PELT utilization signal, just tracking a temporary signal for a RUNNABLE task. Which, despite being fundamentally broken (because of the explanation above), was at least not invasive wrt the propagation of RQ's and TG's signals.
What you propose here seems to me a completely different signal, no longer utilization, whose properties should (eventually) be carefully defined... starting from its propagation across TG's... and provided it really makes sense at all.
TBH, WALT does a great job at task placement and OPP selection also because it tracks only the root RQ and the tasks' SEs. Your proposal should eventually go in that direction... but I still think you are coming up with a signal that inflates real task needs, although not as aggressively as (unscaled) load, which tracks RUNNABLE time.
I am also curious what Vincent thinks. The thing is, even if we have signals from the higher classes, the CFS task will still look small due to this unintended preempt-decay, which I think may have unwanted consequences.
That's not always true: if we have collected some util_est samples, the estimated utilization is unaffected by preemption time.
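(Roughly why: util_est only samples util_avg when the task actually goes to sleep, so whatever decays during a preemption, while the task is still runnable, never feeds the estimate. A simplified standalone sketch of that idea, not the kernel's exact util_est code:)

struct util_est_sketch {
	unsigned long	ewma;		/* long-term estimate */
	unsigned long	enqueued;	/* last sample taken */
};

/*
 * Called only at dequeue (task going to sleep), never while the task
 * is merely preempted. The EWMA moves 1/4 of the way toward the new
 * sample, mirroring the kernel's weight shift of 2.
 */
static void util_est_sample(struct util_est_sketch *ue, unsigned long util_avg)
{
	long diff = (long)util_avg - (long)ue->ewma;

	ue->enqueued = util_avg;
	ue->ewma += diff / 4;
}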
thanks,
- Joel
--
#include <best/regards.h>

Patrick Bellasi