Hi Joel,
On 14 June 2018 at 11:26, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
On 13-Jun 21:26, Joel Fernandes wrote:
Hi Dietmar,
On Mon, Jun 11, 2018 at 02:44:47PM +0200, Dietmar Eggemann wrote:
On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch, just to discuss preventing the decay of a CPU's/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with an RT task preempting a 100% CFS task for 300ms: https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly, even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair() if the task was running when put_prev_task() happened, and to unmark it from pick_next_task_fair(). This approach also keeps the solution within CFS, without modification to other classes. What do you think?
Note:
- Just for demo/trial, I used a full int in sched_avg. I am open to any ideas on a flag we can set; all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact it could be affected by this issue itself in case the task's util_est EWMA hasn't built up yet.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of() etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental property shouldn't be changed, especially if you do this 'time warping' only for a subset of the involved entities (in this case tasks and root cfs_rq's).
Ok.
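(For readers following along: the 'accrue+decay' rule above boils down to roughly the following. This is a minimal userspace sketch of the PELT idea, not the kernel's actual ___update_load_avg() code; frequency/capacity scaling and sub-period handling are omitted, and the constants are approximations.)

#include <stdio.h>

/*
 * Sketch of the PELT update rule: once per ~1ms period the signal
 * decays by y (where y^32 == 0.5, i.e. y ~= 0.97857), and a full
 * period's contribution accrues only if the entity was running.
 */
#define PELT_Y	0.97857	/* 2^(-1/32) */

static double pelt_step(double util, int running)
{
	util *= PELT_Y;			/* decay: always */
	if (running)
		util += 1.0 - PELT_Y;	/* accrue: only when running */
	return util;
}

int main(void)
{
	double util = 1.0;		/* a 100% CFS task... */
	int i;

	/* ...preempted by RT for ~300ms: decay-only the whole time */
	for (i = 0; i < 300; i++)
		util = pelt_step(util, 0);

	/* prints ~0.002: the signal has all but vanished */
	printf("util after 300ms of preemption: %.3f\n", util);
	return 0;
}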
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;
 
 struct sched_statistics {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
+
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
The way I was thinking about it was: none of the definitions of sched_avg should hold while RT preempts CFS. Basically the whole sched_avg gets frozen (all of its fields) during this preemption; essentially we pause the load tracking of CFS in such a case. Do you have a scenario in mind where freezing the load_avg may not be desirable?
I _think_ this proposal has (at least) the same issue as my previous one: it inflates a task's utilization by freezing (i.e. ignoring) preemption time, thus shrinking the task's observed periods as a consequence.
Vincent came up with a really good explanation here: <20180606103809.GG31675@e110439-lin> https://lkml.org/lkml/2018/6/6/256
Can you please check more carefully whether the above reasoning / explanation matches your proposal too?
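(To make the inflation argument concrete with made-up numbers: take a task that runs 3ms out of every 10ms, i.e. ~30% utilization. If an RT task preempts it for 5ms of each period and that time is frozen out of the signal, PELT effectively observes 3ms of running inside a 5ms window, so util_avg converges toward 3/5 = 60% rather than 30%: ignoring preemption time shrinks the apparent period and inflates the task's utilization.)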
I also think that this patch has the same problem as Patrick's proposal, as explained in https://lkml.org/lkml/2018/6/6/256. You also skip the full PELT update, which includes load_avg/runnable_load_avg and not only util_avg, but those signals are supposed to keep growing during preemption, since the task is still runnable. And you break the synchronization of the PELT signals along the cgroup path, because intermediate cfs_rq's and se's will continue to update their signals.
Vincent
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a
+	 * higher class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
This happens only for tasks and root cfs_rq's, not for the cfs_rq's and se's representing task groups. That gives you synchronization issues when se's are added to / removed from cfs_rq's.
Yes, I was aware of this, and I agree more work is needed to make this patch a real thing, but I just wanted to discuss the basic idea with this RFC first.
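(One illustrative, untested direction for the task-group case, assuming the freeze_updates flag from the patch above, would be to set it at every level of the hierarchy instead of only on the task's se and the root cfs_rq. Sketch only, hypothetical helper name:)

/*
 * Hypothetical sketch, not from the patch: freeze every se/cfs_rq on
 * the task's cgroup path so intermediate levels stop updating too.
 * pick_next_task_fair() would need the symmetric unfreeze walk, and
 * this still doesn't handle entities migrating onto a frozen cfs_rq.
 */
static void freeze_cfs_hierarchy(struct task_struct *prev)
{
	struct sched_entity *se = &prev->se;

	for_each_sched_entity(se) {
		se->avg.freeze_updates = 1;
		cfs_rq_of(se)->avg.freeze_updates = 1;
	}
}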
In my proposal I was not touching the PELT utilization signal, just tracking a temporary signal for a RUNNABLE task. Which, despite being fundamentally broken (because of the explanation above), was at least not invasive wrt the propagation of RQ's and TG's signals.
What you propose here seems to me a completely different signal, no longer utilization, whose properties should (eventually) be carefully defined... starting from its propagation across TG's... and provided it really makes sense at all.
TBH, WALT does a great job at task placement and OPP selection also because it tracks only the root RQ and the tasks' SEs. Your proposal should eventually go in that direction... but I still think you are coming up with a signal that inflates real task needs, although not as aggressively as (unscaled) load, which tracks RUNNABLE time.
I am also curious what Vincent thinks. The thing is, even if we have signals from the higher classes, the CFS task will still look small due to this unintended preempt-decay, which I think may have unwanted consequences.
That's not always true: if we have collected some util_est samples, the estimated utilization is unaffected by preemption time.
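(Roughly why: util_est only samples util_avg when the task actually goes to sleep, so whatever decays during a preemption, while the task is still runnable, never feeds the estimate. A simplified standalone sketch of that idea, not the kernel's exact util_est code:)

struct util_est_sketch {
	unsigned long	ewma;		/* long-term estimate */
	unsigned long	enqueued;	/* last sample taken */
};

/*
 * Called only at dequeue (task going to sleep), never while the task
 * is merely preempted. The EWMA moves 1/4 of the way toward the new
 * sample, mirroring the kernel's weight shift of 2.
 */
static void util_est_sample(struct util_est_sketch *ue, unsigned long util_avg)
{
	long diff = (long)util_avg - (long)ue->ewma;

	ue->enqueued = util_avg;
	ue->ewma += diff / 4;
}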
thanks,
- Joel
--
#include <best/regards.h>

Patrick Bellasi