From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair if the task was running while the put_prev_task() happened and unmark it from pick_next_task_fair. This approach also keeps the solution of this within CFS without modification to other classes. What do you think?
Note:
- Just for demo/trial, I took a full int from sched_avg. I am open to any ideas on a flag we can set; all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact could be affected by this issue itself in case the task's util-est EWMA hasn't built up.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of etc. This patch solves a different issue.
CC: "eas-dev" eas-dev@lists.linaro.org
CC: vincent.guittot@linaro.org
CC: patrick.bellasi@arm.com
CC: dietmar.eggemann@arm.com
CC: chris.redpath@arm.com
CC: juri.lelli@redhat.com
CC: tkjos@google.com
Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;

 struct sched_statistics {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;

+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
+
 	scale_freq = arch_scale_freq_capacity(cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

@@ -6963,6 +6970,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	p = task_of(se);

 done: __maybe_unused;
+
 #ifdef CONFIG_SMP
 	/*
 	 * Move the next running task to the front of
@@ -6975,6 +6983,8 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);

+	rq->cfs.avg.freeze_updates = 0;
+	p->se.avg.freeze_updates = 0;
 	return p;

 idle:
@@ -6991,6 +7001,7 @@ done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;

+	rq->cfs.avg.freeze_updates = 0;
 	return NULL;
 }

@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
+	} else {
+		rq->cfs.avg.freeze_updates = 0;
+		prev->se.avg.freeze_updates = 0;
+	}
 }

 /*
--
2.18.0.rc1.242.g61856ae69a-goog
On Fri, Jun 08, 2018 at 08:54:57AM -0700, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
Just to clarify, the issue is:
If a long-running CFS task is preempted briefly by RT, then on return from RT its util_avg will have crashed. The correct behavior IMO is that the util_avg should not change and just continue accounting from where we left off.
thanks,
- Joel
Hi Joel,
On Friday 08 Jun 2018 at 08:58:39 (-0700), Joel Fernandes wrote:
On Fri, Jun 08, 2018 at 08:54:57AM -0700, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
Just to clarify, the issue is:
If a long-running CFS task is preempted briefly by RT, then on return from RT its util_avg will have crashed. The correct behavior IMO is that the util_avg should not change and just continue accounting from where we left off.
Yeah, so this idea looks a little bit like Patrick's proposal from one week ago or so. I think there is an issue that was mentioned with an example like this:
Only 1 CPU in the system, 1 CFS task (50%), 1 sporadic RT task
_: sleeping
R: running
#: preempted

Scenario 1:
RT : ____________RRRR____________
CFS: __RRRRRRRR________RRRRRRRR__
               |      |
               t1     t2

Scenario 2:
RT : ____RRRR____________________
CFS: __RR####RRRRRR____RRRRRRRR__
         |   |     |   |
         t3  t4    t5  t6
So in the scenario 1, (t2-t1) is the time during which the CFS task decays. In scenario 2, IIUC what you propose to do is to remove (t4-t3) from the decay time. So the task is decayed only for (t6-t5). So, even if you don't increase the util_avg between t3 and t4, the task will look bigger than it really is because you decay it less.
With the current implementation, you would decay the task for (t4-t3) and (t6-t5), which is correct for this 50% task. The fact that the util_avg of the task decays during (t4-t3) doesn't matter, since you would have decayed the task for it at the end of its execution anyway if it wasn't preempted.
What I say above is true only if there is at least a little of idle time in the system, I think.
Does that make any sense ?
Thanks, Quentin
Hi Quentin,
On Mon, Jun 11, 2018 at 09:40:04AM +0100, Quentin Perret wrote:
Hi Joel,
On Friday 08 Jun 2018 at 08:58:39 (-0700), Joel Fernandes wrote:
On Fri, Jun 08, 2018 at 08:54:57AM -0700, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
Just to clarify, the issue is:
If a long-running CFS task is preempted briefly by RT, then on return from RT its util_avg will have crashed. The correct behavior IMO is that the util_avg should not change and just continue accounting from where we left off.
Yeah, so this idea looks a little bit like Patrick's proposal from one week ago or so. I think there is an issue that was mentioned with an
I don't think this patch is similar to that.. my patch only affects CFS tasks being preempted by higher class. For CFS preempting CFS tasks, the intention is that the behavior should remain the same.
example like this:
Only 1 CPU in the system, 1 CFS task (50%), 1 sporadic RT task
_: sleeping
R: running
#: preempted

Scenario 1:
RT : ____________RRRR____________
CFS: __RRRRRRRR________RRRRRRRR__
               |      |
               t1     t2

Scenario 2:
RT : ____RRRR____________________
CFS: __RR####RRRRRR____RRRRRRRR__
         |   |     |   |
         t3  t4    t5  t6
So in the scenario 1, (t2-t1) is the time during which the CFS task decays. In scenario 2, IIUC what you propose to do is to remove (t4-t3) from the decay time. So the task is decayed only for (t6-t5). So, even if you don't increase the util_avg between t3 and t4, the task will look bigger than it really is because you decay it less.
I don't see how. The task shouldn't be decayed between t3 and t4 unless it went to sleep. Say the RT task is not running, the CFS task's utilization should look like the same as if the RT task WAS running... Just because RT task ran, that shouldn't change utilization of the CFS task.. so scenario 2 should only decay between t5 and t6 whether the RT task ran or not. Between t3 and t4, my patch attempts to not change the signal at all for the CFS task and hit the pause button on it. And scenario 1 isn't affected by my patch.
With the current implementation, you would decay the task for (t4-t3) and (t6-t5), which is correct for this 50% task. The fact that the util_avg of the task decays during (t4-t3) doesn't matter, since you would have decayed the task for it at the end of its execution anyway if it wasn't preempted.
What I say above is true only if there is at least a little of idle time in the system, I think.
Does that make any sense ?
Sorry, no :(
I think your example is different from Vincent's. Vincent was talking about some of the "sleep time" of a task actually being counted as "preempted time" which would prevent its decay and cause it to look big. Basically if a task is sleeping, then it has to be decayed. The patch I wrote attempts to ensure that if the task is sleeping, then it will be decayed - unless you can show how this invariant isn't satisfied with the patch?
thanks!
- Joel
Hi Joel,
On Wednesday 13 Jun 2018 at 20:58:34 (-0700), Joel Fernandes wrote:
Hi Quentin,
On Mon, Jun 11, 2018 at 09:40:04AM +0100, Quentin Perret wrote:
Hi Joel,
On Friday 08 Jun 2018 at 08:58:39 (-0700), Joel Fernandes wrote:
On Fri, Jun 08, 2018 at 08:54:57AM -0700, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
Just to clarify, the issue is:
If a long-running CFS task is preempted briefly by RT, then on return from RT its util_avg will have crashed. The correct behavior IMO is that the util_avg should not change and just continue accounting from where we left off.
Yeah, so this idea looks a little bit like Patrick's proposal from one week ago or so. I think there is an issue that was mentioned with an
I don't think this patch is similar to that.. my patch only affects CFS tasks being preempted by higher class. For CFS preempting CFS tasks, the intention is that the behavior should remain the same.
example like this:
Only 1 CPU in the system, 1 CFS task (50%), 1 sporadic RT task
_: sleeping
R: running
#: preempted

Scenario 1:
RT : ____________RRRR____________
CFS: __RRRRRRRR________RRRRRRRR__
               |      |
               t1     t2

Scenario 2:
RT : ____RRRR____________________
CFS: __RR####RRRRRR____RRRRRRRR__
         |   |     |   |
         t3  t4    t5  t6
So in the scenario 1, (t2-t1) is the time during which the CFS task decays. In scenario 2, IIUC what you propose to do is to remove (t4-t3) from the decay time. So the task is decayed only for (t6-t5). So, even if you don't increase the util_avg between t3 and t4, the task will look bigger than it really is because you decay it less.
I don't see how. The task shouldn't be decayed between t3 and t4 unless it went to sleep. Say the RT task is not running, the CFS task's utilization should look like the same as if the RT task WAS running... Just because RT task ran, that shouldn't change utilization of the CFS task..
It doesn't, that's my point :-)
In scenario 1, the task runs for 8 units of time (1 unit == 1 character) and sleeps during 8 units of time. So the util_avg grows during 8 units of time, and is decayed during 8 units of time, which means the true util of the task is 50%.
Now, for scenario 2, if we decay the task util during preemption, the util_avg grows during the 8 units of time where the task runs, and decays during the preemption time and the sleep time, which is 8 units of time as well. So that's actually the correct value for that task. The thing is, the time during which the task is preempted would have been sleep time if the task wasn't preempted; that's why you should count it as such.
If we apply your approach, in scenario 2 the util_avg of the task will grow for 8 units of time, and decay during the 4 units of sleep time, so you will see the task as 66%, which is incorrect. The correct utilization is 50%, because this is what you would get if there was no preemption.
Is it any clearer ?
so scenario 2 should only decay between t5 and t6 whether the RT task ran or not. Between t3 and t4, my patch attempts to not change the signal at all for the CFS task and hit the pause button on it. And scenario 1 isn't affected by my patch.
With the current implementation, you would decay the task for (t4-t3) and (t6-t5), which is correct for this 50% task. The fact that the util_avg of the task decays during (t4-t3) doesn't matter, since you would have decayed the task for it at the end of its execution anyway if it wasn't preempted.
What I say above is true only if there is at least a little of idle time in the system, I think.
Does that make any sense ?
Sorry, no :(
Arf, I just re-read my email this morning and it wasn't very clear indeed ... Sorry about that. I hope my message above will help clarify what I meant :-)
Thanks ! Quentin
I think your example is different from Vincent's. Vincent was talking about some of the "sleep time" of a task actually being counted as "preempted time" which would prevent its decay and cause it to look big. Basically if a task is sleeping, then it has to be decayed. The patch I wrote attempts to ensure that if the task is sleeping, then it will be decayed - unless you can show how this invariant isn't satisfied with the patch?
thanks!
- Joel
On June 14, 2018 1:10:32 AM PDT, Quentin Perret quentin.perret@arm.com wrote:
Hi Joel,
On Wednesday 13 Jun 2018 at 20:58:34 (-0700), Joel Fernandes wrote:
Hi Quentin,
On Mon, Jun 11, 2018 at 09:40:04AM +0100, Quentin Perret wrote:
Hi Joel,
On Friday 08 Jun 2018 at 08:58:39 (-0700), Joel Fernandes wrote:
On Fri, Jun 08, 2018 at 08:54:57AM -0700, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.

To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.

Just to clarify, the issue is:

If a long-running CFS task is preempted briefly by RT, then on return from RT its util_avg will have crashed. The correct behavior IMO is that the util_avg should not change and just continue accounting from where we left off.

Yeah, so this idea looks a little bit like Patrick's proposal from one week ago or so. I think there is an issue that was mentioned with an

I don't think this patch is similar to that.. my patch only affects CFS tasks being preempted by a higher class. For CFS preempting CFS tasks, the intention is that the behavior should remain the same.

example like this:

Only 1 CPU in the system, 1 CFS task (50%), 1 sporadic RT task
_: sleeping
R: running
#: preempted

Scenario 1:
RT : ____________RRRR____________
CFS: __RRRRRRRR________RRRRRRRR__
               |      |
               t1     t2

Scenario 2:
RT : ____RRRR____________________
CFS: __RR####RRRRRR____RRRRRRRR__
         |   |     |   |
         t3  t4    t5  t6

So in the scenario 1, (t2-t1) is the time during which the CFS task decays. In scenario 2, IIUC what you propose to do is to remove (t4-t3) from the decay time. So the task is decayed only for (t6-t5). So, even if you don't increase the util_avg between t3 and t4, the task will look bigger than it really is because you decay it less.

I don't see how. The task shouldn't be decayed between t3 and t4 unless it went to sleep. Say the RT task is not running, the CFS task's utilization should look like the same as if the RT task WAS running... Just because RT task ran, that shouldn't change utilization of the CFS task..
It doesn't, that's my point :-)
In scenario 1, the task runs for 8 units of time (1 unit == 1 character) and sleeps during 8 units of time. So the util_avg grows during 8 units of time, and is decayed during 8 unit of times, which means the true util of the task is 50%.
Now, for scenario 2, if we decay the task util during preemption, the util_avg grows during the 8 units of time where the tasks runs, and decays during the preemption time and the sleep time, which is 8 units of time as well. So that's actually the correct value for that task. The thing is, the time during which the task is preempted would have been sleep time if the task wasn't preempted, that's why you should count it as such.
If we apply your approach, in scenario 2 the util_avg of the task will grow for 8 units of time, and decay during the 4 units of sleep time, so you will see the task as 66%, which is incorrect. The correct utilization is 50%, because this is what you would get if there was no preemption.
Is it any clearer ?
Yes, it is clearer now. I see what you mean. I think such a problem will mostly affect periodic tasks which sleep until their next activation, but it's still an issue. Sorry I missed this point. Thanks to you, Vincent, and everyone for explaining it.

- Joel
so scenario 2 should only decay between t5 and t6 whether the RT task ran or not. Between t3 and t4, my patch attempts to not change the signal at all for the CFS task and hit the pause button on it. And scenario 1 isn't affected by my patch.
With the current implementation, you would decay the task for (t4-t3) and (t6-t5), which is correct for this 50% task. The fact that the util_avg of the task decays during (t4-t3) doesn't matter, since you would have decayed the task for it at the end of its execution anyway if it wasn't preempted.

What I say above is true only if there is at least a little of idle time in the system, I think.

Does that make any sense ?
Sorry, no :(
Arf, I just re-read my email this morning and it wasn't very clear indeed ... Sorry about that. I hope my message above will help clarify what I meant :-)
Thanks ! Quentin
I think your example is different from Vincent's. Vincent was talking about some of the "sleep time" of a task actually being counted as "preempted time" which would prevent its decay and cause it to look big. Basically if a task is sleeping, then it has to be decayed. The patch I wrote attempts to ensure that if the task is sleeping, then it will be decayed - unless you can show how this invariant isn't satisfied with the patch?
thanks!
- Joel
On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair if the task was running while the put_prev_task() happened and unmark it from pick_next_task_fair. This approach also keeps the solution of this within CFS without modification to other classes. What do you think?
Note:
- Just for demo/trial, I took a full int from sched_avg. I am open to any ideas on a flag we can set, all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact could be affected by this issue itself in case the task's util-est EWMA hasn't built up.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental thing shouldn't be changed. Especially if you do this 'time warping' only for a subset of involved entities (in this case tasks and root cfs_rq's).
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;

 struct sched_statistics {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;

+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
Only for tasks and root cfs_rq's! Not for cfs_rq's and se's representing task groups. Synchronization issues when we add/subtract se's into cfs_rq's.
[...]
Hi Dietmar,
On Mon, Jun 11, 2018 at 02:44:47PM +0200, Dietmar Eggemann wrote:
On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair if the task was running while the put_prev_task() happened and unmark it from pick_next_task_fair. This approach also keeps the solution of this within CFS without modification to other classes. What do you think?
Note:
- Just for demo/trial, I took a full int from sched_avg. I am open to any ideas on a flag we can set, all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact could be affected by this issue itself in case the task's util-est EWMA hasn't built up.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental thing shouldn't be changed. Especially if you do this 'time warping' only for a subset of involved entities (in this case tasks and root cfs_rq's).
Ok.
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;

 struct sched_statistics {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;

+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
The way I was thinking about it was, none of the definitions of sched_avg should hold when RT preempts CFS. Basically the sched_avg gets frozen (all entries of it during this preemption). Essentially we pause the load-tracking of CFS in such case. Do you have a scenario in mind where freezing of the load_avg may not be desirable?
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
Only for tasks and root cfs_rq's! Not for cfs_rq's and se's representing task groups. Synchronization issues when we add/subtract se's into cfs_rq's.
Yes, I was aware of this and agree more work is needed to make this patch a thing, but I just wanted to discuss the basic idea with this RFC patch first.
I am also curious what Vincent thinks. The thing is, even if we have signals from higher classes, the CFS task will still look little due to this unintended preempt-decay, which I think may have unintended consequences.
thanks,
- Joel
On 13-Jun 21:26, Joel Fernandes wrote:
Hi Dietmar,
On Mon, Jun 11, 2018 at 02:44:47PM +0200, Dietmar Eggemann wrote:
On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair if the task was running while the put_prev_task() happened and unmark it from pick_next_task_fair. This approach also keeps the solution of this within CFS without modification to other classes. What do you think?
Note:
- Just for demo/trial, I took a full int from sched_avg. I am open to any ideas on a flag we can set, all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on the task sleeping, and in fact could be affected by this issue itself in case the task's util-est EWMA hasn't built up.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental thing shouldn't be changed. Especially if you do this 'time warping' only for a subset of involved entities (in this case tasks and root cfs_rq's).
Ok.
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;

 struct sched_statistics {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;

+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
The way I was thinking about it was, none of the definitions of sched_avg should hold when RT preempts CFS. Basically the sched_avg gets frozen (all entries of it during this preemption). Essentially we pause the load-tracking of CFS in such case. Do you have a scenario in mind where freezing of the load_avg may not be desirable?
I _think_ this proposal has (at least) the same issue as my previous one: it inflates tasks' utilization by freezing (i.e. ignoring) preemption time, and thus reducing by consequence the task periods.

Vincent came up with a really good explanation here: 20180606103809.GG31675@e110439-lin https://lkml.org/lkml/2018/6/6/256

Can you please check whether the above reasoning / explanation matches your proposal too?
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
Only for tasks and root cfs_rq's! Not for cfs_rq's and se's representing task groups. Synchronization issues when we add/subtract se's into cfs_rq's.
Yes, I was aware of this and agree more work is needed to make this patch a thing, but I just wanted to discuss the basic idea with this RFC patch first.
In my proposal I was not touching the PELT utilization signal, just tracking a temporary signal for a RUNNABLE task. Which, despite being fundamentally broken (because of the explanation above), was at least not invasive wrt RQ's and TG's signals propagation.
What you propose here it seems to me a completely different signal, not more utilization, which properties should be (eventually) carefully defined... starting from its propagation across TG's... and provided it really makes any sense.
TBH, WALT does a great job in task placement and OPP selection also because it tracks only the root RQ and the task's SE. Your proposal should eventually go in that direction... but still I think you are coming up with a signal which inflates real task needs, although not being as aggressive as (unscaled) load, which tracks RUNNABLE time.
I am also curious what Vincent thinks. The thing is, even if we have signals from higher classes, the CFS task will still look little due to this unintended preempt-decay, which I think may have unintended consequences.
That's not always true: if we collected some util_est samples, the estimated utilization is unaffected by preemption time.
thanks,
- Joel
--
#include <best/regards.h>

Patrick Bellasi
Hi Joel,
On 14 June 2018 at 11:26, Patrick Bellasi patrick.bellasi@arm.com wrote:
On 13-Jun 21:26, Joel Fernandes wrote:
Hi Dietmar,
On Mon, Jun 11, 2018 at 02:44:47PM +0200, Dietmar Eggemann wrote:
On 06/08/2018 05:54 PM, Joel Fernandes wrote:
From: "Joel Fernandes (Google)" joel@joelfernandes.org
Here's a very rough patch just to discuss prevention of decay of a CPU/task's util_avg signal in case it's preempted by RT or DL. It's likely not correct and needs more work, but it solves the issue I see with my synthetic test.
To reproduce the issue, I wrote a synthetic rt-app test with RT task preempting a 100% CFS task for 300ms. https://pastebin.com/raw/rXNmRUZY I have seen in traces that the util_avg decays quickly even before the RT task sleeps.
The idea is to mark the CFS class from put_prev_task_fair() if the task was running when the put_prev_task() happened, and unmark it from pick_next_task_fair(). This approach also keeps the solution within CFS, without modification to other classes. What do you think?
Note:
- Just for demo/trial, I took a full int from sched_avg. I am open to any ideas on a flag we can set, all that's needed is 1 bit.
- The current UTIL_EST design cannot prevent this issue because it relies on tasks sleeping, and in fact could be affected by this issue itself in case a task's util-est EWMA hasn't built up.
- Vincent's patches for PELT in DL/RT are still needed for correct OPP selection and estimation of capacity_of etc. This patch solves a different issue.
Utilization (util_avg) is defined as 'accrue+decay' when running and 'decay' when !running. IMHO, this fundamental thing shouldn't be changed. Especially if you do this 'time warping' only for a subset of involved entities (in this case tasks and root cfs_rq's).
Ok.
[...]
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..46b686192641 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
+	int				freeze_updates;
 } ____cacheline_aligned;

 struct sched_statistics {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09b7eb69802c..521656b05b22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3103,6 +3103,13 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;

+	/*
+	 * If an update is supposed to be skipped, then do nothing.
+	 * sa->last_update_time should still move forward.
+	 */
+	if (sa->freeze_updates)
+		return 0;
So load(_avg) and runnable_load(_avg) would be affected as well. Why?
The way I was thinking about it was that none of the definitions of sched_avg should hold while RT preempts CFS. Basically the whole sched_avg gets frozen (all entries of it) during this preemption; essentially we pause the load tracking of CFS in such a case. Do you have a scenario in mind where freezing the load_avg may not be desirable?
I _think_ this proposal has (at least) the same issue as my previous one: it inflates a task's utilization by freezing (i.e. ignoring) preemption time, and thus by consequence reduces the task's periods.
Vincent came up with a really good explanation here: 20180606103809.GG31675@e110439-lin https://lkml.org/lkml/2018/6/6/256
Can you please check whether the above reasoning / explanation applies to your proposal too?
I also think that this patch has the same problem as Patrick's proposal, as explained in https://lkml.org/lkml/2018/6/6/256. You also skip the full PELT update, which includes load_avg/runnable_load_avg, but these signals are supposed to increase during preemption. And you break the sync of the PELT signals along the cgroup path, as intermediate cfs_rq's and se's will continue to update their signals.
Vincent
[...]
@@ -7006,6 +7017,18 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}

+	/*
+	 * Task is being put because of preemption either by CFS or by a higher
+	 * class. Make sure no util updates happen.
+	 */
+	if (!prev->state) { /* Task is put while running */
+		rq->cfs.avg.freeze_updates = 1;
+		prev->se.avg.freeze_updates = 1;
Only for tasks and root cfs_rq's! Not for cfs_rq's and se's representing task groups. Synchronization issues when we add/subtract se's into cfs_rq's.
Yes, I was aware of this and agree more work is needed to make this patch a thing, but I just wanted to discuss the basic idea with this RFC patch first.
In my proposal I was not touching the PELT utilization signal, just tracking a temporary signal for a RUNNABLE task. Which, despite being fundamentally broken (because of the explanation above), was at least not invasive wrt RQ's and TG's signals propagation.
What you propose here seems to me a completely different signal, no longer utilization, whose properties should (eventually) be carefully defined... starting from its propagation across TG's... and provided it really makes sense at all.
TBH, WALT does a great job in task placement and OPP selection also because it tracks only the root RQ and the tasks' SEs. Your proposal should eventually go in that direction... but I still think you are coming up with a signal which inflates real task needs, although not as aggressively as (unscaled) load, which tracks RUNNABLE time.
I am also curious what Vincent thinks. The thing is, even if we have signals from the higher classes, the CFS task will still look small due to this unintended preempt-decay, which I think may have unintended consequences.
That's not always true: once we have collected some util_est samples, the estimated utilization is unaffected by preemption time.
thanks,
- Joel
-- #include <best/regards.h>
Patrick Bellasi
On Thu, Jun 14, 2018 at 10:26 AM, Patrick Bellasi patrick.bellasi@arm.com wrote:
[...]
I _think_ this proposal has (at least) the same issue as my previous one: it inflates a task's utilization by freezing (i.e. ignoring) preemption time, and thus by consequence reduces the task's periods.
Vincent came up with a really good explanation here: 20180606103809.GG31675@e110439-lin https://lkml.org/lkml/2018/6/6/256
Can you please check whether the above reasoning / explanation applies to your proposal too?
I am a newbie here, so apologies in advance if my question is silly or doesn't make any sense.
Why can't we introduce a new 'milder' decay function for this case?
For example
Task 1
state       RRRRSSSSSSERRRRSSSSSSERRRRSSSSSS
util_avg    AAAADDDDDD AAAADDDDDD AAAADDDDDD

Task 2
state       WWWWRRRRSSEWWWWRRRRSSEWWWWRRRRSS
util_avg    DDDDAAAADD DDDDAAAADD DDDDAAAADD
running_avg AAAADDC     AAAADDC    AAAADD
We can have the
running_avg ddddAAAADDC [.....]
Where d is a different decay function than D. d can be defined such that y^32 is 0.75, for instance, so we decay at a slower rate. Or it can be non-linear and decay at a faster rate as time passes.
e.g
- For d[t < 8ms]:   y^32 = 0.7S
- For d[t < 16ms]:  y^32 = 0.6S
- For d[t >= 32ms]: y^32 = 0.5S
Where S is the last signal value before preemption.
We can also introduce a new A function, a, such that it boosts the signal for the N ticks the task was preempted for
a[n] = x[n]S
Where
x[n] = 1 for n >= N
x[n] = 1.25 (for example) for 0 <= n < N
If this makes sense and is not total BS, I suspect *only* boosting the accumulated signal after preemption will have less knock-on effect on the rest of the system. IOW, better to leave the D function intact/coherent.
running_avg DDDDaaaaDDC [.....]
Thanks
-- Qais Yousef
On June 19, 2018 6:16:20 AM PDT, Qais Yousef qais.yousef.arm@gmail.com wrote:
On Thu, Jun 14, 2018 at 10:26 AM, Patrick Bellasi patrick.bellasi@arm.com wrote:
[...]
I _think_ this proposal has (at least) the same issue of my previous one: it inflates tasks utilization by freezing (i.e. ignoring) preemption time and thus reducing by consequence the task periods.
Vincent come up with a really good exaplaination here: 20180606103809.GG31675@e110439-lin https://lkml.org/lkml/2018/6/6/256
Can you please better check if the above reasoning / explaination matches your proposal too?
I am a newbie here, so apologies in advance if my question is silly or doesn't make any sense.
Why can't we introduce a new 'milder' decay function for this case?
For example
Task 1
state       RRRRSSSSSSERRRRSSSSSSERRRRSSSSSS
util_avg    AAAADDDDDD AAAADDDDDD AAAADDDDDD

Task 2
state       WWWWRRRRSSEWWWWRRRRSSEWWWWRRRRSS
util_avg    DDDDAAAADD DDDDAAAADD DDDDAAAADD
running_avg AAAADDC     AAAADDC    AAAADD
We can have the
running_avg ddddAAAADDC [.....]
Where d is a different decay function than D. d can be defined such that y^32 is 0.75, for instance, so we decay at a slower rate. Or it can be non-linear and decay at a faster rate as time passes.
I don't think so... The problem is not about the rate of decay but about decaying vs not decaying. Vincent's example proves that you have to decay when a CFS task is not running (whether preempted or sleeping).
J.
e.g
- For d[t < 8ms] y^32 = 0.7S
- For d[t < 16ms] y^32 = 0.6S
- For d[t >= 32ms] y^32 = 0.5S
Where S is the last signal value before preemption.
We can also introduce a new A function, a, such that it boosts the signal for the N ticks the task was preempted for
a[n] = x[n]S
Where
x[n] = 1 for n >= N
x[n] = 1.25 (for example) for 0 <= n < N
If this makes sense and is not total BS, I suspect *only* boosting the accumulated signal after preemption will have less knock-on effect on the rest of the system. IOW, better to leave the D function intact/coherent.
running_avg DDDDaaaaDDC [.....]
Thanks
-- Qais Yousef