On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
I think I follow - Leo please correct me if I mangle your intentions. It's an issue that Morten and Dietmar had mentioned to me as well.
Yes. We have been working on this issue for a while without getting to a nice solution yet.
Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a task group other than the root. The task migrates from one CPU to another. The cfs_rq.avg structures on the src and dest CPUs corresponding to the group containing the task are updated immediately via remove_entity_load_avg()/update_cfs_rq_load_avg() and attach_entity_load_avg(). But the cfs_rq.avg structures corresponding to the root on the src and dest CPUs are not immediately updated. The root cfs_rq.avg must catch up over time with PELT.
Yes. The problem is that only the cfs_rq.avg of the cfs_rq where the task is enqueued/dequeued gets immediately updated. If the cfs_rq is a group cfs_rq, its group entity's se.avg doesn't get updated immediately. It has to adapt over time at the pace defined by the geometric series. The impact of a task migration therefore doesn't trickle through to the root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang introduced with his rewrite of PELT.
As to why we care, there's at least one issue which may or may not be Leo's - the root cfs_rq is the one whose avg structure we read from to determine the frequency with schedutil. If you have a cpu-bound task in a non-root cgroup which periodically migrates among CPUs on an otherwise idle system, I believe each time it migrates the frequency would drop to fmin and have to ramp up again with PELT.
It makes any scheduling decision based on utilization difficult if fair group scheduling is used as cpu_util() doesn't give an up-to-date picture of any utilization caused by task in task groups.
For the energy-aware scheduling patches and patches we have in the pipeline for improving capacity awareness in the scheduler we rely on cpu_util().
Steve, thanks for the explanation; I totally agree. My initial purpose was not schedutil; I just did some basic analysis of utilization. My test case: fix the CPU frequency at the maximum of 1.2GHz, then launch a periodic task (period: 100ms, duty cycle: 40%) with its affinity limited to only two CPUs. I observed the same result you describe.
After applying this patch, I get a much better result for the CPU's utilization after the task's migration. Please see the parsed results for CPU utilization: http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.... http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg....
Leo, I noticed you did not modify detach_entity_load_avg(). I think this would be needed to avoid the task's stats being double-counted for a while after switched_from_fair() or task_move_group_fair().
I'm afraid that the solution to this problem is more complicated than that :-(
You are adding/removing a contribution from the root cfs_rq.avg which isn't part of that signal in the first place. The root cfs_rq.avg only contains the sum of the load/util of the sched_entities on the root cfs_rq. If you remove the task's contribution from there, you may end up double-accounting for the task migration: once due to your patch, and then again slowly over time as the group sched_entity starts reflecting that the task has migrated. Furthermore, for group scheduling to make sense it has to be the task_h_load() you add/remove; otherwise the group weighting is completely lost. Or am I completely misreading your patch?
I don't think the slow response time for _load_ is necessarily a big problem. Otherwise we would have had people complaining already about group scheduling being broken. It is however a problem for all the initiatives that built on utilization.
We have to either make the updates trickle through the group hierarchy for utilization, which is difficult without making a lot of changes to the current code structure, or introduce a new avg structure at the root level which contains the sum of the utilization of all _tasks_ (not groups) in the group hierarchy and maintain it separately.
None of those two are particularly pretty. Better suggestions?