On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
I think I follow - Leo please correct me if I mangle your intentions. It's an issue that Morten and Dietmar had mentioned to me as well.
Yes. We have been working on this issue for a while without getting to a nice solution yet.
Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a task group other than the root. The task migrates from one CPU to another. The cfs_rq.avg structures on the src and dest CPUs corresponding to the group containing the task are updated immediately via remove_entity_load_avg()/update_cfs_rq_load_avg() and attach_entity_load_avg(). But the cfs_rq.avg structures corresponding to the root on the src and dest CPUs are not immediately updated. The root cfs_rq.avg must catch up over time with PELT.
Yes. The problem is that only the cfs_rq.avg of the cfs_rq where the task is enqueued/dequeued gets immediately updated. If the cfs_rq is a group cfs_rq, its group entity's se.avg doesn't get updated immediately. It has to adapt over time at the pace defined by the geometric series. The impact of a task migration therefore doesn't trickle through to the root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang introduced with his rewrite of PELT.
As to why we care, there's at least one issue which may or may not be Leo's - the root cfs_rq is the one whose avg structure we read from to determine the frequency with schedutil. If you have a cpu-bound task in a non-root cgroup which periodically migrates among CPUs on an otherwise idle system, I believe each time it migrates the frequency would drop to fmin and have to ramp up again with PELT.
It makes any scheduling decision based on utilization difficult if fair group scheduling is used as cpu_util() doesn't give an up-to-date picture of any utilization caused by task in task groups.
For the energy-aware scheduling patches and patches we have in the pipeline for improving capacity awareness in the scheduler we rely on cpu_util().
Steve, thanks for the explanation; I totally agree. My initial purpose was not schedutil; I just did some basic analysis of utilization. My test case: fix the CPU frequency at the maximum of 1.2GHz, then launch a periodic task (period: 100ms, duty cycle: 40%) with its affinity limited to only two CPUs. I observed the same result you describe.
After applying this patch, I get a much better result for the CPU's utilization after the task's migration. Please see the parsed results for CPU utilization: http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.... http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg....
Leo, I noticed you did not modify detach_entity_load_avg(). I think this would be needed to avoid the task's stats being double-counted for a while after switched_from_fair() or task_move_group_fair().
I'm afraid that the solution to this problem is more complicated than that :-(
You are adding/removing a contribution from the root cfs_rq.avg which isn't part of that signal in the first place. The root cfs_rq.avg only contains the sum of the load/util of the sched_entities on the root cfs_rq. If you remove the task's contribution from there, you may end up double-accounting for the task migration: once due to your patch, and then again slowly over time as the group sched_entity starts reflecting that the task has migrated. Furthermore, for group scheduling to make sense it has to be the task_h_load() you add/remove; otherwise the group weighting is completely lost. Or am I completely misreading your patch?
I don't think the slow response time for _load_ is necessarily a big problem. Otherwise we would have had people complaining already about group scheduling being broken. It is however a problem for all the initiatives that built on utilization.
We have to either make the updates trickle through the group hierarchy for utilization, which is difficult without making a lot of changes to the current code structure, or introduce a new avg structure at the root level which contains the sum of the utilization of all _tasks_ (not groups) in the group hierarchy and maintain it separately.
None of those two are particularly pretty. Better suggestions?