On 04/01/2016 12:49 PM, Peter Zijlstra wrote:
On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
When a task is migrated from CPU_A to CPU_B, the scheduler removes the task's load/util from the task's cfs_rq and adds them to the destination cfs_rq. But if the kernel enables CONFIG_FAIR_GROUP_SCHED, this cfs_rq is not the same as the CPU's root cfs_rq. As a result, after the task is migrated to CPU_B, CPU_A still carries the task's stale load/util values, while CPU_B does not yet reflect the new load/util introduced by the task.
So this patch applies the task's load/util to the CPU's root cfs_rq as well, so that the root cfs_rq really reflects the task's migration.
Sorry but that is unintelligible.
What is the problem? Why do we care? How did you fix it? and at what cost?
I think I follow - Leo please correct me if I mangle your intentions. It's an issue that Morten and Dietmar had mentioned to me as well.
Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a task group other than the root. The task migrates from one CPU to another. The cfs_rq.avg structures on the src and dest CPUs corresponding to the group containing the task are updated immediately via remove_entity_load_avg()/update_cfs_rq_load_avg() and attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root on the src and dest are not immediately updated. The root cfs_rq.avg must catch up over time with PELT.
As to why we care, there's at least one issue which may or may not be Leo's - the root cfs_rq is the one whose avg structure we read from to determine the frequency with schedutil. If you have a cpu-bound task in a non-root cgroup which periodically migrates among CPUs on an otherwise idle system, I believe each time it migrates the frequency would drop to fmin and have to ramp up again with PELT.
Leo, I noticed you did not modify detach_entity_load_avg(). I think that would be needed to avoid the task's stats being double-counted for a while after switched_from_fair() or task_move_group_fair().