On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:
From: Michael Turquette <mturquette@linaro.org>
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_task_fair? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
Signed-off-by: Michael Turquette <mturquette@linaro.org>
 kernel/sched/fair.c  | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  2 ++
 2 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9386b0..1043266 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
[...]
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}

+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
You are only affecting one cpu in {en,de}queue_task_fair(), so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents' rqs. They all belong to one cpu, and that is cpu_of(rq).
I was about to say that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.
Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.
If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).
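To illustrate, for the enqueue side that would collapse both hunks above into a single set. This is only an untested sketch against the names introduced by the patch (update_cpus):

```c
/*
 * Untested sketch: every sched_entity walked by the
 * for_each_sched_entity() loops in enqueue_task_fair() belongs
 * to the same cpu, cpu_of(rq), so a single set at the end of the
 * function covers the whole hierarchy.
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	/* ... existing enqueue logic unchanged ... */

	/* one set covers all entities enqueued above */
	cpumask_set_cpu(cpu_of(rq), &update_cpus);

	hrtick_update(rq);
}
```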
If you want to experiment with hooking into load_balance(), which would cover periodic, idle, and nohz_idle balancing, I think it could be worth a try setting env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.
You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').
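Something along these lines, again an untested sketch using the patch's update_cpus mask (the surrounding variables are the existing ones in load_balance() and active_load_balance_cpu_stop()):

```c
	/*
	 * Untested sketch, in load_balance(), right after
	 * attach_tasks(&env): mark both ends of the migration for
	 * frequency re-evaluation.
	 */
	if (cur_ld_moved) {
		cpumask_set_cpu(env.src_cpu, &update_cpus);
		cpumask_set_cpu(env.dst_cpu, &update_cpus);
	}

	/*
	 * Likewise in active_load_balance_cpu_stop(), after
	 * attach_one_task(), to catch active balance migrations:
	 */
	cpumask_set_cpu(busiest_rq->cpu, &update_cpus);
	cpumask_set_cpu(target_cpu, &update_cpus);
```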
I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.
A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.
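The naive tick-driven fallback would amount to no more than this (untested sketch, again using the patch's update_cpus):

```c
/*
 * Untested sketch: mark the cpu on every tick so that a task
 * whose behaviour changed (e.g. small -> heavy) still gets its
 * cpu's frequency re-evaluated, even though it never dequeues.
 * Downside: this is polling, not event driven.
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	/* ... existing tick logic unchanged ... */

	cpumask_set_cpu(cpu_of(rq), &update_cpus);
}
```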
Morten