On Fri, Mar 13, 2015 at 10:37:53AM +0000, Juri Lelli wrote:
On 11/03/15 17:00, Morten Rasmussen wrote:
On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:
From: Michael Turquette <mturquette@linaro.org>
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance() is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
Signed-off-by: Michael Turquette <mturquette@linaro.org>
 kernel/sched/fair.c  | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  2 ++
 2 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9386b0..1043266 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
[...]
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).
I was about to say that you don't need more than just an int, but if you want to hook into rebalance_domains(), or somewhere in that path where we move tasks between cpus, you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance(), where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.
Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.
Right. We need a way to keep track of "pending requests".
If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).
If you want to experiment with hooking into load_balance(), which would cover periodic, idle, and nohz_idle balancing, I think it could be worth a try setting env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.
You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').
{attach,detach}_tasks() (and attach_one_task()) end up calling {activate,deactivate}_task, so we have {enqueue,dequeue}_task for source and destination runqueues. So, even in these cases it seems fine to just evaluate the new situation from {en,de}queue_task_fair() (or even the core calling sites if we want to be more general).
Am I missing something?
No, it is me not doing my homework well enough. You are right.
You would however potentially get multiple enqueue/dequeue calls to the same cpus if you are doing a load_balance() but I guess the overhead would be small.
I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.
A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.
Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current capacity; we only kick the machinery when that happens. But yeah, maybe something more clever is possible.
To complete the picture we also need to sort out the problems with blocked load and nohz that you encountered so we don't keep the frequency up unnecessarily.