Quoting Juri Lelli (2015-03-13 03:37:53)
On 11/03/15 17:00, Morten Rasmussen wrote:
On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:
From: Michael Turquette mturquette@linaro.org
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette mturquette@linaro.org
Signed-off-by: Michael Turquette mturquette@linaro.org
 kernel/sched/fair.c  | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  2 ++
 2 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9386b0..1043266 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
[...]
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}

+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).
I was about to say that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.
Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.
Right. We need a way to keep track of "pending requests".
Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask, we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated (or whenever the load contribution for a task changes significantly; more on that at the bottom).
In the case of scaling cpu frequency up we can do as Juri says below and kick the logic whenever we cross the up_threshold.
For scaling down we need to know usage (from get_cpu_usage) of every cpu in the frequency domain, or more specifically we need to know the max usage in the freq domain. This is enough to pick the best capacity state based on floor(max_usage).
If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).
If you want to experiment with hooking into load_balance(), which would cover periodic, idle, and nohz_idle balancing, I think it could be worth a try to set env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.
You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').
{attach,detach}_tasks() (and attach_one_task()) end up calling {activate,deactivate}_task, so we have {enqueue,dequeue}_task for source and destination runqueues. So, even in these cases it seems fine to just evaluate the new situation from {en,de}queue_task_fair() (or even the core calling sites if we want to be more general).
Am I missing something?
I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.
How do we normally detect a significant change in task behavior? If we have an answer to this question then we have a call site to potentially evaluate cpu frequency.
A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.
Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current cap. We kick the thing only if this happened. But yeah, maybe something more clever is possible.
If we update constraints in per-cpu vars (as I outlined above) then could we move the code that does that out of {en,de}queue_fair_task and into some place common that solves this problem? I need some help on this one, but I was imagining something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fce9e0..266f4aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2785,6 +2785,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
 	int cpu = cpu_of(rq_of(cfs_rq));
+	int cap;
 	u64 now;

 	/*
@@ -2818,6 +2819,10 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 				      -utilization_delta);
 	}

+	/* store updated per-cpu usage when update_cfs_rq is true */
+	cap = get_cpu_usage(cpu);
+	atomic_long_set(&per_cpu(usage, cpu), cap);
+
 	trace_sched_load_avg_cpu(cpu, cfs_rq);
 }
The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.
Any ideas on this approach?
Further thoughts:
Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).
I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:
1) locking is already managed for us when we need to update with a new usage value, as _THIS_ cpu is the one that will update its own usage. This means we can avoid costly atomic operations.
2) locking is already managed for us when walking the rq's looking for a max usage in places like load_balance. It is NOT managed when a new task is placed on the rq though, so this would need some synchronization? Or we could ignore the possibility of scaling cpu when a new task is enqueued.
Regards, Mike