Quoting Juri Lelli (2015-03-13 03:37:53)
On 11/03/15 17:00, Morten Rasmussen wrote:
On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:
From: Michael Turquette mturquette@linaro.org
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette mturquette@linaro.org
Signed-off-by: Michael Turquette mturquette@linaro.org
 kernel/sched/fair.c  | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  2 ++
 2 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9386b0..1043266 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
[...]
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}

+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).
I was about to say that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.
Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.
Right. We need a way to keep track of "pending requests".
Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask, we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated (or whenever the load contribution for a task changes significantly; more on that at the bottom).
In the case of scaling cpu frequency up we can do as Juri says below and kick the logic whenever we cross the up_threshold.
For scaling down we need to know usage (from get_cpu_usage) of every cpu in the frequency domain, or more specifically we need to know the max usage in the freq domain. This is enough to pick the best capacity state based on floor(max_usage).
If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).
If you want to experiment with hooking into load_balance(), which would cover periodic, idle, and nohz_idle balancing, I think it could be worth a try to set env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.
You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').
{attach,detach}_tasks() (and attach_one_task()) end up calling {activate,deactivate}_task, so we have {enqueue,dequeue}_task for source and destination runqueues. So, even in these cases it seems fine to just evaluate the new situation from {en,de}queue_task_fair() (or even the core calling sites if we want to be more general).
Am I missing something?
I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.
How do we normally detect a significant change in task behavior? If we have an answer to this question then we have a call site to potentially evaluate cpu frequency.
A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.
Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current cap. We kick the thing only if this happened. But yeah, maybe something more clever is possible.
If we update constraints in per-cpu vars (as I outlined above) then could we move the code that does that out of {en,de}queue_fair_task and into some place common that solves this problem? I need some help on this one, but I was imagining something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fce9e0..266f4aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2785,6 +2785,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
 	int cpu = cpu_of(rq_of(cfs_rq));
+	int cap;
 	u64 now;

 	/*
@@ -2818,6 +2819,10 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 				      -utilization_delta);
 	}

+	/* store updated per-cpu usage when update_cfs_rq is true */
+	cap = get_cpu_usage(cpu);
+	atomic_long_set(&per_cpu(usage, cpu), cap);
+
 	trace_sched_load_avg_cpu(cpu, cfs_rq);
 }
The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.
Any ideas on this approach?
Further thoughts:
Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).
I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:
1) locking is already managed for us when we need to update with a new usage value, as _THIS_ cpu is the one that will update its own usage. This means we can avoid costly atomic operations.
2) locking is already managed for us when walking the rq's looking for a max usage in places like load_balance. It is NOT managed when a new task is placed on the rq though, so this would need some synchronization? Or we could ignore the possibility of scaling cpu when a new task is enqueued.
Regards, Mike