On 29/04/15 09:32, Michael Turquette wrote:
Quoting Juri Lelli (2015-04-28 10:48:27)
Hi Mike,
I apologize in advance for the long email, but I'd still like to share today's thoughts with you :).
On 28/04/15 05:02, Michael Turquette wrote:
Quoting Juri Lelli (2015-04-27 10:09:50)
[snip]
wake_up_process(gd->task);
So, we always wake up the kthread, even when we know that we won't need a freq change. This might be, I fear, an almost certain source of reasonable complaints and pushback. I understand that we might not want to start optimizing things, but IMHO this point deserves some more thought before posting. Don't you think we could do some level of aggregation before kicking the kthread? In task_tick_fair(), for example, we could just check if we are beyond the 25% threshold and kick the kthread only in that case.
This patch does not check against a threshold. It always requests a rate based on the current utilization plus 25%.
On systems with discretized cpu frequencies (OPPs) we will often target the same OPP, occasionally crossing the boundary into another OPP. On systems with continuous cpu frequencies we will continually give ourselves "room to grow".
Can you make an example of such systems?
CPPC-based systems.
I thought a lot about all of the feedback that my v1 patchset got last week on eas-dev. Two comments in particular colored my views on supporting continuous frequency bands and not relying on a threshold.
First is Ashwins' comment here: https://lists.linaro.org/pipermail/eas-dev/2015-April/000093.html
Second is Morten's reply here: https://lists.linaro.org/pipermail/eas-dev/2015-April/000094.html
If we decide that we only care about OPPs then it is easy to create a threshold for the OPP "bucket" that we are currently in. But in a continuous system creating a threshold is more difficult. E.g. if we decide to use an 80% threshold for a continuous system, we can easily determine whether our current utilization exceeds this threshold at our current capacity/frequency. But what is the new frequency target? Without a table to guide us we have to just make something up!
Right, but I'm still not sure that we want to continuously adapt to the current usage (plus the margin), as we might introduce too much overhead. Also, is it really worth it when we have to activate all this just to save a little more power or go a little faster? This is really blue sky, but maybe a trade-off would be to try to discretize such systems (if it makes sense to control them from the scheduler). Yes, we already have an activation threshold, but I'm not sure this is enough.
So I decided to transmute the threshold into a margin. Instead of checking to see if we crossed some boundary we always try to maintain a bit of overhead. This works for table-based and table-less systems, and allows us to hit the minimum and maximum frequencies without any weird corner cases.
Ok, margin is fine. I kept that in my deltas below. I think we also need that to be able to stabilize capacity requests.
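Just to put rough numbers on it (the utilization value here is made up, purely for illustration): with max_usage = 600 out of SCHED_LOAD_SCALE = 1024, the margin gives 600 * 125 / 100 = 750, so gov_cfs_select_freq() in my deltas below requests (750 * policy->max) >> SCHED_LOAD_SHIFT, i.e. roughly 73% of the maximum frequency. Once the margined value reaches 1024 (max_usage around 820) we simply saturate at policy->max.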
So we can't easily check if the cpu frequency needs to change or not in the scheduler hot path using this method.
You mean because in this case we don't have any reference to base such a threshold on?
Correct.
An alternative is to put the throttle check in the hot path and not kick the thread until we are unthrottled. I need to think about how to do this. I'd like to do it without locking, but mixing 64-bit ktime_t with 32-bit atomic_t is hard. Any ideas?
Don't we already bail out in gov_cfs_update_cpu() if we are not yet past the throttling threshold? This is in the hot path.
Yeah, but it is still hidden behind the rwsem. I'd like to get rid of that too if I can think of a way.
Agreed, we should move away from that.
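Just thinking out loud on the ktime_t vs. atomic_t issue: one option could be to keep the expiry as plain nanoseconds in an atomic64_t (which also works on 32-bit), so the hot path can test it without taking the rwsem. Completely untested sketch, and the field/helper names below (throttle_ns, gov_cfs_throttled) are made up:

static inline bool gov_cfs_throttled(struct gov_data *gd)
{
        /*
         * gd->throttle_ns would be a hypothetical atomic64_t that the
         * kthread sets to ktime_get_ns() + gd->throttle_nsec after each
         * successful transition.
         */
        return ktime_get_ns() < (u64)atomic64_read(&gd->throttle_ns);
}

Worst case a racing reader sees a slightly stale expiry and we queue one extra irq_work (or skip one), which seems harmless.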
Anyway, I played a little bit with this version today and I came up with the following patches. The idea is to reduce triggering points, so that we - in theory - reduce the overall overhead of this thing. I ran simple synthetic workloads to test this, mainly tasks with phases and periodic workloads. I attach some plots that I refer to below (time on the x-axis, frequency on the y-axis).
With the first patch I tried to reduce the number of times we kick the kthread from task_tick_fair(). The idea is to extend the governor API so that we can ask for any capacity required (instead of letting it read the usage signal).
Fig1 shows a light/heavy/light task with the current implementation. As you pointed out, in the ramp-up phase we slowly adapt to the new utilization (each step also requires kicking the kthread). Fig2 shows the up-threshold approach, where we kick the kthread and go to max only when needed. Patch follows.
Thanks for the patches. I'll review them and get back to you but I wanted to respond to your questions above asap.
Thanks!
Best,
- Juri
Regards, Mike
From 9f3d102e3f88d4e1d60c0d9497de709146e7f2ce Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Tue, 28 Apr 2015 14:10:57 +0100
Subject: [PATCH 1/4] sched/cpufreq_sched_cfs: implement direct API
Instead of using get_cpu_usage() we can let each CPU request the capacity it needs. The gov's kthread is responsible for aggregating requests.
A benefit of this new API is shown in task_tick_fair(), where we can request a transition to max opp only when really needed.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/cpufreq_sched_cfs.c | 18 +++++++++++-------
 kernel/sched/fair.c              | 26 +++++++++++++++++++++-----
 kernel/sched/sched.h             |  5 +++--
 3 files changed, 35 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index 040469d..c8c6d2e 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -14,9 +14,10 @@
 
 #include "sched.h"
 
-#define MARGIN_PCT		125 /* taken from imbalance_pct = 125 */
 #define THROTTLE_NSEC		50000000 /* 50ms default */
 
+static DEFINE_PER_CPU(unsigned long, new_capacity);
+
 /**
  * gov_data - per-policy data internal to the governor
  * @throttle: next throttling period expiry. Derived from throttle_nsec
@@ -85,7 +86,7 @@ static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
 	 * lockless behavior.
 	 */
 	for_each_cpu(cpu, policy->cpus) {
-		usage = get_cpu_usage(cpu);
+		usage = per_cpu(new_capacity, cpu);
 		if (usage > max_usage)
 			max_usage = usage;
 	}
@@ -93,15 +94,13 @@ static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
 
 	/* add margin to max_usage based on imbalance_pct */
 	max_usage = max_usage * MARGIN_PCT / 100;
 
-	cpu = cpumask_first(policy->cpus);
-	if (max_usage >= capacity_orig_of(cpu)) {
+	if (max_usage >= SCHED_LOAD_SCALE) {
 		freq = policy->max;
 		goto out;
 	}
 
 	/* freq is current utilization + 25% */
-	freq = max_usage * policy->max / capacity_orig_of(cpu);
+	freq = (max_usage * policy->max) >> SCHED_LOAD_SHIFT;
 
 out:
 	return freq;
@@ -201,7 +200,7 @@ static void gov_cfs_irq_work(struct irq_work *irq_work)
  * gov_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
  * the thread that does the actual work, gov_cfs_thread.
  */
-void gov_cfs_update_cpu(int cpu)
+void gov_cfs_update_cpu(int cpu, unsigned long capacity)
 {
 	struct cpufreq_policy *policy;
 	struct gov_data *gd;
@@ -223,6 +222,7 @@ void gov_cfs_update_cpu(int cpu)
 		goto out;
 	}
 
+	per_cpu(new_capacity, cpu) = capacity;
 	irq_work_queue_on(&gd->irq_work, cpu);
 
 out:
@@ -233,6 +233,7 @@ out:
 static void gov_cfs_start(struct cpufreq_policy *policy)
 {
 	struct gov_data *gd;
+	int cpu;
 
 	/* prepare per-policy private data */
 	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
@@ -251,6 +252,9 @@ static void gov_cfs_start(struct cpufreq_policy *policy)
 	pr_debug("%s: throttle threshold = %u [ns]\n",
 		 __func__, gd->throttle_nsec);
 
+	for_each_cpu(cpu, policy->related_cpus)
+		per_cpu(new_capacity, cpu) = 0;
+
 	/* init per-policy kthread */
 	gd->task = kthread_run(gov_cfs_thread, policy, "kgov_cfs_task");
 	if (IS_ERR_OR_NULL(gd->task))
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 041538e..27e21a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4267,7 +4267,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	}
 
 	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
 
 	hrtick_update(rq);
 }
@@ -4332,7 +4332,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	}
 
 	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
 
 	hrtick_update(rq);
 }
@@ -4800,6 +4800,12 @@ next:
 done:
 	return target;
 }
 
+unsigned long capacity_curr_of(int cpu)
+{
+	return arch_scale_freq_capacity(NULL, cpu);
+}
+
 /*
  * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
@@ -4817,7 +4823,7 @@ done:
  * Without capping the usage, a group could be seen as overloaded (CPU0 usage
  * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
 */
-int get_cpu_usage(int cpu)
+static int get_cpu_usage(int cpu)
 {
 	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
@@ -7820,6 +7826,11 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+	return p->se.avg.utilization_avg_contrib;
+}
+
 /*
  * scheduler tick hitting a task of our scheduling class:
 */
@@ -7827,6 +7838,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	int cpu = task_cpu(curr);
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -7838,8 +7850,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	update_rq_runnable_avg(rq, 1);
 
-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+	if (sched_energy_freq() &&
+	    (capacity_curr_of(cpu) < SCHED_LOAD_SCALE) &&
+	    ((capacity_curr_of(cpu) * 100) <
+	     (task_utilization(curr) * MARGIN_PCT))) {
+		gov_cfs_update_cpu(cpu_of(rq), SCHED_LOAD_SCALE);
+	}
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec23523..3983bd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,11 +1396,12 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
-int get_cpu_usage(int cpu);
 unsigned long capacity_orig_of(int cpu);
+unsigned long capacity_curr_of(int cpu);
 
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED_CFS
-void gov_cfs_update_cpu(int cpu);
+#define MARGIN_PCT		125 /* taken from imbalance_pct = 125 */
+void gov_cfs_update_cpu(int cpu, unsigned long capacity);
 #else
 static inline void gov_cfs_update_cpu(int cpu) {}
 #endif
-- 
2.2.2
I then tried to address the tail effect when the task starts behaving as light again (~2sec in the pictures). The problem is that when we dequeue the task we see no utilization on the rq and we go to min. When the task is enqueued back we go to max (at least the first time) and then we continue doing this sort of ping pong until we converge to the actual (light) utilization. The following patch changes this behaviour as in Fig.3: in the tail we slowly adapt to the new task phase considering the decaying effect of the utilization.
From 81cefca25fa022913dc2913acf71414925b997eb Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Tue, 28 Apr 2015 16:08:37 +0100
Subject: [PATCH 2/4] sched/cpufreq_sched_cfs: (re)move triggering points
- remove the trigger from enqueue_task_fair() and move it to
  select_task_rq_fair(); also consider the task's pre-decayed utilization,
  as we want to stabilize capacity requests

- modify the dequeue_task_fair() trigger; don't scale down when we are
  going idle (this change requires a small addition to the governor's API)
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/cpufreq_sched_cfs.c |  5 +++++
 kernel/sched/fair.c              | 32 ++++++++++++++++++++++----------
 kernel/sched/sched.h             |  1 +
 3 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index c8c6d2e..2fe1684 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -187,6 +187,11 @@ static void gov_cfs_irq_work(struct irq_work *irq_work)
 	wake_up_process(gd->task);
 }
 
+void gov_cfs_reset_cpu(int cpu)
+{
+	per_cpu(new_capacity, cpu) = 0;
+}
+
 /**
  * gov_cfs_update_cpu - interface to scheduler for changing capacity values
  * @cpu: cpu whose capacity utilization has recently changed
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 27e21a1..4e21abf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4266,9 +4266,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		add_nr_running(rq, 1);
 	}
 
-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
-
 	hrtick_update(rq);
 }
 
@@ -4331,8 +4328,18 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_rq_runnable_avg(rq, 1);
 	}
 
-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
+	if(sched_energy_freq()) {
+		/*
+		 * Ask for an update only if we are not going idle.
+		 * If we are going idle we just need to clear our
+		 * current request.
+		 */
+		if (rq->cfs.nr_running)
+			gov_cfs_update_cpu(cpu_of(rq),
+					   rq->cfs.utilization_load_avg);
+		else
+			gov_cfs_reset_cpu(cpu_of(rq));
+	}
 
 	hrtick_update(rq);
 }
@@ -4834,6 +4841,11 @@ static int get_cpu_usage(int cpu)
 	return (usage * capacity) >> SCHED_LOAD_SHIFT;
 }
 
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+	return p->se.avg.utilization_avg_contrib;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -4922,6 +4934,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 unlock:
 	rcu_read_unlock();
 
+	/* We want to consider the pre-decayed utilization */
+	if(sched_energy_freq())
+		gov_cfs_update_cpu(new_cpu,
+				   cpu_rq(new_cpu)->cfs.utilization_load_avg +
+				   task_utilization(p));
+
 	return new_cpu;
 }
 
@@ -7826,11 +7843,6 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
-static inline unsigned long task_utilization(struct task_struct *p)
-{
-	return p->se.avg.utilization_avg_contrib;
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class:
 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3983bd6..6dd8f3a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1402,6 +1402,7 @@ unsigned long capacity_curr_of(int cpu);
 
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED_CFS
 #define MARGIN_PCT		125 /* taken from imbalance_pct = 125 */
 void gov_cfs_update_cpu(int cpu, unsigned long capacity);
+void gov_cfs_reset_cpu(int cpu);
 #else
 static inline void gov_cfs_update_cpu(int cpu) {}
 #endif
-- 
2.2.2
This approach seems to work also for a light/medium/light task. We go to max and then adapt to the real (medium) utilization (Fig.4).
Finally, a couple more patches (the first one should actually be squashed into 01/04) to cover load balancing (not really tested yet).
From 3e7226989c21fdd680279f4f8a150597b5833b95 Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Tue, 28 Apr 2015 16:10:00 +0100
Subject: [PATCH 3/4] sched/cpufreq_sched_cfs: update requested capacity even when throttled
If the kthread is throttled we still need to update requests, or we may end up with stale values.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/cpufreq_sched_cfs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index 2fe1684..c8d9408 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -222,12 +222,13 @@ void gov_cfs_update_cpu(int cpu, unsigned long capacity)
 
 	gd = policy->governor_data;
 
+	per_cpu(new_capacity, cpu) = capacity;
+
 	/* bail early if we are throttled */
 	if (ktime_before(ktime_get(), gd->throttle)) {
 		goto out;
 	}
 
-	per_cpu(new_capacity, cpu) = capacity;
 	irq_work_queue_on(&gd->irq_work, cpu);
 
 out:
-- 
2.2.2
From 333e1741c7de8dfc21f5bb9f2a9c29d4dc84f2de Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Tue, 28 Apr 2015 16:55:22 +0100
Subject: [PATCH 4/4] sched/fair: cpufreq_sched_cfs triggers for load_balancing
This should cover load_balance paths (untested).
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/fair.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e21abf..ad1e7cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7203,6 +7203,14 @@ out_one_pinned:
 	ld_moved = 0;
 
 out:
+	/* dst_grpmask might be NULL for NEWLY_IDLE. */
+	if (sched_energy_freq() && ld_moved && env.dst_grpmask)
+		/*
+		 * dequeue_task_fair() already took care of src_cpu
+		 */
+		gov_cfs_update_cpu(env.dst_cpu,
+				   cpu_rq(env.dst_cpu)->cfs.utilization_load_avg);
+
 	return ld_moved;
 }
 
@@ -7402,8 +7410,12 @@ out_unlock:
 	busiest_rq->active_balance = 0;
 	raw_spin_unlock(&busiest_rq->lock);
 
-	if (p)
+	if (p) {
 		attach_one_task(target_rq, p);
+		if (sched_energy_freq())
+			gov_cfs_update_cpu(cpu_of(target_rq),
+					   target_rq->cfs.utilization_load_avg);
+	}
 
 	local_irq_enable();
-- 
2.2.2
Comments?
Thanks,
- Juri