Hi Mike,
Apologies in advance for the long email, but I'd still like to share today's thoughts with you :).
On 28/04/15 05:02, Michael Turquette wrote:
> Quoting Juri Lelli (2015-04-27 10:09:50)
> > [snip]
> >
> > 	wake_up_process(gd->task);
> >
> > So, we always wake up the kthread, even when we know that we won't need a freq change. This might be, I fear, an almost certain source of reasonable complaints and pushback. I understand that we might not want to start optimizing things yet, but IMHO this point deserves some more thought before posting. Don't you think we could do some level of aggregation before kicking the kthread? In task_tick_fair(), for example, we could just check whether we are beyond the 25% threshold and kick the kthread only in that case.
> This patch does not check against a threshold. It always requests a rate based on the current utilization plus 25%.
>
> On systems with discretized cpu frequencies (OPPs) we will often target the same OPP, occasionally crossing the boundary into another one. On systems with continuous cpu frequencies we will continually give ourselves "room to grow".
Can you give an example of such systems?
> So, with this method, we can't easily check in the scheduler hot path whether the cpu frequency needs to change or not.
You mean because in this case we don't have any reference to base such a threshold on?
> An alternative is to put the throttle check in the hot path and not kick the thread until we are unthrottled. I need to think about how to do this. I'd like to do it without locking, but mixing a 64-bit ktime_t with a 32-bit atomic_t is hard. Any ideas?
Don't we already bail out of gov_cfs_update_cpu() if we are not yet past the throttling threshold? That check already sits in the hot path.
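FWIW, one lockless option that sidesteps the 64/32-bit mismatch entirely: keep the expiry as plain nanoseconds in an atomic64_t, which has a generic fallback on 32-bit machines. Untested sketch below; throttle_expiry_ns and the two helpers are made-up names, not fields from this series:

#include <linux/atomic.h>
#include <linux/ktime.h>

/* Untested sketch: 64-bit throttle expiry readable locklessly from the
 * hot path. atomic64_read()/atomic64_set() never tear, even on 32-bit. */
static atomic64_t throttle_expiry_ns;

static bool gov_throttled(void)
{
        /* hot path: one clock read plus one atomic load */
        return ktime_get_ns() < (u64)atomic64_read(&throttle_expiry_ns);
}

static void gov_rearm_throttle(u64 throttle_nsec)
{
        /* kthread side: push the expiry forward after a transition */
        atomic64_set(&throttle_expiry_ns, ktime_get_ns() + throttle_nsec);
}

Something like that would let the tick path skip the wakeup entirely while we are throttled, without touching gd's ktime_t at all.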
Anyway, I played a bit with this version today and came up with the following patches. The idea is to reduce the number of triggering points, so that we - in theory - reduce the overall overhead of this thing. I ran simple synthetic workloads to test it, mainly tasks with phases and periodic workloads. I attach some plots, referred to below, with time on the x-axis and frequency on the y-axis.
With the first patch I tried to reduce the number of times we kick the kthread from task_tick_fair(). The idea is to extend the governor API so that each CPU can request the capacity it needs (instead of letting the governor read the usage signal itself).
Fig1 shows a light/heavy/light task with the current implementation. As you pointed out, in the ramp-up phase we slowly adapt to the new utilization (and each step also requires kicking the kthread). Fig2 shows the up-threshold approach, where we kick the kthread and go straight to max only when really needed. Patch follows.
From 9f3d102e3f88d4e1d60c0d9497de709146e7f2ce Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Tue, 28 Apr 2015 14:10:57 +0100
Subject: [PATCH 1/4] sched/cpufreq_sched_cfs: implement direct API
Instead of using get_cpu_usage() we can let each CPU request the capacity it needs. The gov's kthread is responsible for aggregating requests.
A benefit of this new API is shown in task_tick_fair(), where we can request a transition to max opp only when really needed.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/cpufreq_sched_cfs.c | 18 +++++++++++-------
 kernel/sched/fair.c              | 26 +++++++++++++++++++++-----
 kernel/sched/sched.h             |  5 +++--
 3 files changed, 35 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index 040469d..c8c6d2e 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -14,9 +14,10 @@

 #include "sched.h"

-#define MARGIN_PCT		125 /* taken from imbalance_pct = 125 */
 #define THROTTLE_NSEC		50000000 /* 50ms default */

+static DEFINE_PER_CPU(unsigned long, new_capacity);
+
 /**
  * gov_data - per-policy data internal to the governor
  * @throttle: next throttling period expiry. Derived from throttle_nsec
@@ -85,7 +86,7 @@ static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
	 * lockless behavior.
	 */
	for_each_cpu(cpu, policy->cpus) {
-		usage = get_cpu_usage(cpu);
+		usage = per_cpu(new_capacity, cpu);
		if (usage > max_usage)
			max_usage = usage;
	}
@@ -93,15 +94,13 @@ static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
	/* add margin to max_usage based on imbalance_pct */
	max_usage = max_usage * MARGIN_PCT / 100;

-	cpu = cpumask_first(policy->cpus);
-
-	if (max_usage >= capacity_orig_of(cpu)) {
+	if (max_usage >= SCHED_LOAD_SCALE) {
		freq = policy->max;
		goto out;
	}

	/* freq is current utilization + 25% */
-	freq = max_usage * policy->max / capacity_orig_of(cpu);
+	freq = (max_usage * policy->max) >> SCHED_LOAD_SHIFT;

 out:
	return freq;
@@ -201,7 +200,7 @@ static void gov_cfs_irq_work(struct irq_work *irq_work)
  * gov_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
  * the thread that does the actual work, gov_cfs_thread.
  */
-void gov_cfs_update_cpu(int cpu)
+void gov_cfs_update_cpu(int cpu, unsigned long capacity)
 {
	struct cpufreq_policy *policy;
	struct gov_data *gd;
@@ -223,6 +222,7 @@ void gov_cfs_update_cpu(int cpu)
		goto out;
	}

+	per_cpu(new_capacity, cpu) = capacity;
	irq_work_queue_on(&gd->irq_work, cpu);

 out:
@@ -233,6 +233,7 @@ out:
 static void gov_cfs_start(struct cpufreq_policy *policy)
 {
	struct gov_data *gd;
+	int cpu;

	/* prepare per-policy private data */
	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
@@ -251,6 +252,9 @@ static void gov_cfs_start(struct cpufreq_policy *policy)
	pr_debug("%s: throttle threshold = %u [ns]\n",
		 __func__, gd->throttle_nsec);

+	for_each_cpu(cpu, policy->related_cpus)
+		per_cpu(new_capacity, cpu) = 0;
+
	/* init per-policy kthread */
	gd->task = kthread_run(gov_cfs_thread, policy, "kgov_cfs_task");
	if (IS_ERR_OR_NULL(gd->task))
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 041538e..27e21a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4267,7 +4267,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	}

	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);

	hrtick_update(rq);
 }
@@ -4332,7 +4332,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
	}

	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);

	hrtick_update(rq);
 }
@@ -4800,6 +4800,12 @@ next:
 done:
	return target;
 }
+
+unsigned long capacity_curr_of(int cpu)
+{
+	return arch_scale_freq_capacity(NULL, cpu);
+}
+
 /*
  * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
@@ -4817,7 +4823,7 @@ done:
  * Without capping the usage, a group could be seen as overloaded (CPU0 usage
  * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
  */
-int get_cpu_usage(int cpu)
+static int get_cpu_usage(int cpu)
 {
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long capacity = capacity_orig_of(cpu);
@@ -7820,6 +7826,11 @@ static void rq_offline_fair(struct rq *rq)

 #endif /* CONFIG_SMP */

+static inline unsigned long task_utilization(struct task_struct *p)
+{
+	return p->se.avg.utilization_avg_contrib;
+}
+
 /*
  * scheduler tick hitting a task of our scheduling class:
  */
@@ -7827,6 +7838,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;
+	int cpu = task_cpu(curr);

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
@@ -7838,8 +7850,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

	update_rq_runnable_avg(rq, 1);

-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+	if (sched_energy_freq() &&
+	    (capacity_curr_of(cpu) < SCHED_LOAD_SCALE) &&
+	    ((capacity_curr_of(cpu) * 100) <
+	     (task_utilization(curr) * MARGIN_PCT))) {
+		gov_cfs_update_cpu(cpu_of(rq), SCHED_LOAD_SCALE);
+	}
 }

 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec23523..3983bd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,11 +1396,12 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif

-int get_cpu_usage(int cpu);
 unsigned long capacity_orig_of(int cpu);
+unsigned long capacity_curr_of(int cpu);

 #ifdef CONFIG_CPU_FREQ_GOV_SCHED_CFS
-void gov_cfs_update_cpu(int cpu);
+#define MARGIN_PCT		125 /* taken from imbalance_pct = 125 */
+void gov_cfs_update_cpu(int cpu, unsigned long capacity);
 #else
-static inline void gov_cfs_update_cpu(int cpu) {}
+static inline void gov_cfs_update_cpu(int cpu, unsigned long capacity) {}
 #endif
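As a sanity check on the arithmetic above, here is a throwaway userspace demo of the two checks the patch introduces. All numbers are invented (policy->max, usage, capacity_curr); the defines just mirror the kernel's SCHED_LOAD_SCALE and MARGIN_PCT:

/* Throwaway userspace demo of the patch's arithmetic; not kernel code. */
#include <stdio.h>

#define SCHED_LOAD_SHIFT        10
#define SCHED_LOAD_SCALE        (1UL << SCHED_LOAD_SHIFT)       /* 1024 */
#define MARGIN_PCT              125UL   /* utilization + 25% */

int main(void)
{
        unsigned long policy_max = 1400000;     /* kHz, invented OPP max */
        unsigned long usage = 512;              /* ~50% of one CPU */

        /* gov_cfs_select_freq(): add the margin, then scale to freq units */
        unsigned long max_usage = usage * MARGIN_PCT / 100;     /* 640 */
        unsigned long freq = (max_usage * policy_max) >> SCHED_LOAD_SHIFT;
        printf("requested freq: %lu kHz\n", freq);      /* 875000 */

        /* task_tick_fair() up-threshold: kick the governor only when the
         * current capacity can't fit utilization + 25% */
        unsigned long capacity_curr = 600;      /* below 640 -> kick */
        if (capacity_curr * 100 < usage * MARGIN_PCT)
                printf("tick: request SCHED_LOAD_SCALE (go to max)\n");

        return 0;
}

So a ~50% task ends up requesting 640/1024 = 62.5% of max, and the tick path only escalates to max once the current capacity falls inside the 25% margin of the task's utilization.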