Re: [Eas-dev] [RFC internal v2 4/4] sched: cpufreq_sched_cfs: PELT-based cpu frequency scaling

28 Apr 2015

Hi Mike,
I apologize in advance for the long email, but I'd still want to share
with you today's thoughts :).
On 28/04/15 05:02, Michael Turquette wrote:
...
Quoting Juri Lelli (2015-04-27 10:09:50)
[snip]
...
...
...


  wake_up_process(gd->task);



So, we always wake up the kthread, even when we know that we won't
need a freq change. This might be, I fear, an almost certain source of
reasonable complain and pushback. I understand that we might not want
to start optimizing things, but IMHO this point deserves some more
thought before posting. Don't you think we could do some level of
aggregation before kicking the kthread? In task_tick_fair(), for
example, we could just check if we are beyond the 25% threshold and kick
the kthread only in that case.
This patch does not check against a threshold. It always requests a rate
based on the current utilization plus 25%.
On systems with discretized cpu frequencies (opps) we will often target
the same opp, occasionally crossing the boundary into another opp. On
systems with continuous cpu frequencies we will continually give
ourselves "room to grow".
Can you make an example of such systems?
...
So we can't easily check if the cpu frequency needs to change or not in
the scheduler hot path using this method.
You mean because in this case we don't have any reference to base such
a threshold on?
...
An alternative is to put the throttle check in the hot path and not kick
the thread until we are unthrottled. I need to think on how to do this.
I'd like to do it without locking, but mixing 64-bit ktime_t with 32-bit
atomit_t is hard. Any ideas?
Don't we already bail out in gov_cfs_update_cpu() if we are not yet
past the throttling threshold? This is in the hot path.
Anyway, I played a little bit with this version today and I came up with
the following patches. The idea is to reduce triggering points, so that
we - in theory - reduce the overall overhead of this thing. I ran
simple synthetic workloads to test this, mainly task with phases and
periodic workloads. I attach some plots to which I refer below, time on
x-axis and freqs on y-axis.
With the first patch I tried to reduce the number of times we kick the
kthread from task_tick_fair(). The idea is to extend the governor API
so that we can ask for any capacity required (instead of letting it
read the usage signal).
Fig1 shows a light/heavy/light task with the current implementation.
As you pointed out in the rump up phase we slowly adapt to the new
utilization (each step requires also kicking the kthread). Fig2 shows
the up threshold approach where we kick the kthread and go to max only
when needed. Patch follows.
...
From 9f3d102e3f88d4e1d60c0d9497de709146e7f2ce Mon Sep 17 00:00:00 2001
From: Juri Lelli juri.lelli@arm.com
Date: Tue, 28 Apr 2015 14:10:57 +0100
Subject: [PATCH 1/4] sched/cpufreq_sched_cfs: implement direct API
Instead of using get_cpu_usage() we can let each CPU
request the capacity it needs. The gov's kthread is
responsible for aggregating requests.
A benefit of this new API is shown in task_tick_fair(),
where we can request a transition to max opp only when
really needed.
Signed-off-by: Juri Lelli juri.lelli@arm.com
---
 kernel/sched/cpufreq_sched_cfs.c | 18 +++++++++++-------
 kernel/sched/fair.c              | 26 +++++++++++++++++++++-----
 kernel/sched/sched.h             |  5 +++--
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index 040469d..c8c6d2e 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -14,9 +14,10 @@
#include "sched.h"
-#define MARGIN_PCT		125 /* taken from imbalance_pct = 125 */
 #define THROTTLE_NSEC		50000000 /* 50ms default */
+static DEFINE_PER_CPU(unsigned long, new_capacity);
+
 /**
  * gov_data - per-policy data internal to the governor
  * @throttle: next throttling period expiry. Derived from throttle_nsec
@@ -85,7 +86,7 @@ static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
     * lockless behavior.
     */
    for_each_cpu(cpu, policy->cpus) {
-		usage = get_cpu_usage(cpu);
+		usage = per_cpu(new_capacity, cpu);
    	if (usage > max_usage)
    		max_usage = usage;
    }
@@ -93,15 +94,13 @@ static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
    /* add margin to max_usage based on imbalance_pct */
    max_usage = max_usage * MARGIN_PCT / 100;
-	cpu = cpumask_first(policy->cpus);
-
-	if (max_usage >= capacity_orig_of(cpu)) {
+	if (max_usage >= SCHED_LOAD_SCALE) {
    	freq = policy->max;
    	goto out;
    }
/* freq is current utilization + 25% */
-	freq = max_usage * policy->max / capacity_orig_of(cpu);
+	freq = (max_usage * policy->max) >> SCHED_LOAD_SHIFT;
out:
    return freq;
@@ -201,7 +200,7 @@ static void gov_cfs_irq_work(struct irq_work *irq_work)
  * gov_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
  * the thread that does the actual work, gov_cfs_thread.
  */
-void gov_cfs_update_cpu(int cpu)
+void gov_cfs_update_cpu(int cpu, unsigned long capacity)
 {
    struct cpufreq_policy *policy;
    struct gov_data *gd;
@@ -223,6 +222,7 @@ void gov_cfs_update_cpu(int cpu)
    	goto out;
    }
+	per_cpu(new_capacity, cpu) = capacity;
    irq_work_queue_on(&gd->irq_work, cpu);
out:
@@ -233,6 +233,7 @@ out:
 static void gov_cfs_start(struct cpufreq_policy *policy)
 {
    struct gov_data *gd;
+	int cpu;
/* prepare per-policy private data */
    gd = kzalloc(sizeof(*gd), GFP_KERNEL);
@@ -251,6 +252,9 @@ static void gov_cfs_start(struct cpufreq_policy *policy)
    pr_debug("%s: throttle threshold = %u [ns]\n",
    	  __func__, gd->throttle_nsec);
+	for_each_cpu(cpu, policy->related_cpus)
+		per_cpu(new_capacity, cpu) = 0;
+
    /* init per-policy kthread */
    gd->task = kthread_run(gov_cfs_thread, policy, "kgov_cfs_task");
    if (IS_ERR_OR_NULL(gd->task))
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 041538e..27e21a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4267,7 +4267,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
    }
if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
hrtick_update(rq);
 }
@@ -4332,7 +4332,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
    }
if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
hrtick_update(rq);
 }
@@ -4800,6 +4800,12 @@ next:
 done:
    return target;
 }
+
+unsigned long capacity_curr_of(int cpu)
+{
+	return arch_scale_freq_capacity(NULL, cpu);
+}
+
 /*
  * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
@@ -4817,7 +4823,7 @@ done:
  * Without capping the usage, a group could be seen as overloaded (CPU0 usage
  * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
  */
-int get_cpu_usage(int cpu)
+static int get_cpu_usage(int cpu)
 {
    unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
    unsigned long capacity = capacity_orig_of(cpu);
@@ -7820,6 +7826,11 @@ static void rq_offline_fair(struct rq *rq)
#endif /* CONFIG_SMP */
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+	return p->se.avg.utilization_avg_contrib;
+}
+
 /*
  * scheduler tick hitting a task of our scheduling class:
  */
@@ -7827,6 +7838,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;
+	int cpu = task_cpu(curr);
for_each_sched_entity(se) {
    	cfs_rq = cfs_rq_of(se);
@@ -7838,8 +7850,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
update_rq_runnable_avg(rq, 1);
-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq));
+	if (sched_energy_freq() &&
+	    (capacity_curr_of(cpu) < SCHED_LOAD_SCALE) &&
+	    ((capacity_curr_of(cpu) * 100) <
+	     (task_utilization(curr) * MARGIN_PCT))) {
+		gov_cfs_update_cpu(cpu_of(rq), SCHED_LOAD_SCALE);
+	}
 }
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec23523..3983bd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,11 +1396,12 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
-int get_cpu_usage(int cpu);
 unsigned long capacity_orig_of(int cpu);
+unsigned long capacity_curr_of(int cpu);
#ifdef CONFIG_CPU_FREQ_GOV_SCHED_CFS
-void gov_cfs_update_cpu(int cpu);
+#define MARGIN_PCT 125 /* taken from imbalance_pct = 125 */
+void gov_cfs_update_cpu(int cpu, unsigned long capacity);
 #else
 static inline void gov_cfs_update_cpu(int cpu) {}
 #endif
-- 
2.2.2

I then tried to address the tail effect when the task starts behaving as
light again (~2sec in the pictures). The problem is that when we dequeue
the task we see no utilization on the rq and we go to min. When the task
is enqueued back we go to max (at least the first time) and then we
continue doing this sort of ping pong until we converge to the actual
(light) utilization. The following patch changes this behaviour as in
Fig.3: in the tail we slowly adapt to the new task phase considering
the decaying effect of the utilization.

>From 81cefca25fa022913dc2913acf71414925b997eb Mon Sep 17 00:00:00 2001
From: Juri Lelli juri.lelli@arm.com
Date: Tue, 28 Apr 2015 16:08:37 +0100
Subject: [PATCH 2/4] sched/cpufreq_sched_cfs: (re)move triggering points

 - remove the trigger in enqueue_task_fair and move it in
   select_task_rq_fair(), also consider the pre-decayed tasks
   utilization as we want to stabilize capacity

 - modify dequeue_task_fair() trigger; don't scale down when
   we are going idle (this change requires a small addition
   to governor's API)

Signed-off-by: Juri Lelli juri.lelli@arm.com
---
 kernel/sched/cpufreq_sched_cfs.c |  5 +++++
 kernel/sched/fair.c              | 32 ++++++++++++++++++++++----------
 kernel/sched/sched.h             |  1 +
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index c8c6d2e..2fe1684 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -187,6 +187,11 @@ static void gov_cfs_irq_work(struct irq_work *irq_work)
    wake_up_process(gd->task);
 }

+void gov_cfs_reset_cpu(int cpu)
+{
+	per_cpu(new_capacity, cpu) = 0;
+}
+
 /**
  * gov_cfs_update_cpu - interface to scheduler for changing capacity values
  * @cpu: cpu whose capacity utilization has recently changed
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 27e21a1..4e21abf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4266,9 +4266,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
    	add_nr_running(rq, 1);
    }

-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
-
    hrtick_update(rq);
 }

@@ -4331,8 +4328,18 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
    	update_rq_runnable_avg(rq, 1);
    }

-	if(sched_energy_freq())
-		gov_cfs_update_cpu(cpu_of(rq), rq->cfs.utilization_load_avg);
+	if(sched_energy_freq()) {
+		/*
+		 * Ask for an update only if we are not going idle.
+		 * If we are going idle we just need to clear our
+		 * current request.
+		 */
+		if (rq->cfs.nr_running)
+			gov_cfs_update_cpu(cpu_of(rq),
+				rq->cfs.utilization_load_avg);
+		else
+			gov_cfs_reset_cpu(cpu_of(rq));
+	}

    hrtick_update(rq);
 }
@@ -4834,6 +4841,11 @@ static int get_cpu_usage(int cpu)
    return (usage * capacity) >> SCHED_LOAD_SHIFT;
 }

+static inline unsigned long task_utilization(struct task_struct *p)
+{
+	return p->se.avg.utilization_avg_contrib;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -4922,6 +4934,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 unlock:
    rcu_read_unlock();

+	/* We want to consider the pre-decayed utilization */
+	if(sched_energy_freq())
+		gov_cfs_update_cpu(new_cpu,
+				   cpu_rq(new_cpu)->cfs.utilization_load_avg +
+				   task_utilization(p));
    return new_cpu;
 }

@@ -7826,11 +7843,6 @@ static void rq_offline_fair(struct rq *rq)

 #endif /* CONFIG_SMP */

-static inline unsigned long task_utilization(struct task_struct *p)
-{
-	return p->se.avg.utilization_avg_contrib;
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3983bd6..6dd8f3a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1402,6 +1402,7 @@ unsigned long capacity_curr_of(int cpu);
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED_CFS
 #define MARGIN_PCT 125 /* taken from imbalance_pct = 125 */
 void gov_cfs_update_cpu(int cpu, unsigned long capacity);
+void gov_cfs_reset_cpu(int cpu);
 #else
 static inline void gov_cfs_update_cpu(int cpu) {}
 #endif
-- 
2.2.2

This approach seems to work also for a light/medium/light task. We
go to max and then adapt the the real (medium) utilization (Fig.4).

Finally a couple of patches more (first one should actually be
squashed in 01/04) to cover load_balancing (not really tested
already).

>From 3e7226989c21fdd680279f4f8a150597b5833b95 Mon Sep 17 00:00:00 2001
From: Juri Lelli juri.lelli@arm.com
Date: Tue, 28 Apr 2015 16:10:00 +0100
Subject: [PATCH 3/4] sched/cpufreq_sched_cfs: update requested capacity even
 when throttled

If the kthread is throttled we still need to update requests, or
we may end-up with stale values.

Signed-off-by: Juri Lelli juri.lelli@arm.com
---
 kernel/sched/cpufreq_sched_cfs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
index 2fe1684..c8d9408 100644
--- a/kernel/sched/cpufreq_sched_cfs.c
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -222,12 +222,13 @@ void gov_cfs_update_cpu(int cpu, unsigned long capacity)

    gd = policy->governor_data;

+	per_cpu(new_capacity, cpu) = capacity;
+
    /* bail early if we are throttled */
    if (ktime_before(ktime_get(), gd->throttle)) {
    	goto out;
    }

-	per_cpu(new_capacity, cpu) = capacity;
    irq_work_queue_on(&gd->irq_work, cpu);

 out:
-- 
2.2.2

>From 333e1741c7de8dfc21f5bb9f2a9c29d4dc84f2de Mon Sep 17 00:00:00 2001
From: Juri Lelli juri.lelli@arm.com
Date: Tue, 28 Apr 2015 16:55:22 +0100
Subject: [PATCH 4/4] sched/fair: cpufreq_sched_cfs triggers for load_balancing

This should cover load_balance paths (untested).

Signed-off-by: Juri Lelli juri.lelli@arm.com
---
 kernel/sched/fair.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e21abf..ad1e7cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7203,6 +7203,14 @@ out_one_pinned:

    ld_moved = 0;
 out:
+	/* dst_grpmask might be NULL for NEWLY_IDLE. */
+	if (sched_energy_freq() && ld_moved && env.dst_grpmask)
+		/*
+		 * dequeue_task_fair() already took care of src_cpu
+		 */
+		gov_cfs_update_cpu(env.dst_cpu,
+			cpu_rq(env.dst_cpu)->cfs.utilization_load_avg);
+
    return ld_moved;
 }

@@ -7402,8 +7410,12 @@ out_unlock:
    busiest_rq->active_balance = 0;
    raw_spin_unlock(&busiest_rq->lock);

-	if (p)
+	if (p) {
    	attach_one_task(target_rq, p);
+		if (sched_energy_freq())
+			gov_cfs_update_cpu(cpu_of(target_rq),
+				target_rq->cfs.utilization_load_avg);
+	}

    local_irq_enable();

-- 
2.2.2

Comments?

Thanks,

- Juri

    

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

Re: [Eas-dev] [RFC internal v2 4/4] sched: cpufreq_sched_cfs: PELT-based cpu frequency scaling