Hi,
This is based on the work done by Steve Muckle [1] before he left Linaro,
and most of the patches are still under his authorship. I have made a
couple of improvements (detailed in the individual patches) and removed the
late callback support [2], as I wasn't sure of the value it adds. We can
include it separately if others feel it is required. This series is
based on pm/linux-next with patches [3] and [4] applied on top of it.
With Android UI and benchmarks, the latency of the cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into schedutil are only made from the scheduler if the target CPU of the
event is the same as the current CPU. This means there are certain
situations where a target CPU may not run schedutil for some time.
One test case that shows this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that new tasks should receive maximum demand
initially, this should result in CPU0 increasing frequency immediately.
Because of the limitation mentioned above, though, this does not occur.
This was verified using ftrace with the sample application [5].
This patchset updates the scheduler to call cpufreq callbacks for remote
CPUs as well, and updates the schedutil governor to deal with them. An
additional flag is added to cpufreq policies to avoid sending IPIs to
remote CPUs to update the frequency, if any CPU on the platform can change
the frequency of any other CPU.
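As a rough illustration of the decision described above, here is a minimal
userspace sketch in C. It is not the code from this series: struct
policy_sketch, enum update_path and choose_update_path() are hypothetical
names made up for this example, and only the flag name
dvfs_possible_from_any_cpu comes from the patch list below.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a cpufreq policy (not the kernel struct). */
struct policy_sketch {
	unsigned long cpus_mask;          /* bit n set: CPU n belongs to this policy */
	bool dvfs_possible_from_any_cpu;  /* flag name taken from this series */
};

enum update_path {
	UPDATE_LOCAL,    /* the current CPU may program the frequency itself */
	UPDATE_VIA_IPI,  /* the work has to be sent to a CPU of the policy */
};

/*
 * Decide how a frequency update requested on this_cpu for policy p is
 * carried out. Assumption: a CPU that belongs to the policy can always
 * update it; any other CPU can do so only if the platform allows DVFS
 * from any CPU, otherwise an IPI to a CPU of the policy is needed.
 */
static enum update_path choose_update_path(const struct policy_sketch *p,
					   int this_cpu)
{
	if (p->cpus_mask & (1UL << this_cpu))
		return UPDATE_LOCAL;
	if (p->dvfs_possible_from_any_cpu)
		return UPDATE_LOCAL;
	return UPDATE_VIA_IPI;
}
```

With a policy covering CPUs 0-3, a callback running on CPU5 would take the
IPI path unless the flag is set, in which case CPU5 can update the
frequency directly.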
This series was tested with a couple of use cases (Android: hackbench,
recentfling, galleryfling, vellamo; Ubuntu: hackbench) on an ARM HiKey
board (64-bit octa-core, single policy). Only galleryfling showed minor
improvements, while the others didn't show much deviation.
The reason is that this patchset only targets a corner case, where all of
the following must be true to improve performance, and that
doesn't happen too often with these tests:
- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to
higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick,
without this patchset.
--
viresh
[1] https://git.linaro.org/people/steve.muckle/kernel.git/log/?h=pmwg-integrati…
[2] https://git.linaro.org/people/steve.muckle/kernel.git/commit/?h=pmwg-integr…
[3] https://marc.info/?l=linux-kernel&m=148766093718487&w=2
[4] https://marc.info/?l=linux-kernel&m=148903231720432&w=2
[5] http://pastebin.com/7LkMSRxE
Steve Muckle (8):
sched: cpufreq: add cpu to update_util_data
irq_work: add irq_work_queue_on for !CONFIG_SMP
sched: cpufreq: extend irq work to support fast switches
sched: cpufreq: remove smp_processor_id() in remote paths
sched: cpufreq: detect, process remote callbacks
cpufreq: governor: support scheduler cpufreq callbacks on remote CPUs
intel_pstate: ignore scheduler cpufreq callbacks on remote CPUs
sched: cpufreq: enable remote sched cpufreq callbacks
Viresh Kumar (1):
cpufreq: Add dvfs_possible_from_any_cpu policy flag
drivers/cpufreq/cpufreq-dt.c | 1 +
drivers/cpufreq/cpufreq_governor.c | 2 +-
drivers/cpufreq/intel_pstate.c | 3 ++
include/linux/cpufreq.h | 9 +++++
include/linux/irq_work.h | 7 ++++
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq.c | 1 +
kernel/sched/cpufreq_schedutil.c | 80 +++++++++++++++++++++++++++++---------
kernel/sched/fair.c | 6 ++-
kernel/sched/sched.h | 3 +-
10 files changed, 90 insertions(+), 23 deletions(-)
--
2.7.1.410.g6faf27b
Hi,
We are only a week away from the OSPM-summit!
Pack your bags (or stay tuned for the live streaming).
Don't forget to subscribe to the summit mailing list to receive updates by
either following the instructions available at the following link
http://groups.google.com/group/ospm-summit-2017/boxsubscribe?email=<your_email>
or sending an email to ospm-summit-2017+subscribe(a)googlegroups.com
Archives are available at https://groups.google.com/forum/#!forum/ospm-summit-2017
More information about the schedule and logistics follows.
---
Power Management and Scheduling in the Linux Kernel (OSPM-summit)
April 3-4, 2017
Scuola Superiore Sant'Anna (SSSA)
Pisa, Italy
http://retis.sssup.it/ospm-summit/
---
.:: FOCUS
Power management and scheduling techniques to reduce energy consumption while
meeting performance and latency requirements are receiving considerable
attention from the Linux Kernel development community.
The Power Management and Scheduling in the Linux Kernel summit (OSPM-summit)
aims to foster further interest and discussion.
.:: SCHEDULE
The summit is organized to cover two days of discussions and talks.
What follows is a tentative schedule, subject to last minute changes.
Find more info and real time updates on this shared document:
https://docs.google.com/spreadsheets/d/1B-IsUIGitvRa7ZzppEAJBMgGpIgRAAnu_Oh…
Monday (2017-04-03)
*******************
09:00AM - 09:30AM Welcome and Introduction (DAY 1)
---
09:30AM - 10:20AM Tooling/LISA
---
10:20AM - 11:10AM About The Need to Power Instrument The Linux Kernel
---
11:10AM - 11:20AM Break
---
11:20AM - 12:10PM What are the latest evolutions in PELT and what next
---
12:10PM - 01:00PM PELT decay clamping/UTIL_EST
---
01:00PM - 02:30PM Lunch
---
02:30PM - 03:20PM EAS where we are
---
03:20PM - 04:10PM Energy model/Exotic topologies
---
04:10PM - 04:20PM Break
---
04:20PM - 05:10PM Schedtune
---
05:10PM - 06:00PM SCHED_DEADLINE and reclaiming
Tuesday (2017-04-04)
********************
09:00AM - 09:30AM Welcome and Introduction (DAY 2)
---
09:30AM - 10:20AM Discussion about possible improvements in the schedutil governor
---
10:20AM - 11:10AM Schedutil for SCHED_DEADLINE
---
11:10AM - 11:20AM Break
---
11:20AM - 12:10PM Parameterizing CFS load balancing: nr_running/util/load
---
12:10PM - 01:00PM Tracepoints for PELT
---
01:00PM - 02:30PM Lunch
---
02:30PM - 03:20PM IRQ prediction
---
03:20PM - 04:10PM I/O scheduling and power management with storage devices
---
04:10PM - 04:20PM Break
---
04:20PM - 05:10PM SCHED_DEADLINE group scheduling
---
05:10PM - 06:00PM A Hierarchical Scheduling Model for Dynamic Soft-Realtime Systems
---
06:00PM - 06:30PM Closing Remarks
We are looking into setting up live streaming of the sessions. Details
will be shared soon through the shared doc mentioned above and the
event mailing list.
The list of attendees is also available in the doc and on the event website.
.:: VENUE
The workshop will take place at ReTiS Lab*, Scuola Superiore Sant'Anna, Pisa,
Italy. Pisa is a small town: walking from the city center to the venue
takes 20 minutes, and walking from the airport to the city center takes 30
minutes. More details are available from the summit web page:
http://retis.sssup.it/ospm-summit/
A map of the town with venue location, points of interest and
transportation information is available at:
https://drive.google.com/open?id=1ANKOXr2cuZkABXskDurgrGdl_js&usp=sharing
The bus from the airport to the city centre takes about 10 min and costs 1.20 euros
(2 euros if bought on board). Large bills are usually not accepted.
A taxi from the airport to the city centre costs 10 to 15 euros. Credit cards are not
accepted.
.:: ORGANIZERS (in alphabetical order)
Luca Abeni (SSSA)
Patrick Bellasi (ARM)
Tommaso Cucinotta (SSSA)
Dietmar Eggemann (ARM)
Sudeep Holla (ARM)
Juri Lelli (ARM)
Lorenzo Pieralisi (ARM)
Morten Rasmussen (ARM)
The current implementation of overutilization aborts energy aware
scheduling if any cpu in the system is overutilized. This patch introduces
an overutilization flag per sched domain level instead of a single
system-wide flag. Load balancing is done at the sched domain where any
of the cpus is overutilized. If energy aware scheduling is
enabled and no cpu in a sched domain is overutilized,
load balancing is skipped for that sched domain and energy aware
scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure
that is common across all the sched domains at a level. The new flag
is placed in this structure so that all the sched domains at the
same level share the flag. In case of an overutilized cpu, the flag gets
set at the level1 sched_domain. The flag at the parent sched_domain level gets
set in either of the following two scenarios:
1. There is a misfit task on one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
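The rules above can be sketched as a small userspace model in C. This is
only an illustration of the logic that the diff below adds to
update_sd_lb_stats(), not kernel code: struct domain_sketch, struct
domain_stats and update_overutilized() are hypothetical names, and the
value 1280 used for the capacity margin (roughly 20% headroom) is an
assumption rather than something stated in this patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY_SCALE  1024UL
#define CAPACITY_MARGIN 1280UL  /* assumed value, about 20% headroom */

/* Hypothetical stand-in for a sched domain level and its shared flag. */
struct domain_sketch {
	struct domain_sketch *parent;
	bool asym_capacity;  /* cpus at this level differ in capacity */
	bool overutilized;   /* models sd->shared->overutilized */
};

/* Hypothetical summary of what one load-balance pass observed. */
struct domain_stats {
	bool any_cpu_overutilized;
	bool misfit_task;
	unsigned long total_util;
	unsigned long total_capacity;
};

static void update_overutilized(struct domain_sketch *sd,
				const struct domain_stats *st)
{
	struct domain_sketch *p;

	/* Set or clear the flag at this level. */
	sd->overutilized = st->any_cpu_overutilized;

	/* Scenario 1: misfit task, flag the nearest asymmetric ancestor. */
	if (st->misfit_task) {
		for (p = sd->parent; p; p = p->parent) {
			if (p->asym_capacity) {
				p->overutilized = true;
				break;
			}
		}
	}

	/* Scenario 2: domain utilization exceeds capacity (with margin). */
	if (sd->parent &&
	    st->total_capacity * CAPACITY_SCALE <
	    st->total_util * CAPACITY_MARGIN)
		sd->parent->overutilized = true;
}
```

Note that in this model the misfit case walks up to the first ancestor with
asymmetric capacities, while the utilization case only flags the immediate
parent, which matches the two branches in the diff.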
This implementation can still have corner cases with respect to
misfit tasks. For example, consider a sched group with n cpus and
n+1 tasks that are each 70% utilized. Ideally this is a case for load balance to happen
in a parent sched domain. But neither is the total group utilization
high enough for the load balance to be triggered
in the parent domain, nor is there a cpu with a single overutilized task so
that a load balance is triggered in a parent domain. But again, this could be
a purely academic scenario, as during task wake up these tasks will be placed
more appropriately.
Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of a misfit task, it is ensured that a load balance is
triggered in a parent sched domain with asymmetric cpu capacities.
include/linux/sched.h | 1 +
kernel/sched/core.c | 7 ++-
kernel/sched/fair.c | 138 +++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 3 --
4 files changed, 117 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c5122e..971842a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1112,6 +1112,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ bool overutilized;
};
struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31a466f..e0a8758 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl,
* For all levels sharing cache; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489f6d3..9d2bb07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ return sd->shared->overutilized;
+ else
+ return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = false;
+}
+
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
+ struct sched_domain *sd;
struct sched_entity *se = &p->se;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
- cpu_overutilized(rq->cpu))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!task_new && !is_sd_overutilized(sd) &&
+ cpu_overutilized(rq->cpu))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
hrtick_update(rq);
}
@@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
unsigned long max_spare = 0;
struct sched_domain *sd;
- rcu_read_lock();
-
+ /* The rcu lock is/should be held in the caller function */
sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
}
unlock:
- rcu_read_unlock();
-
if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
&& cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
}
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
- return select_energy_cpu_brute(p, prev_cpu);
-
rcu_read_lock();
+ sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+ if (energy_aware() &&
+ !is_sd_overutilized(sd)) {
+ new_cpu = select_energy_cpu_brute(p, prev_cpu);
+ goto unlock;
+ }
+
+ sd = NULL;
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
+
+unlock:
rcu_read_unlock();
return new_cpu;
@@ -7366,6 +7399,7 @@ struct sd_lb_stats {
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload, bool *overutilized)
+ bool *overload, bool *overutilized, bool *misfit_task)
{
unsigned long load;
int i, nr_running;
@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*overutilized = true;
+ /*
+ * If the cpu is overutilized and if there is only one
+ * current task in cfs runqueue, it is potentially a misfit
+ * task.
+ */
+ if (rq->cfs.h_nr_running == 1)
+ *misfit_task = true;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -7825,11 +7868,11 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
*/
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
+ struct sched_domain *child = env->sd->child, *sd;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
+ bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload, &overutilized);
+ &overload, &overutilized,
+ &misfit_task);
if (local_group)
goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -7895,14 +7940,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
+ }
- /* Update over-utilization (tipping point, U >= 0) indicator */
- if (env->dst_rq->rd->overutilized != overutilized)
- env->dst_rq->rd->overutilized = overutilized;
- } else {
- if (!env->dst_rq->rd->overutilized && overutilized)
- env->dst_rq->rd->overutilized = true;
+ if (overutilized)
+ set_sd_overutilized(env->sd);
+ else
+ clear_sd_overutilized(env->sd);
+
+ /*
+ * If there is a misfit task in one cpu in this sched_domain
+ * it is likely that the imbalance cannot be sorted out among
+ * the cpu's in this sched_domain. In this case set the
+ * overutilized flag at the parent sched_domain.
+ */
+ if (misfit_task) {
+
+ sd = env->sd->parent;
+
+ /*
+ * In case of a misfit task, load balance at the parent
+ * sched domain level will make sense only if the cpus
+ * have a different capacity. If cpus at a domain level have
+ * the same capacity, the misfit task cannot be well
+ * accommodated in any of the cpus and there is no point in
+ * trying a load balance at this level
+ */
+ while (sd) {
+ if (sd->flags & SD_ASYM_CPUCAPACITY) {
+ set_sd_overutilized(sd);
+ break;
+ }
+ sd = sd->parent;
+ }
}
+
+ /* If the domain util is greater than domain capacity, load balancing
+ * needs to be done at the next sched domain level as well
+ */
+ if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+ set_sd_overutilized(env->sd->parent);
}
/**
@@ -8122,8 +8198,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
- goto out_balanced;
+ if (energy_aware()) {
+ if (!is_sd_overutilized(env->sd))
+ goto out_balanced;
+ }
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -8981,6 +9059,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (energy_aware()) {
+ if (!is_sd_overutilized(sd))
+ continue;
+ }
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9363,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct sched_domain *sd;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -9289,8 +9373,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!is_sd_overutilized(sd) &&
+ cpu_overutilized(task_cpu(curr)))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa98ab3..b24cefa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -563,9 +563,6 @@ struct root_domain {
/* Indicate more than one runnable task for any CPU */
bool overload;
- /* Indicate one or more cpus over-utilized (tipping point) */
- bool overutilized;
-
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
--
2.1.4
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor. However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency. The latter
is where the real overhead comes from. The former is much less
expensive in terms of execution time, and running it every time the
governor callback is invoked by the scheduler, once the rate_limit_us
interval has passed since the last frequency update, would not be a
problem.
For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
V1->V2: Update $subject and commit log (Rafael)
kernel/sched/cpufreq_schedutil.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd4659313640..306d97e7b57c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,14 +92,13 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
{
struct cpufreq_policy *policy = sg_policy->policy;
- sg_policy->last_freq_update_time = time;
-
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
return;
}
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
next_freq = cpufreq_driver_fast_switch(policy, next_freq);
if (next_freq == CPUFREQ_ENTRY_INVALID)
return;
@@ -108,6 +107,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(next_freq, smp_processor_id());
} else if (sg_policy->next_freq != next_freq) {
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
sg_policy->work_in_progress = true;
irq_work_queue(&sg_policy->irq_work);
}
--
2.7.1.410.g6faf27b
Sorry that I forgot to cc eas-dev list for this patch.
----- Forwarded message from Viresh Kumar <viresh.kumar(a)linaro.org> -----
Date: Wed, 15 Feb 2017 22:45:47 +0530
From: Viresh Kumar <viresh.kumar(a)linaro.org>
To: Rafael Wysocki <rjw(a)rjwysocki.net>, Ingo Molnar <mingo(a)redhat.com>, Peter Zijlstra <peterz(a)infradead.org>
Cc: linaro-kernel(a)lists.linaro.org, linux-pm(a)vger.kernel.org, linux-kernel(a)vger.kernel.org, Vincent Guittot <vincent.guittot(a)linaro.org>, Viresh Kumar <viresh.kumar(a)linaro.org>
Subject: [PATCH] cpufreq: schedutil: govern how frequently we change frequency with rate_limit
X-Mailer: git-send-email 2.7.1.410.g6faf27b
For an ideal system (where a frequency change doesn't incur any penalty)
we would like to change the frequency as soon as the load changes for a
CPU. But the systems we have to work with are far from ideal and it
takes time to change the frequency of a CPU. For many ARM platforms
especially, it is at least 1 ms. In order to not spend too much time
changing frequency, we earlier introduced a sysfs controlled
tunable for the schedutil governor: rate_limit_us.
Currently, rate_limit_us controls how frequently we reevaluate frequency
for a set of CPUs controlled by a cpufreq policy. But that may not be
the ideal behavior we want.
Consider for example the following scenario. The rate_limit_us tunable
is set to 10 ms. The CPU has a constant load X and that requires the
frequency to be set to Y. The schedutil governor changes the frequency
to Y, updates last_freq_update_time and we wait for 10 ms to reevaluate
the frequency again. After 10 ms, the schedutil governor reevaluates the
load and finds it to be the same. And so it doesn't update the
frequency, but updates last_freq_update_time before returning. Right
after this point, the scheduler puts more load on the CPU and the CPU
needs to go to a higher frequency Z. Because last_freq_update_time was
updated just now, the schedutil governor waits for an additional 10 ms
before reevaluating the load again.
Normally, the time it takes to reevaluate the frequency is negligible
compared to the time it takes to change the frequency. And considering
that in the above scenario, as we haven't updated the frequency for over
10ms, we should have changed the frequency as soon as the load changed.
This patch changes the way rate_limit_us is used, i.e. it now governs
"How frequently we change the frequency" instead of "How frequently we
reevaluate the frequency".
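The difference between the two meanings of rate_limit_us can be sketched
with a small userspace model in C. This is a simplified illustration and
not the governor code itself: struct sugov_sketch, sugov_sketch_update()
and the stamp_on_every_eval switch are hypothetical names introduced only
for this example; last_freq_update_time and next_freq mirror the field
names in the diff below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the per-policy governor state. */
struct sugov_sketch {
	uint64_t last_freq_update_time;  /* ns */
	uint64_t rate_limit_ns;
	unsigned int next_freq;
	bool stamp_on_every_eval;  /* true: old behavior, false: with this patch */
};

/*
 * Returns true if a frequency change is requested at time "now".
 * Old behavior: the timestamp is refreshed on every evaluation.
 * New behavior: the timestamp is refreshed only when the frequency
 * actually changes.
 */
static bool sugov_sketch_update(struct sugov_sketch *sg, uint64_t now,
				unsigned int wanted_freq)
{
	/* Rate limit not yet expired: skip the evaluation entirely. */
	if (now - sg->last_freq_update_time < sg->rate_limit_ns)
		return false;

	if (sg->stamp_on_every_eval)
		sg->last_freq_update_time = now;

	if (sg->next_freq == wanted_freq)
		return false;

	sg->next_freq = wanted_freq;
	sg->last_freq_update_time = now;
	return true;
}
```

Replaying the scenario above with a 10 ms limit: after an evaluation that
finds the load unchanged, a request for a new frequency 1 ms later is
rejected under the old behavior but accepted under the new one, because
more than 10 ms have passed since the last actual change.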
One may think that this change may have increased the number of times we
reevaluate the frequency after a period of rate_limit_us has expired
since the last change, if the load isn't changing. But that is protected
by the scheduler as normally it doesn't call into the schedutil governor
before 1 ms (Hint: "decayed" in update_cfs_rq_load_avg()) since the
last call.
Tests were performed with this patch on a dual-cluster (same frequency
domain), octa-core ARM64 platform (Hikey). Hackbench (Debian) and
Vellamo/Galleryfling (Android) didn't show much difference in
performance w/ or w/o this patch.
It's difficult to create a test case (tried rt-app as well) where this
patch will show a lot of improvement, as the target of this patch is a
real corner case, i.e. the current load is X (resulting in a freq change),
the load after rate_limit_us is also X, but right after that the load becomes Y.
Undoubtedly this patch would improve the responsiveness in such cases.
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
kernel/sched/cpufreq_schedutil.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd4659313640..306d97e7b57c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,14 +92,13 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
{
struct cpufreq_policy *policy = sg_policy->policy;
- sg_policy->last_freq_update_time = time;
-
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
return;
}
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
next_freq = cpufreq_driver_fast_switch(policy, next_freq);
if (next_freq == CPUFREQ_ENTRY_INVALID)
return;
@@ -108,6 +107,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(next_freq, smp_processor_id());
} else if (sg_policy->next_freq != next_freq) {
sg_policy->next_freq = next_freq;
+ sg_policy->last_freq_update_time = time;
sg_policy->work_in_progress = true;
irq_work_queue(&sg_policy->irq_work);
}
--
2.7.1.410.g6faf27b
----- End forwarded message -----
--
viresh
The current implementation of overutilization aborts energy aware
scheduling if any cpu in the system is overutilized. This patch introduces
an overutilization flag per sched domain level instead of a single
system-wide flag. Load balancing is done at the sched domain where any
of the cpus is overutilized. If energy aware scheduling is
enabled and no cpu in a sched domain is overutilized,
load balancing is skipped for that sched domain and energy aware
scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure
that is common across all the sched domains at a level. The new flag
is placed in this structure so that all the sched domains at the
same level share the flag. In case of an overutilized cpu, the flag gets
set at the level1 sched_domain. The flag at the parent sched_domain level gets
set in either of the following two scenarios:
1. There is a misfit task on one of the cpus in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity.
The flag is cleared if no cpu in a sched domain is overutilized.
Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 7 ++-
kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++-----------
3 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c5122e..971842a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1112,6 +1112,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ bool overutilized;
};
struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31a466f..e0a8758 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6659,11 +6659,10 @@ sd_init(struct sched_domain_topology_level *tl,
* For all levels sharing cache; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
sd->private = sdd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489f6d3..485f597 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4735,6 +4735,30 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ return sd->shared->overutilized;
+ else
+ return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = false;
+}
+
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4744,6 +4768,7 @@ static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
+ struct sched_domain *sd;
struct sched_entity *se = &p->se;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4787,9 +4812,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
- cpu_overutilized(rq->cpu))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!task_new && !is_sd_overutilized(sd) &&
+ cpu_overutilized(rq->cpu))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
hrtick_update(rq);
}
@@ -6173,8 +6201,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
unsigned long max_spare = 0;
struct sched_domain *sd;
- rcu_read_lock();
-
+ /* The rcu lock is/should be held in the caller function */
sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6212,8 +6239,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
}
unlock:
- rcu_read_unlock();
-
if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
return prev_cpu;
@@ -6247,10 +6272,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
&& cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
}
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
- return select_energy_cpu_brute(p, prev_cpu);
-
rcu_read_lock();
+ sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+ if (energy_aware() &&
+ !is_sd_overutilized(sd)) {
+ new_cpu = select_energy_cpu_brute(p, prev_cpu);
+ goto unlock;
+ }
+
+ sd = NULL;
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6315,6 +6346,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
+
+unlock:
rcu_read_unlock();
return new_cpu;
@@ -7366,6 +7399,7 @@ struct sd_lb_stats {
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7385,6 +7419,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7664,7 +7699,7 @@ group_type group_classify(struct sched_group *group,
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload, bool *overutilized)
+ bool *overload, bool *overutilized, bool *misfit_task)
{
unsigned long load;
int i, nr_running;
@@ -7699,8 +7734,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*overutilized = true;
+ /*
+ * If the cpu is overutilized and if there is only one
+ * current task in cfs runqueue, it is potentially a misfit
+ * task.
+ */
+ if (rq->cfs.h_nr_running == 1)
+ *misfit_task = true;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -7829,7 +7872,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
+ bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -7851,7 +7894,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload, &overutilized);
+ &overload, &overutilized,
+ &misfit_task);
if (local_group)
goto next_group;
@@ -7882,6 +7926,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -7895,14 +7940,27 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
-
- /* Update over-utilization (tipping point, U >= 0) indicator */
- if (env->dst_rq->rd->overutilized != overutilized)
- env->dst_rq->rd->overutilized = overutilized;
- } else {
- if (!env->dst_rq->rd->overutilized && overutilized)
- env->dst_rq->rd->overutilized = true;
}
+
+ if (overutilized)
+ set_sd_overutilized(env->sd);
+ else
+ clear_sd_overutilized(env->sd);
+
+ /*
+ * If there is a misfit task in one cpu in this sched_domain
+ * it is likely that the imbalance cannot be sorted out among
+ * the cpu's in this sched_domain. In this case set the
+ * overutilized flag at the parent sched_domain.
+ */
+ if (misfit_task)
+ set_sd_overutilized(env->sd->parent);
+
+ /* If the domain util is greater than domain capacity, load balancing
+ * needs to be done at the next sched domain level as well
+ */
+ if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+ set_sd_overutilized(env->sd->parent);
}
/**
@@ -8122,8 +8180,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
- goto out_balanced;
+ if (energy_aware()) {
+ if (!is_sd_overutilized(env->sd))
+ goto out_balanced;
+ }
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -8981,6 +9041,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (energy_aware()) {
+ if (!is_sd_overutilized(sd))
+ continue;
+ }
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
@@ -9280,6 +9345,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct sched_domain *sd;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -9289,8 +9355,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!is_sd_overutilized(sd) &&
+ cpu_overutilized(task_cpu(curr)))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
/*
--
2.1.4
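The domain-level check added to update_sd_lb_stats() in the patch above
compares total utilization against total capacity with a margin. The
following is a minimal user-space sketch of that arithmetic, not kernel
code. The helper name is hypothetical, and the margin value of 1280
(about 20% headroom) is an assumption, since the patch does not show the
definition of capacity_margin.

```c
#include <stdbool.h>

/* Assumed value: 1280/1024 = 1.25, i.e. util may use at most 80% of capacity. */
#define CAPACITY_MARGIN 1280UL

/*
 * Hypothetical helper mirroring the test in the patch: returns true when
 * the domain's total utilization, scaled by the margin, exceeds its total
 * capacity, meaning the parent sched domain should be flagged as well.
 */
static bool domain_needs_parent_balance(unsigned long total_capacity,
					unsigned long total_util)
{
	return total_capacity * 1024 < total_util * CAPACITY_MARGIN;
}
```

With two CPUs of capacity 1024 each (total 2048), the check trips once
total utilization goes above 1638, which is 80% of the capacity.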
Power Management and Scheduling in the Linux Kernel (OSPM-summit)
April 3-4, 2017
Scuola Superiore Sant'Anna (SSSA)
Pisa, Italy
http://retis.sssup.it/ospm-summit/
---
.:: FOCUS
Power management and scheduling techniques to reduce energy consumption while
meeting performance and latency requirements are receiving considerable
attention from the Linux Kernel development community.
The Power Management and Scheduling in the Linux Kernel summit (OSPM-summit)
aims to foster further interest and discussion.
.:: FORMAT
The summit is organized to cover two days of discussions and talks.
The first day is mainly focused on discussion and hacking sessions about
topics/patches that are already under review on the Linux kernel mailing lists,
and on debating and planning development tasks for more forward-looking work
items centred around power management in the Linux kernel. The list of topics
includes (but is not limited to):
* Energy Aware Scheduling: next steps and energy model expression;
* SCHED_DEADLINE reclaiming of unused bandwidth, coupling with schedutil
cpufreq governor and group scheduling support;
* fix the load metric exposed to cpuidle;
* IRQ prediction;
* ACPI power management: kernel/firmware bindings and development model;
The second day welcomes presentations from both end users and developers on
topics about power management and scheduling in Linux, including but not
limited to:
* Power management techniques
* Real-time and non real-time scheduling techniques
* Energy awareness
* Mobile/Server power management real-world use cases (successes and
failures)
* Power management and scheduling tooling (tracing, configuration,
integration testing, etc.)
Presentations can cover recently developed technologies, ongoing work and new
ideas. Please understand that this workshop is not intended for presenting
sales and marketing pitches.
.:: ATTENDING
Attending the OSPM-summit is free of charge, but registration to the event is
mandatory. The event can allow a maximum of 50 people (so, be sure to register
early!). To register send an email to ospm-registration(a)retis.sssup.it. While
it is not strictly required to submit a topic/presentation, registrations with
a topic/presentation proposal will take precedence.
.:: VENUE
The workshop will take place at ReTiS Lab*, Scuola Superiore Sant'Anna, Pisa,
Italy. Pisa is a small town: the venue is a 20-minute walk from the city
center, and the city center is a 30-minute walk from the airport. More details
are available from the summit web page:
http://retis.sssup.it/ospm-summit/
* https://goo.gl/maps/2pPXG2v7Lfp
.:: SUBMIT A TOPIC/PRESENTATION
To submit a topic/presentation send an email to
ospm-registration(a)retis.sssup.it specifying:
subject
- [TOPIC] or [PRESENTATION]
- short title
body
- first name, family name
- abstract/topic of interest
- affiliation (if any)
- short biography
- expected duration (only for topics, presentations get 30min slots)
Deadline for submitting topics/presentations is 26th of February 2017.
Notifications for accepted topics/presentations will be sent out on 5th of
March 2017.
.:: ORGANIZERS (in alphabetical order)
Luca Abeni (SSSA)
Patrick Bellasi (ARM)
Tommaso Cucinotta (SSSA)
Dietmar Eggemann (ARM)
Sudeep Holla (ARM)
Juri Lelli (ARM)
Lorenzo Pieralisi (ARM)
Morten Rasmussen (ARM)
This patch series improves load balancing behaviour for misfit tasks.
The current code introduces the type 'group_misfit_task' to indicate
that a schedule group has a misfit task, but several barriers must be
cleared before the misfit task can actually be migrated onto a higher
capacity CPU.
The first patch corrects task_fits_max() so it can properly filter out
a misfit task on a low capacity CPU. Without this patch, the function
can always return true on some systems, so the 'misfit' task mechanism
is never triggered.
The second patch fixes group_smaller_cpu_capacity() so that a schedule
group of type 'group_misfit_task' is not wrongly rolled back to type
'group_other', which would cause all misfit related info to be
abandoned.
The third patch fixes nr_running accounting. Without this patch the
scheduler wrongly considers the destination CPU to have a running task
and skips migrating a task onto it. The patch reports correctly that
the destination CPU has no running task when the CPU is going into the
idle state, so this balance pass can be used to migrate the misfit task.
The fourth patch is a temporary patch for kernels that have not
backported Vincent's patches "sched: reflect sched_entity move into
task_group's load" [1]. Without that patch series, it is possible for a
CPU not to be overutilized while a misfit task has been enqueued on it.
So we set sgs->group_misfit_task by checking rq->misfit_task rather than
relying on whether the cpu is overutilized.
The fifth patch selects the busiest rq based on whether the rq has a
misfit task: such an rq gets higher priority than the rq with the
highest weighted load. This criterion is only enabled for energy aware
scheduling.
The sixth patch aggressively kicks active load balance for a misfit
task, so a higher capacity CPU has a good chance of pulling the misfit
task immediately.
[1] https://lkml.org/lkml/2016/10/17/223
Leo Yan (6):
sched/fair: correct task_fits_max() for misfit task
sched/fair: fix for group_smaller_cpu_capacity()
sched/fair: fix nr_running accounting for new idle CPU
sched/fair: fix to set sgs->group_misfit_task
sched/fair: select busiest rq with misfit task
sched/fair: kick active load balance for misfit task
kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 46 insertions(+), 13 deletions(-)
--
2.7.4
Hello all,
I have an x86 based platform which is running android. I wanted to
play around with the EAS patches to see if it would improve power
numbers on it.
I had a few basic questions regarding this:
1) Can EAS be used with x86 based platforms ? I see some arm/arm64
energy model related patches in the eas integration tree
(git://linux-arm.org/linux-power.git). However, there aren't any x86
specific changes present. Is that because no x86 specific changes are
required or just that it is untested there ?
2) Is it expected that EAS would show significant power savings on SMP
systems or just on HMP systems ?
3) Would any cpufreq/cpuidle integration be required for x86
specifically ? If so, would I need to base it on the arm code, or is
there any other reference code ?
4) Are there other in-flight patches that need to be applied over the
patches in the eas integration tree for best results ?
If the EAS patches can indeed be used on x86, then I would be
interested in integrating them and providing results from my platform.
Please advise.
Regards,
Darren
Hello,
I'm pleased to announce that we have pushed a very early version of
some of the key features we intend to make available as EAS 1.2 this
year to Google's msm repository
( https://android.googlesource.com/kernel/msm.git/ ) as
android-msm-marlin-3.18-nougat-mr1-eas-experimental.
EAS 1.2 is intended to be the next iteration of EAS for AOSP,
including improvements to the wakeup path to better support
big.LITTLE and trialling other upstream scheduler enhancements such
as schedutil along with some important load/util tracking enhancements
to PELT.
Although EAS 1.2 will be primarily focused on a 4.4-based kernel, we
are making this experimental branch available on the 3.18-based Pixel
kernel (marlin_defconfig) in order that we have a readily-available
real platform with an optimised userspace for experimentation.
There are some differences in the scheduler task wake-up path between
this release and that shipping in the Pixel kernel which should be
taken into account when using this kernel.
The most visible change in the wake-up path is the removal of the
is_big_little sysctl. Wake-up now uses a single cpu selection
algorithm (the same one used previously for !isBigLittle) but modified
to remove the assumption that the highest capacity cpus have the
highest logical cpu number. We now allow cpu topology independent
selection of max capacity cpus for tasks which belong to a schedtune
group which has some boost applied irrespective of the cpu numbering.
This changes the iteration order of cpus when looking for a place to
run these tasks from [3,2],[1,0] to [2,3], [0,1]. This has an impact
on runtime configuration. Not making a change to this configuration is
likely to have a small impact for lightly-loaded systems where there
will usually be two idle high-capacity cpus, but we should anyway
match cpuset configuration to the selection ordering to restore the
expectations used when tuning.
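One possible way to obtain the new [2,3], [0,1] visit order without
assuming that the highest capacity cpus have the highest cpu numbers is
sketched below. This is an illustration only, not the code in
find_best_target(): the function names are hypothetical, the capacity
values in the usage note are made up, and nr_cpus is assumed to be at
least 1.

```c
#include <stddef.h>

/* Return the lowest-numbered cpu that has the highest capacity. */
static size_t first_max_capacity_cpu(const unsigned long *capacity,
				     size_t nr_cpus)
{
	size_t best = 0;

	for (size_t cpu = 1; cpu < nr_cpus; cpu++)
		if (capacity[cpu] > capacity[best])
			best = cpu;
	return best;
}

/*
 * Fill order[] with the cpus in the order they would be visited for a
 * boosted task: start at the first max capacity cpu, walk upwards and
 * wrap around to cpu 0.
 */
static void boosted_visit_order(const unsigned long *capacity,
				size_t nr_cpus, size_t *order)
{
	size_t start = first_max_capacity_cpu(capacity, nr_cpus);

	for (size_t i = 0; i < nr_cpus; i++)
		order[i] = (start + i) % nr_cpus;
}
```

For a four-cpu system where cpus 2 and 3 have the higher capacity, this
visits 2, 3, 0, 1, which is why the top-app-only cpu moves to cpu 2.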
In Pixel, cpusets are arranged such that one of the highest capacity
cpus is available only to tasks belonging to the ‘top-app’ cpuset. In
combination with the cpu iteration order used for schedtune boosted
tasks, we hope to find an empty cpu more often for these tasks to wake
on. As a result of the changed iteration order, the top-app should now
be set to the lowest numbered high capacity cpu (in this case #2 for
Pixel). The impact of this is likely to be small for most light use
cases if not changed.
The usual group setup for Pixel is in init.sailfish.rc. The part which
configures the cpusets for the tuning groups is normally as follows:
on property:sys.boot_completed=1
write /proc/sys/kernel/sched_boost 0
# update cpusets now that boot is complete and we want better load balancing
write /dev/cpuset/top-app/cpus 0-3
write /dev/cpuset/foreground/boost/cpus 0-2
write /dev/cpuset/foreground/cpus 0-2
write /dev/cpuset/background/cpus 0
write /dev/cpuset/system-background/cpus 0-2
As we wish to make cpu 2 the one which is only available for tasks in
the top-app group, we should exclude cpu 2 from the other groups.
on property:sys.boot_completed=1
write /proc/sys/kernel/sched_boost 0
# update cpusets now that boot is complete and we want better load balancing
write /dev/cpuset/top-app/cpus 0-3
write /dev/cpuset/foreground/boost/cpus 0-1,3
write /dev/cpuset/foreground/cpus 0-1,3
write /dev/cpuset/background/cpus 0
write /dev/cpuset/system-background/cpus 0-1,3
We normally do this at run time in a root shell rather than modifying
the init scripts.
The schedutil governor is present but not selected as the default
cpufreq governor.
It is important to note that there is a slight difference in the
meaning of the up & down frequency select throttling for the 'sched'
governor (sched-dvfs) and 'schedutil'. The 'sched' governor considers
time to be measured since the last *frequency change* whilst the
'schedutil' governor considers the time to be measured since the last
*utilisation request*. This means that we need to shorten the throttle
periods used for schedutil when comparing it to sched-dvfs to avoid
staying at the maximum frequency for long periods in UI-driven
workloads.
We have been experimenting with up_rate_limit_usec set to 500 and
down_rate_limit_usec set to 2000 or 5000 which appears to give
results comparable with those of the 'sched' governor.
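The difference between the two throttling schemes can be sketched as
follows. This is a simplified user-space model of the behaviour
described above, not the governor code: the struct and function names
are hypothetical, and a single rate limit is used instead of separate
up and down limits.

```c
#include <stdbool.h>
#include <stdint.h>

struct throttle {
	uint64_t last_event_us;	/* timestamp the rate limit is measured from */
	uint64_t rate_limit_us;
};

/*
 * 'sched' (sched-dvfs) style: the window is measured from the last
 * frequency change, so a request that changes nothing does not restart it.
 */
static bool sched_dvfs_request(struct throttle *t, uint64_t now_us,
			       bool freq_would_change)
{
	if (now_us - t->last_event_us < t->rate_limit_us)
		return false;
	if (freq_would_change)
		t->last_event_us = now_us;
	return freq_would_change;
}

/*
 * 'schedutil' style: the window is measured from the last utilisation
 * request that was evaluated, even if it did not change the frequency.
 */
static bool schedutil_request(struct throttle *t, uint64_t now_us,
			      bool freq_would_change)
{
	if (now_us - t->last_event_us < t->rate_limit_us)
		return false;
	t->last_event_us = now_us;
	return freq_would_change;
}
```

With a 2000us limit, a no-change request at 2500us followed by a real
change at 3000us is applied by the first model but throttled by the
second, which is why shorter limits are needed for schedutil.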
The branch is based upon the mr1 kernel release, and contains the
patches shown at the end of this mail.
They comprise 6 main areas of functionality.
* ec114ba...d2238c2 and 8646350...35ea67a
patches to reduce the delta between the msm kernel and the common kernel
* b055eba...d2e2970
introduce a backport of the upstream schedutil governor (but it is not the default
governor in marlin_defconfig)
* 7f7e79e...14531d4e
bring the energy-aware-scheduling calculations into line with our
mainline-focused implementation and backport capacity-based-scheduling to 3.18
* b75b728...407d2a7
integrate the current EAS 1.1 wakeup path with the mainline-focused
wakeup path and introduce a way to provide a common algorithm implementing
the alternate CPU search algorithm for schedtune boosted tasks
* f966249...1ad6d08
Backport some important upstream CFS fixes to 3.18. This fixes some critical
group accounting issues which had a negative impact on the suitability of PELT
utilisation signals for Android
* 6ae4707
Allows EAS to continue to calculate energy for systems which end up with
a single CPU in a sched domain
Best Regards,
Chris
Amit Pundir (3):
sched/walt: use do_div instead of division operator
ANDROID: sched/walt: fix build failure if FAIR_GROUP_SCHED=n
Revert "cgroup: Fix issues in allow_attach callback"
Brendan Jackman (2):
DEBUG: sched/fair: Fix missing sched_load_avg_cpu events
DEBUG: sched/fair: Fix sched_load_avg_cpu events for task_groups
Chris Redpath (17):
Revert "WIP: UTIL_EST: use estimated utilization on load balancing paths"
Revert "WIP: UTIL_EST: use estimated utilization on energy aware wakeup path"
Revert "WIP: UTIL_EST: sched/fair: use estimated utilization to drive CPUFreq"
Revert "WIP: UTIL_EST: switch to usage of tasks's estimated utilization"
sched: revert UTIL_EST usage from commit 6bf72ca7f1
Revert "WIP: UTIL_EST: sched/{core,fair}: add support to use estimated utilization"
Revert "WIP: UTIL_EST: sched/fair: add support for estimated utilization"
sched/fair: missing parts of 'optimize idle cpu selection for boosted tasks'
sched/fair: Fix uninitialised variable in idle_balance
Revert: UTIL_EST code from 'fix set_cfs_cpu_capacity when WALT is in use"
Unify whitespace layout with android-3.18
schedtune: Guarding against compile errors
sched/walt: Drop arch-specific timer access
Revert "DEBUG: UTIL_EST: sched: update tracepoint to report estimated CPU utilzation"
sched: This kernel expects sched_cfs_boost to be signed
schedutil: Fix linkage of schedutil and walt
config: Update marlin_defconfig to include schedutil governor
Dietmar Eggemann (20):
Revert "WIP: sched: Consider spare cpu capacity at task wake-up"
Partial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing"
Experimental! arm64: Set SD_SHARE_CAP_STATES sched_domain flag on DIE level
Experimental!: sched/fair: Do not force want_affine eq. true if EAS is enabled
Experimental!: sched/fair: Decommission energy_aware_wake_cpu()
Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
Experimental!: EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
Experimental!: sched/fair: Code !is_big_little path into select_energy_cpu_brute()
Experimental!: sched: Remove sysctl_sched_is_big_little
sched/core: Remove remnants of commit fd5c98da1a42
Experimental!: sched/core: Add first cpu w/ max/min orig capacity to root domain
Experimental!: sched/fair: Change cpu iteration order in find_best_target()
sched/fair: Simplify backup_capacity handling in find_best_target()
Fixup!: sched/fair: Simplify target_util handling in find_best_target()
Fixup!: sched/fair: Simplify idle_idx handling in find_best_target()
Fixup!: sched/fair: Refactor min_util, new_util in find_best_target()
Fixup!: sched/fair: Simplify idle_idx handling in select_idle_sibling()
Fixup!: Return first idle cpu for prefer_idle task immediately
Fixup!: sched/fair: No need to 'and' current cpu w/ online mask in wakeup
sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
Dmitry Shmidt (1):
sched: Fix sysctl_sched_cfs_boost type to be int
Juri Lelli (3):
sched/cpufreq: make schedutil use WALT signal
trace/sched: add rq utilization signal for WALT
sched/walt: kill {min,max}_capacity
Ke Wang (1):
sched: tune: Fix lacking spinlock initialization
Morten Rasmussen (15):
sched/core: Fix power to capacity renaming in comment
sched/fair: Make the use of prev_cpu consistent in the wakeup path
sched/fair: Optimize find_idlest_cpu() when there is no choice
sched/core: Remove unnecessary NULL-pointer check
sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
sched/core: Pass child domain into sd_init()
sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems
sched/fair: Let asymmetric CPU configurations balance at wake-up
sched/fair: Compute task/cpu utilization at wake-up correctly
sched/fair: Consider spare capacity in find_idlest_group()
sched/fair: Add per-CPU min capacity to sched_group_capacity
sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
sched/fair: Fix incorrect comment for capacity_margin
Experimental!: sched/fair: Add energy_diff dead-zone margin
Experimental!: sched/fair: Energy-aware wake-up task placement
Patrick Bellasi (3):
FIXUP: sched/tune: update accouting before CPU capacity
FIX: sched/tune: move schedtune_nornalize_energy into fair.c
sched/tune: backport 'fix accounting for runnable tasks'
Peter Zijlstra (Intel) (3):
sched/fair: Apply more PELT fixes
sched/fair: Improve PELT stuff some more
sched/fair: Fix effective_load() to consistently use smoothed load
Petr Mladek (1):
kthread: allow to cancel kthread work
Srinath Sridharan (1):
eas/sched/fair: Fixing comments in find_best_target.
Steve Muckle (5):
sched/cpufreq: fix tunables for schedfreq governor
sched: backport cpufreq hooks from 4.9-rc4
sched: backport schedutil governor from 4.9-rc4
sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
cpufreq: schedutil: add up/down frequency transition rate limits
Vincent Guittot (6):
sched: factorize attach entity
sched: factorize PELT update
sched: fix hierarchical order in rq->leaf_cfs_rq_list
sched: propagate load during synchronous attach/detach
sched: propagate asynchrous detach
sched: Multiple upstream load tracking changes
Viresh Kumar (1):
cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
Yuyang Du (1):
sched/fair: Initiate a new task's util avg to a bounded value
kbuild test robot (2):
ANDROID: sched/tune: __pcpu_scope_cpu_boost_groups can be static
ANDROID: sched/tune: schedtune_allow_attach() can be static
arch/arm64/configs/marlin_defconfig | 2 +-
arch/arm64/kernel/topology.c | 7 +-
drivers/cpufreq/Kconfig | 27 +
drivers/cpufreq/Makefile | 2 +-
drivers/cpufreq/cpufreq.c | 32 +
drivers/cpufreq/cpufreq_governor_attr_set.c | 84 ++
include/linux/cgroup.h | 2 +-
include/linux/cpufreq.h | 49 ++
include/linux/kthread.h | 4 +
include/linux/sched.h | 20 +-
include/linux/sched/sysctl.h | 7 +-
include/trace/events/sched.h | 22 +-
init/Kconfig | 1 +
kernel/kthread.c | 96 +-
kernel/sched/Makefile | 2 +
kernel/sched/core.c | 84 +-
kernel/sched/cpufreq.c | 63 ++
kernel/sched/cpufreq_sched.c | 220 ++---
kernel/sched/cpufreq_schedutil.c | 762 ++++++++++++++++
kernel/sched/deadline.c | 3 +
kernel/sched/debug.c | 4 -
kernel/sched/fair.c | 1254 ++++++++++++++++++---------
kernel/sched/features.h | 5 -
kernel/sched/rt.c | 3 +
kernel/sched/sched.h | 84 +-
kernel/sched/tune.c | 5 +-
kernel/sched/tune.h | 3 +
kernel/sched/walt.c | 52 +-
kernel/sysctl.c | 7 -
29 files changed, 2261 insertions(+), 645 deletions(-)
create mode 100644 drivers/cpufreq/cpufreq_governor_attr_set.c
create mode 100644 kernel/sched/cpufreq.c
create mode 100644 kernel/sched/cpufreq_schedutil.c
--
1.9.1
Hi Guys,
All of this work was done by Steve before he left. I have made very
minor changes, merged a few patches, and rebased over 4.10-rc5.
More details can be found here:
https://projects.linaro.org/browse/PMWG-1018
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently on
mainline tip, callbacks into schedutil are only made from the scheduler
if the target CPU of the event is the same as the current CPU. This
means there are certain situations where a target CPU may not run
schedutil for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that new tasks should receive maximum demand
initially, this should result in CPU0 increasing frequency immediately.
Because of the above mentioned limitation though this does not occur.
This patchset defers the callback into schedutil if the callback would
be remote (not for a CPU in the policy of which we are running). If
there is no preemption required by the wakeup a late callback into
schedutil is made, and schedutil is modified to be able to correctly
deal with remote callbacks. If preemption does occur then the scheduler,
and schedutil, will run on the remote CPU anyway.
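The notion of a "remote" callback used above can be sketched as below.
This is a minimal user-space model under stated assumptions, not the
scheduler code: the policy is represented by a hypothetical struct
holding a bitmask of the cpus that share it, and cpu numbers are
assumed to be below 32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpufreq policy: bit n set means cpu n shares it. */
struct policy_sketch {
	uint32_t cpus;
};

/*
 * A callback is remote when the cpu we are currently running on is not
 * one of the cpus covered by the policy of the target cpu.
 */
static bool callback_is_remote(const struct policy_sketch *target_policy,
			       unsigned int this_cpu)
{
	return !(target_policy->cpus & (UINT32_C(1) << this_cpu));
}
```

For a policy covering cpus 0 and 1, a wakeup processed on cpu 1 is
local, while the same wakeup processed on cpu 2 is remote and would be
deferred.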
I will be doing further testing on this to get more performance numbers;
I just wanted to get some early responses, so I am sending it to the EAS
list.
--
viresh
Steve Muckle (9):
sched: cpufreq: add cpu to update_util_data
irq_work: add irq_work_queue_on for !CONFIG_SMP
sched: cpufreq: extend irq work to support fast switches
sched: cpufreq: remove smp_processor_id() in remote paths
sched: create late cpufreq callback
sched: cpufreq: detect, process remote callbacks
cpufreq: governor: support scheduler cpufreq callbacks on remote CPUs
intel_pstate: ignore scheduler cpufreq callbacks on remote CPUs
sched: cpufreq: enable remote sched cpufreq callbacks
drivers/cpufreq/cpufreq_governor.c | 2 +-
drivers/cpufreq/intel_pstate.c | 3 ++
include/linux/irq_work.h | 7 ++++
include/linux/sched.h | 1 +
kernel/sched/core.c | 4 ++
kernel/sched/cpufreq.c | 1 +
kernel/sched/cpufreq_schedutil.c | 80 +++++++++++++++++++++++++++-----------
kernel/sched/fair.c | 6 ++-
kernel/sched/sched.h | 24 +++++++++++-
9 files changed, 102 insertions(+), 26 deletions(-)
--
2.7.1.410.g6faf27b
The current implementation of overutilization aborts energy aware
scheduling if any cpu in the system is overutilized. This patch introduces
an overutilization flag per sched group instead of a single system-wide
flag. Load balancing is done at a sched domain where any of the sched
groups is overutilized. If energy aware scheduling is enabled and no
sched group in a sched domain is overutilized, load balancing is skipped
for that sched domain and energy aware scheduling continues at that level.
The implementation is based on two points
1. For every cpu in every sched domain the first group
is the group that contains the cpu itself.
2. sched groups are shared between cpus.
Thus if a sched group is overutilized, the overutilized flag is
set at the first sched group of the parent sched domain. This ensures
load balancing at the overutilized sched domain level.
For example, consider a big.LITTLE system with two little cpus (CPU A and
CPU B) and two big cpus (CPU C and CPU D). In this system, the hierarchy is
as follows:
CPU A
SD level 1 - SG1(CPU A), SG2(CPU B)
SD level 2 - SG5(CPU A, CPU B), SG6(CPU C, CPU D)
RD
CPU B
SD level 1 - SG2(CPU B), SG1(CPU A)
SD level 2 - SG5(CPU A, CPU B), SG6(CPU C, CPU D)
RD
CPU C
SD level 1 - SG3(CPU C), SG4(CPU D)
SD level 2 - SG6(CPU C, CPU D), SG5(CPU A, CPU B)
RD
CPU D
SD level 1 - SG4(CPU D), SG3(CPU C)
SD level 2 - SG6(CPU C, CPU D), SG5(CPU A, CPU B)
RD
In the above system, if CPU A is overutilized, the overutilized flag is
set at SG5 (the first sched group of the parent sched domain). Similarly,
if CPU B is overutilized, the flag is set at SG5. During load balancing
at SD level 1, the overutilized flag is checked at the first sched group
of the parent sched domain (SG5). If there is no parent sched domain, the
flag is set/checked at the root domain. This ensures that load balancing
happens irrespective of which cpu is overutilized in a sched domain.
Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 1 +
2 files changed, 90 insertions(+), 19 deletions(-)
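The flag placement described in the commit message can be modelled
outside the kernel. The sketch below is a user-space illustration, not
part of the patch: the *_sketch struct names and the overutilized_flag()
helper are hypothetical, and only the fields needed for the lookup are
modelled.

```c
#include <stdbool.h>
#include <stddef.h>

struct sched_group_sketch { bool overutilized; };
struct root_domain_sketch { bool overutilized; };

struct sched_domain_sketch {
	struct sched_domain_sketch *parent;
	struct sched_group_sketch *groups;	/* first group: contains this cpu */
};

/*
 * Locate the flag for a domain: the first group of the parent domain if
 * there is a parent, otherwise the root domain. May return NULL.
 */
static bool *overutilized_flag(struct sched_domain_sketch *sd,
			       struct root_domain_sketch *rd)
{
	if (sd && sd->parent)
		return &sd->parent->groups->overutilized;
	return rd ? &rd->overutilized : NULL;
}

static void set_overutilized(struct sched_domain_sketch *sd,
			     struct root_domain_sketch *rd)
{
	bool *flag = overutilized_flag(sd, rd);

	if (flag)
		*flag = true;
}

static bool is_overutilized(struct sched_domain_sketch *sd,
			    struct root_domain_sketch *rd)
{
	bool *flag = overutilized_flag(sd, rd);

	return flag ? *flag : false;
}
```

Because SG5 is shared by CPU A and CPU B, setting the flag through
CPU A's level 1 domain makes it visible through CPU B's level 1 domain,
while the level 2 domains (no parent) fall back to the root domain flag.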
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01fa969..0c97e0a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4559,6 +4559,36 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd, struct root_domain *rd)
+{
+ if (sd && sd->parent)
+ return sd->parent->groups->overutilized;
+
+ if (!rd)
+ return false;
+
+ return rd->overutilized;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd, struct root_domain *rd)
+{
+ if (sd && sd->parent)
+ sd->parent->groups->overutilized = true;
+ else if (rd)
+ rd->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd, struct root_domain *rd)
+{
+ if (sd && sd->parent)
+ sd->parent->groups->overutilized = false;
+ else if (rd)
+ rd->overutilized = false;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4568,6 +4598,7 @@ static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
+ struct sched_domain *sd;
struct sched_entity *se = &p->se;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4603,9 +4634,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
- cpu_overutilized(rq->cpu))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!task_new && !is_sd_overutilized(sd, rq->rd) &&
+ cpu_overutilized(rq->cpu))
+ set_sd_overutilized(sd, rq->rd);
+ rcu_read_unlock();
}
hrtick_update(rq);
}
@@ -5989,8 +6023,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
unsigned long max_spare = 0;
struct sched_domain *sd;
- rcu_read_lock();
-
sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6028,7 +6060,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
}
unlock:
- rcu_read_unlock();
if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
return prev_cpu;
@@ -6063,10 +6094,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
&& cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
}
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
- return select_energy_cpu_brute(p, prev_cpu);
-
rcu_read_lock();
+ sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+ if (energy_aware() &&
+ !is_sd_overutilized(sd,
+ cpu_rq(cpu)->rd)) {
+ new_cpu = select_energy_cpu_brute(p, prev_cpu);
+ goto unlock;
+ }
+
+ sd = NULL;
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6131,6 +6168,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
+
+unlock:
rcu_read_unlock();
return new_cpu;
@@ -7178,6 +7217,7 @@ struct sd_lb_stats {
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7197,6 +7237,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7692,6 +7733,7 @@ next_group:
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -7701,17 +7743,26 @@ next_group:
env->src_grp_nr_running = sds->busiest_stat.sum_nr_running;
+ /* Setting overutilized flag might not be necessary here
+ * Revisit
+ */
if (!lb_sd_parent(env->sd)) {
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
+ }
- /* Update over-utilization (tipping point, U >= 0) indicator */
- if (env->dst_rq->rd->overutilized != overutilized)
- env->dst_rq->rd->overutilized = overutilized;
- } else {
- if (!env->dst_rq->rd->overutilized && overutilized)
- env->dst_rq->rd->overutilized = true;
+ if (overutilized)
+ set_sd_overutilized(env->sd, env->dst_rq->rd);
+
+ /* If the domain util is greater than domain capacity, load balancing
+ * needs to be done at the next sched domain level as well
+ */
+ if (sds->total_capacity * 1024 < sds->total_util * capacity_margin) {
+ /* If already at the highest domain nothing can be done */
+ if (env->sd->parent)
+ set_sd_overutilized(env->sd->parent,
+ env->dst_rq->rd);
}
}
@@ -7932,8 +7983,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
- goto out_balanced;
+ /* Is this check really required here?? Revisit */
+ if (energy_aware()) {
+ if (!is_sd_overutilized(env->sd, env->dst_rq->rd))
+ goto out_balanced;
+ }
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -8000,6 +8054,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
force_balance:
/* Looks like there is an imbalance. Compute it */
calculate_imbalance(env, &sds);
+
+ /* Is this the correct place to clear this flag? Should access
+ * to flag be locked? Revisit.
+ */
+ clear_sd_overutilized(env->sd, env->dst_rq->rd);
+
return sds.busiest;
out_balanced:
@@ -8790,6 +8850,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (energy_aware()) {
+ if (!is_sd_overutilized(sd, rq->rd))
+ continue;
+ }
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
@@ -9083,6 +9148,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct sched_domain *sd;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -9092,8 +9158,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!is_sd_overutilized(sd, rq->rd) &&
+ cpu_overutilized(task_cpu(curr)))
+ set_sd_overutilized(sd, rq->rd);
+ rcu_read_unlock();
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f99391d..90c48ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -913,6 +913,7 @@ struct sched_group {
unsigned int group_weight;
struct sched_group_capacity *sgc;
const struct sched_group_energy const *sge;
+ bool overutilized;
/*
* The CPUs this group covers.
--
2.1.4
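The domain-level check added to update_sd_lb_stats() above can be sketched in isolation. This is only an illustration: the helper name is hypothetical, and the value 1280 for capacity_margin (roughly a 20% margin) is an assumption about the tree this patch is based on, not something stated in the patch itself.

```c
#include <stdbool.h>

/* Assumed value of capacity_margin (about 20% headroom); not taken from the patch. */
#define TOY_CAPACITY_MARGIN 1280UL
#define TOY_CAPACITY_SCALE  1024UL

/*
 * Hypothetical helper mirroring the comparison in the patch:
 * true when the summed utilization of a sched domain, scaled by the
 * margin, exceeds the summed capacity, i.e. when the parent domain
 * should be marked overutilized as well.
 */
static bool toy_domain_needs_parent_balance(unsigned long total_capacity,
					    unsigned long total_util)
{
	return total_capacity * TOY_CAPACITY_SCALE <
	       total_util * TOY_CAPACITY_MARGIN;
}
```

With four CPUs of capacity 1024 each (total 4096) and the assumed margin, the tipping point sits at a total utilization of about 3277, i.e. 80% of the domain capacity.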
In the energy-aware wakeup path, the scheduler calculates the energy
difference between CPUs in order to select the most power-efficient
one. In some corner cases the task utilization is 0, which means the
task has only run for a very short time and has not crossed 1024us
(about 1ms). When the task utilization is 0 the energy difference is
also 0, so an unexpected target CPU is selected in these scenarios:
If the task previously ran on CPUA and CPUA is a higher capacity CPU,
calculating the energy difference between CPUA and a lower capacity
CPU (CPUB) gives energy_diff = 0. As a result the task sticks to CPUA
and misses the chance to migrate to CPUB.
If the energy difference is calculated between two CPUs with the same
capacity, the task always stays on the previous CPU, so the calculation
is pointless.
This patch checks whether the task's util_avg is 0 and, if so, directly
returns the 'target_cpu' selected by the power efficiency loop.
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1d5dad..7b1f65b 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5616,6 +5616,9 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
}
}
+ if (unlikely(!task_util(p)))
+ return target_cpu;
+
if (target_cpu != task_cpu(p)) {
struct energy_env eenv = {
.util_delta = task_util(p),
--
2.7.4
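The corner case described in this patch can be reproduced with a toy model. Everything below is a sketch under stated assumptions: the struct, the linear energy model and the numbers are made up for illustration, and the rule "stay on the previous CPU when the energy difference is >= 0" is assumed to match what energy_aware_wake_cpu() does after the hunk shown above.

```c
/* Toy CPU description; fields and values are illustrative only. */
struct toy_cpu {
	unsigned long capacity;   /* compute capacity, 1024 = biggest CPU */
	unsigned long busy_power; /* power drawn when fully busy */
};

/* Toy linear model: energy cost of placing `util` on target instead of prev. */
static long toy_energy_diff(const struct toy_cpu *prev,
			    const struct toy_cpu *target,
			    unsigned long util)
{
	long before = (long)(prev->busy_power * util / prev->capacity);
	long after = (long)(target->busy_power * util / target->capacity);

	return after - before;
}

/*
 * Pick a CPU for the waking task. With bail_on_zero_util set, a task
 * with zero utilization goes straight to target_cpu, as in the patch.
 */
static int toy_wake_cpu(int prev_cpu, const struct toy_cpu *prev,
			int target_cpu, const struct toy_cpu *target,
			unsigned long util, int bail_on_zero_util)
{
	if (bail_on_zero_util && util == 0)
		return target_cpu;
	/* Assumed rule: no energy saving means the task stays where it was. */
	if (target_cpu != prev_cpu && toy_energy_diff(prev, target, util) >= 0)
		return prev_cpu;
	return target_cpu;
}
```

In this model a task with util == 0 always yields a difference of 0, so without the early return it stays on the big CPU even though the LITTLE CPU was chosen by the selection loop.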
Hi Leo.
[CC'ing eas-dev].
As discussed, here's a useful breakdown of middleware configuration
examples and resources shared with us. The publicly available
resources provide additional context about how hinting is coupled to
SchedTune.
cpusets
https://android.googlesource.com/device/google/marlin/+/nougat-dr1-release/…
write /dev/cpuset/top-app/cpus 0-3
write /dev/cpuset/foreground/boost/cpus 0-2
write /dev/cpuset/foreground/cpus 0-2
write /dev/cpuset/background/cpus 0
write /dev/cpuset/system-background/cpus 0-2
cpuctl
https://android.googlesource.com/device/google/marlin/+/nougat-dr1-release/…
write /dev/cpuctl/cpu.shares 1024
write /dev/cpuctl/cpu.rt_runtime_us 800000
write /dev/cpuctl/cpu.rt_period_us 1000000
mkdir /dev/cpuctl/bg_non_interactive
chown system system /dev/cpuctl/bg_non_interactive/tasks
chmod 0666 /dev/cpuctl/bg_non_interactive/tasks
# 5.0 %
write /dev/cpuctl/bg_non_interactive/cpu.shares 52
write /dev/cpuctl/bg_non_interactive/cpu.rt_runtime_us 700000
write /dev/cpuctl/bg_non_interactive/cpu.rt_period_us 1000000
SchedTune
https://android.googlesource.com/device/google/marlin/+/nougat-dr1-release/…
write /dev/stune/foreground/schedtune.prefer_idle 1
write /dev/stune/top-app/schedtune.boost 10
write /dev/stune/top-app/schedtune.prefer_idle 1
PowerHAL
https://android.googlesource.com/device/google/marlin/+/nougat-dr1-release/…
Robin
This patch series backports the walt_prepare_migrate() and
walt_finish_migrate() functions from Vikram's latest WALT patch series
[1] to Android common kernel 4.4.
These two functions replace walt_fixup_busy_time(). As a result the
scheduler acquires the locks for the source rq and the destination rq
separately and no longer needs double_lock_balance(). This makes the
scheduler flow safer, because walt_prepare_migrate() and
walt_finish_migrate() let us ensure atomicity.
Thanks to Patrick and Vikram for their suggestions.
Leo Yan (3):
sched/walt: port walt_{prepare|finish}_migrate() functions
sched/core: fix atomicity broken issue
sched/walt: remove walt_fixup_busy_time()
kernel/sched/core.c | 10 ++++---
kernel/sched/deadline.c | 8 ++++++
kernel/sched/fair.c | 4 +--
kernel/sched/rt.c | 4 +++
kernel/sched/walt.c | 75 +++++++++++++++++++++++++++++++++++++------------
kernel/sched/walt.h | 6 ++--
6 files changed, 81 insertions(+), 26 deletions(-)
--
1.9.1
Neither sched_freq_tick_pelt() nor sched_freq_tick_walt() considers the
SchedTune boost margin when setting the CPU frequency. E.g. when a task
is enqueued onto the rq the boost margin is taken into account, but
when a tick is triggered a while later the code goes back to using the
original CPU utilization value instead of the boosted value.
Another error: the capacity request needs to be converted from a
normalized value to a ratio in the range [0..1024], where the ratio is
the capacity requirement relative to the CPU's maximum capacity.
This patch fixes these two errors. Please note that this patch does not
build successfully yet, because some code still needs to be reworked.
So I am sending it for discussion first; once we reach a conclusion I
will generate formal patches.
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
kernel/sched/core.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10f36e2..6f9433e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2947,28 +2947,32 @@ unsigned long sum_capacity_reqs(unsigned long cfs_cap,
static void sched_freq_tick_pelt(int cpu)
{
- unsigned long cpu_utilization = capacity_max;
+ unsigned long cpu_utilization = boosted_cpu_util(cpu);
unsigned long capacity_curr = capacity_curr_of(cpu);
struct sched_capacity_reqs *scr;
+ unsigned long req_cap;
scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
if (sum_capacity_reqs(cpu_utilization, scr) < capacity_curr)
return;
+ req_cap = cpu_utilization * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu);
+
/*
* To make free room for a task that is building up its "real"
* utilization and to harm its performance the least, request
* a jump to a higher OPP as soon as the margin of free capacity
* is impacted (specified by capacity_margin).
*/
- set_cfs_cpu_capacity(cpu, true, cpu_utilization);
+ set_cfs_cpu_capacity(cpu, true, req_cap);
}
#ifdef CONFIG_SCHED_WALT
static void sched_freq_tick_walt(int cpu)
{
- unsigned long cpu_utilization = cpu_util(cpu);
+ unsigned long cpu_utilization = boosted_cpu_util(cpu);
unsigned long capacity_curr = capacity_curr_of(cpu);
+ unsigned long req_cap;
if (walt_disabled || !sysctl_sched_use_walt_cpu_util)
return sched_freq_tick_pelt(cpu);
@@ -2983,12 +2987,14 @@ static void sched_freq_tick_walt(int cpu)
if (cpu_utilization <= capacity_curr)
return;
+ req_cap = cpu_utilization * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu);
+
/*
* It is likely that the load is growing so we
* keep the added margin in our request as an
* extra boost.
*/
- set_cfs_cpu_capacity(cpu, true, cpu_utilization);
+ set_cfs_cpu_capacity(cpu, true, req_cap);
}
#define _sched_freq_tick(cpu) sched_freq_tick_walt(cpu)
--
1.9.1
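The conversion added in both tick functions is plain integer arithmetic and can be checked outside the kernel. A minimal sketch follows; the helper name is hypothetical and the capacity value 446 for a LITTLE CPU is only an example figure, not taken from this patch.

```c
#define TOY_SCHED_CAPACITY_SCALE 1024UL

/*
 * Convert an absolute utilization into a ratio in [0..1024] relative
 * to the original (maximum) capacity of the CPU, as the patch does
 * with capacity_orig_of(cpu). Integer division truncates.
 */
static unsigned long toy_capacity_request(unsigned long util,
					  unsigned long capacity_orig)
{
	return util * TOY_SCHED_CAPACITY_SCALE / capacity_orig;
}
```

For example, a utilization of 223 on a CPU whose original capacity is 446 becomes a request of 512, i.e. half of that CPU's own maximum, whereas passing 223 through unchanged would ask for only about 22% of the scale.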
Hi Vincent,
as promised in our last 'technical sync-up meeting', here is some
feedback on your patch. The version of the patch is from March this
year, so a lot of stuff has changed in the meantime, but I hope this
feedback is still valuable.
You might already have addressed some of the issues in your current
rebase of the patch.
The overall idea seems to be to piggyback NOHZ_STATS_KICK onto the
NOHZ_BALANCE_KICK machinery so that the back-end (SCHED_SOFTIRQ) can
distinguish between a nohz stats update and a nohz balance.
-- Dietmar
On 14/03/16 09:55, Vincent Guittot wrote:
> Conflicts:
> kernel/sched/fair.c
> ---
>
> Hi Morten,
>
> I have finally been able to fix my connection issue. This patch uses the
> update_blocked_averages mechanism that is present in the ILB to ensure that the
> blocked load will be updated often enough to stay meaningful.
> There are still some parts that should be fixed, like the fixed 5 ticks after
> the next update to trigger an update.
>
> Vincent
>
>
> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 1 +
> 2 files changed, 61 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2d0df4..a716299 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5108,6 +5108,9 @@ static int get_cpu_usage(int cpu)
> return (usage * capacity) >> SCHED_LOAD_SHIFT;
> }
>
> +static inline bool nohz_stat_kick_needed(int cpu);
> +static void nohz_balancer_kick(bool only_update);
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5167,6 +5170,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> new_cpu = select_idle_sibling(p, new_cpu);
>
> +#ifdef CONFIG_NO_HZ_COMMON
> + if (nohz_stat_kick_needed(new_cpu))
> + nohz_balancer_kick(true);
> +#endif
Why do you request the stats update only in the select_idle_sibling() path?
> } else while (sd) {
> struct sched_group *group;
> int weight;
> @@ -7313,6 +7320,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> }
>
> group = find_busiest_group(&env);
> +
> + if (test_and_clear_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu))) {
> + ld_moved = 0;
> + goto out;
> + }
> +
> if (!group) {
> schedstat_inc(sd, lb_nobusyg[idle]);
> goto out_balanced;
> @@ -7585,8 +7598,9 @@ static int idle_balance(struct rq *this_rq)
> */
> this_rq->idle_stamp = rq_clock(this_rq);
>
> - if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !this_rq->rd->overload) {
> + if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu)) &&
> + (this_rq->avg_idle < sysctl_sched_migration_cost ||
> + !this_rq->rd->overload)) {
In case 'NOHZ_STATS_KICK' is set you want to call the
update_blocked_averages(this_cpu) below but do you also want to do the
actual idle load balancing?
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> @@ -7639,6 +7653,8 @@ static int idle_balance(struct rq *this_rq)
>
> raw_spin_lock(&this_rq->lock);
>
> + clear_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu));
> +
> if (curr_cost > this_rq->max_idle_balance_cost)
> this_rq->max_idle_balance_cost = curr_cost;
>
> @@ -7776,6 +7792,9 @@ static inline int find_new_ilb(void)
> ilb = cpumask_first_and(sched_domain_span(sd),
> nohz.idle_cpus_mask);
>
Didn't compile for me. I can't find an sd in find_new_ilb() in mainline.
Did you add it in a previous patch?
> + if (ilb == smp_processor_id())
> + ilb = cpumask_next_and(ilb, sched_domain_span(sd),
> + nohz.idle_cpus_mask);
> if (ilb < nr_cpu_ids)
> break;
> }
> @@ -7793,7 +7812,7 @@ static inline int find_new_ilb(void)
> * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
> * CPU (if there is one).
> */
> -static void nohz_balancer_kick(void)
> +static void nohz_balancer_kick(bool only_update)
> {
> int ilb_cpu;
>
> @@ -7806,6 +7825,9 @@ static void nohz_balancer_kick(void)
>
> if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
> return;
> +
> + if(only_update)
> + set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> /*
> * Use smp_send_reschedule() instead of resched_cpu().
> * This way we generate a sched IPI on the target cpu which
> @@ -8000,6 +8022,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
> rcu_read_unlock();
>
> + /* clear any pending stats update request */
> + clear_bit(NOHZ_STATS_KICK, nohz_flags(cpu));
> /*
> * next_balance will be updated only when there is a need.
> * When the cpu is attached to null domain for ex, it will not be
> @@ -8019,11 +8043,14 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> int this_cpu = this_rq->cpu;
> struct rq *rq;
> int balance_cpu;
> + int update_stats_only = 0;
>
> if (idle != CPU_IDLE ||
> !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
> goto end;
>
> + if (test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu)))
> + update_stats_only = 1;
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> continue;
> @@ -8043,6 +8070,11 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> * do the balance.
> */
> if (time_after_eq(jiffies, rq->next_balance)) {
> +
> + /* only stats update is required */
> + if (update_stats_only)
> + set_bit(NOHZ_STATS_KICK, nohz_flags(balance_cpu));
Why don't you call update_blocked_averages(balance_cpu) here and skip the
rebalance_domains() call in case of update_stats_only = 1 (i.e. in case
NOHZ_STATS_KICK was set on this_cpu)?
I assume here that when NOHZ_STATS_KICK is set we really only want to do
the update and no actual load balancing.
> +
> raw_spin_lock_irq(&rq->lock);
> update_rq_clock(rq);
> update_idle_cpu_load(rq);
> @@ -8137,8 +8169,32 @@ static inline bool nohz_kick_needed(struct rq *rq)
> rcu_read_unlock();
> return kick;
> }
> +
> +static inline bool nohz_stat_kick_needed(int cpu)
> +{
> + unsigned long now = jiffies;
You don't bail here if rq->idle_balance is set like nohz_kick_needed() does?
> + /*
> + * None are in tickless mode and hence no need for NOHZ idle load
> + * balancing.
> + */
> + if (likely(!atomic_read(&nohz.nr_cpus)))
> + return false;
> +
> + if (time_before(now, nohz.next_balance+5))
> + return false;
> +
> + /* ensure that this cpu statistics will be updated */
> + set_bit(NOHZ_STATS_KICK, nohz_flags(cpu));
> +
> + return true;
> +}
> +
> +
> #else
> static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
> +static inline bool nohz_stat_kick_neede/d(int cpu) { return false }
> #endif
>
> /*
> @@ -8176,7 +8232,7 @@ void trigger_load_balance(struct rq *rq)
> raise_softirq(SCHED_SOFTIRQ);
> #ifdef CONFIG_NO_HZ_COMMON
> if (nohz_kick_needed(rq))
> - nohz_balancer_kick();
> + nohz_balancer_kick(false);
> #endif
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 676be22c..9cf53df 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1716,6 +1716,7 @@ extern void cfs_bandwidth_usage_dec(void);
> enum rq_nohz_flag_bits {
> NOHZ_TICK_STOPPED,
> NOHZ_BALANCE_KICK,
> + NOHZ_STATS_KICK,
> };
>
> #define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
>
Hi all,
While debugging the rb-tree related patches, I found it is easy to trigger
a panic in my rb-tree code. The simple pseudo code below demonstrates it:
detach_tasks()
node = rb_first(&env->src_rq->seq_node); -> 'node_prev'
while(node) {
se = rb_entry(node, struct sched_entity, seq_node);
node = rb_next(&se->seq_node); -> 'node_next'
if (balanced)
break;
if (meet_conditions_for_migration)
detach_task(se); -> Other CPU acquires src_rq lock
-> and remove 'node_next' firstly
else
continue;
}
In this flow detach_task() has been modified by the WALT patches: it
calls double_lock_balance(env->src_rq, env->dst_rq), which releases the
lock for the source rq and then acquires the source rq and destination
rq locks in a specific order to avoid a deadlock. But this gives other
CPUs a chance to acquire the source rq lock and remove node_next from
the rb tree, e.g. the corresponding task can be dequeued on any other
CPU (like CPU_B).
detach_tasks() then continues the iteration with 'node_next'. If
'node_next' meets the condition for detaching, detach_tasks() tries to
remove 'node_next' from the rb tree, but 'node_next' has already been
removed by CPU_B, which finally leads to a panic. Please see the
enclosed kernel log.
So essentially it is unsafe to release and re-acquire the rq lock while
the scheduler is iterating the lists/tree for the rq. But this code is
deliberately written for WALT to update the workload statistics of the
source rq and destination rq. So for now I can simply revert
double_lock_balance()/double_unlock_balance() when only PELT signals
are used, but for WALT I would like some suggestions for a fix. If we
confirm this is a real issue, it should exist on both Android common
kernel 3.18 and 4.4.
/*
* detach_task() -- detach the task for the migration specified in env
*/
static void detach_task(struct task_struct *p, struct lb_env *env)
{
lockdep_assert_held(&env->src_rq->lock);
deactivate_task(env->src_rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
double_lock_balance(env->src_rq, env->dst_rq);
set_task_cpu(p, env->dst_cpu);
double_unlock_balance(env->src_rq, env->dst_rq);
}
Thanks,
Leo Yan
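The hazard above comes from carrying a cached 'next' pointer across a point where the lock is dropped. One conservative pattern is to re-read the head of the container after every detach, so no pointer survives the unlocked window. The sketch below uses a plain singly linked list instead of an rb tree, and simulates the other CPU by removing a 'victim' node inside the detach step; all names are hypothetical and this is not the kernel code.

```c
#include <stddef.h>

struct toy_node {
	struct toy_node *next;
	int on_list;
	int movable;
};

struct toy_list {
	struct toy_node *first;
};

/* Unlink n from the list if it is still there; otherwise do nothing. */
static void toy_remove(struct toy_list *l, struct toy_node *n)
{
	struct toy_node **pp = &l->first;

	while (*pp && *pp != n)
		pp = &(*pp)->next;
	if (*pp) {
		*pp = n->next;
		n->next = NULL;
		n->on_list = 0;
	}
}

/*
 * Detach n. While the lock would be dropped, another CPU is simulated
 * by removing 'victim' (if any) behind the iterator's back.
 */
static void toy_detach(struct toy_list *l, struct toy_node *n,
		       struct toy_node *victim)
{
	toy_remove(l, n);
	if (victim)
		toy_remove(l, victim);
}

/* Detach every movable node, restarting from the head after each detach. */
static int toy_detach_tasks(struct toy_list *l, struct toy_node *victim)
{
	int detached = 0;
	struct toy_node *n = l->first;

	while (n) {
		if (!n->movable) {
			n = n->next; /* safe: nothing was unlocked */
			continue;
		}
		toy_detach(l, n, victim);
		victim = NULL;
		detached++;
		n = l->first; /* never reuse a pointer cached before the detach */
	}
	return detached;
}
```

The cost is that nodes which were skipped as not movable get visited again after every detach, which is the price of not trusting any pointer obtained before the lock was released.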
Setting a negative boost value impacts both task placement and OPP
selection. For task placement, the scheduler uses boosted_task_util()
to get a smaller utilization value for a negatively boosted task, which
gives the task more chance to fit a low capacity CPU; as a result
placement is biased towards low capacity CPUs (like the LITTLE cores in
an ARM big.LITTLE system). In the current code the wakeup path uses
this method to avoid migrating tasks with a negative boost value to a
big core, but the load balance flow has no check for tasks with a
negative boost value, so it can still migrate such tasks to a big core.
This patch checks for tasks with a negative boost value in the load
balance flow and avoids migrating them to a big CPU if the task fits a
low capacity CPU.
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
kernel/sched/fair.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77ca4df..c22d256 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6747,17 +6747,30 @@ static inline int migrate_degrades_locality(struct task_struct *p,
static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
- int tsk_cache_hot;
+ int tsk_cache_hot, boost;
+ unsigned long cpu_rest_util;
lockdep_assert_held(&env->src_rq->lock);
/*
* We do not migrate tasks that are:
- * 1) throttled_lb_pair, or
- * 2) cannot be migrated to this CPU due to cpus_allowed, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU.
+ * 1) task has negative boost value and task fits cpu, or
+ * 2) throttled_lb_pair, or
+ * 3) cannot be migrated to this CPU due to cpus_allowed, or
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU.
*/
+ if (energy_aware() &&
+ capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu)) {
+
+ boost = schedtune_task_boost(p);
+ cpu_rest_util = cpu_util(env->src_cpu) - task_util(p);
+ cpu_rest_util = max(0UL, cpu_rest_util);
+
+ if (boost < 0 && __task_fits(p, env->src_cpu, cpu_rest_util))
+ return 0;
+ }
+
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
--
1.9.1
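The new check in can_migrate_task() can be sketched as a standalone predicate. This is an illustration under assumptions: the helper names are hypothetical, the fit test (capacity * 1024 > util * 1280) is assumed to resemble what __task_fits() does, and the boost value is passed in directly instead of being read from SchedTune. Because both utilization values are unsigned, the sketch floors the subtraction at zero before it can wrap around.

```c
#include <stdbool.h>

/* a - b, floored at zero; avoids unsigned wrap-around when b > a. */
static unsigned long toy_sub_floor_zero(unsigned long a, unsigned long b)
{
	return a > b ? a - b : 0;
}

/* Assumed fit rule: task plus remaining load stays below ~80% of capacity. */
static bool toy_task_fits(unsigned long task_util, unsigned long rest_util,
			  unsigned long capacity)
{
	return capacity * 1024UL > (task_util + rest_util) * 1280UL;
}

/*
 * True when load balancing should NOT move the task: the destination
 * CPU is bigger, the task is negatively boosted, and it still fits on
 * the source CPU.
 */
static bool toy_veto_up_migration(int boost, unsigned long task_util,
				  unsigned long src_cpu_util,
				  unsigned long src_capacity,
				  unsigned long dst_capacity)
{
	unsigned long rest;

	if (dst_capacity <= src_capacity || boost >= 0)
		return false;

	rest = toy_sub_floor_zero(src_cpu_util, task_util);
	return toy_task_fits(task_util, rest, src_capacity);
}
```

With a source capacity of 446 and a destination capacity of 1024, a task of utilization 100 with boost -10 on a CPU at utilization 300 is kept on the LITTLE CPU, while the same task with a non-negative boost is left to the usual migration checks.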
o This patch series evaluates whether an rb tree can be used to track
task load and util on the rq. There are some concerns about this method:
An rb tree has O(log(N)) computational complexity, so maintaining the
rb tree introduces extra work. To address this concern I used
hackbench for stress testing: hackbench generates a large number of
message sender and receiver tasks, which leads to many enqueue and
dequeue operations, so it shows whether the rb tree introduces
significant overhead or not (thanks a lot to Chris for this
suggestion).
Another concern is that the scheduler already provides the LB_MIN
feature; with LB_MIN enabled the scheduler avoids migrating tasks with
load < 16. To some extent this also helps to filter out big tasks for
migration. So we need to compare power data between this patch series
and directly setting LB_MIN.
o Testing result:
Tested hackbench on Hikey with CA53x8 CPUs with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done'
          real      user      system
baseline  6m00.57s  1m41.72s  34m38.18s
rb tree   5m55.79s  1m33.68s  34m08.38s
For the hackbench test case, the rb tree kernel even shows a better
result than the baseline kernel.
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN     Nrg:LITTLE   Nrg:Big     Nrg:Sum
---------------------------------------------------------
           11.3122      8.983429    20.295629
           11.337446    8.174061    19.511507
           11.256941    8.547895    19.804836
           10.994329    9.633028    20.627357
           11.483148    8.522364    20.005512
avg.       11.2768128   8.7721554   20.0489682
rb tree    Nrg:LITTLE   Nrg:Big     Nrg:Sum
---------------------------------------------------------
           11.384301    8.412714    19.797015
           11.673992    8.455219    20.129211
           11.586081    8.414606    20.000687
           11.423509    8.64781     20.071319
           11.43709     8.595252    20.032342
avg.       11.5009946   8.5051202   20.0061148
vs LB_MIN  +1.99%       -3.04%      -0.21%
o Known issues:
For patch 2, detach_tasks() iterates over the rb tree of tasks. Once a
task has been detached, it calls rb_first() to fetch the first node
and iterates again from there. It would be better to use rb_next(),
but changing to rb_next() introduces a panic.
Any suggestions for a better implementation are welcome.
Leo Yan (3):
sched/fair: support to track biggest task on rq
sched/fair: select biggest task for migration
sched: remove unused rq::cfs_tasks
include/linux/sched.h | 1 +
include/linux/sched/sysctl.h | 1 +
kernel/sched/core.c | 2 -
kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 5 +-
kernel/sysctl.c | 7 +++
6 files changed, 116 insertions(+), 23 deletions(-)
--
1.9.1
o This patch series includes performance optimizations and some fixes.
One main purpose is to resolve performance issues for
multi-threading, which is done by patches 0001, 0003, 0005 and
0006; it also includes one main fix for the tipping point, which is
done by patch 0007.
o All these patches have been tested on a Juno R2 board. For the
performance optimization patches in particular, the testing results
are consistent and repeatable on the Juno board. This gives us more
confidence to upstream these patches into the Android common kernel
and the mainline kernel.
The testing environment is based on the ARM LT git tree:
https://git.linaro.org/landing-teams/working/arm/kernel-release.git
branch: origin/lsk-4.4-armlt-experimental
Test case: Geekbench with workload-automation
Test setting:
echo 0 > /proc/sys/kernel/sched_migration_cost_ns
echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain1/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain0/busy_factor
echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain1/busy_factor
o Test result:
Optimization with Patch 0001:
               baseline  Patch 0001       Opt.
Geekbench ST:  953.2     966.2            1.36%
Geekbench MT:  2175.8    2280.8           4.83%
Optimization with Patch 0003:
               baseline  Patch 0001+0003  Opt.
Geekbench ST:  953.2     969.2            1.68%
Geekbench MT:  2175.8    2356.8           8.32%
Optimization with all patches:
               baseline  All Patch        Opt.
Geekbench ST:  953.2     968.6            1.62%
Geekbench MT:  2175.8    2371.2           8.98%
For the performance improvement, the three main contributing patches are:
0001: ~4.83%, 0003: ~3.3%, 0005: ~0.7%.
One more thing to note: sched_migration_cost_ns usually also has a
big impact on multi-threading performance, but we cannot see a
prominent boost on the Juno board; the main reason is that the Juno
board has only 2 big cores.
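The percentages in the result tables are relative gains over a reference score and can be re-derived with a one-line helper. This is just a sketch with a hypothetical function name. Reading the ~3.3% figure for patch 0003 as the gain of 0001+0003 over 0001 alone is an interpretation of the tables, not something the cover letter states.

```c
/* Relative gain of `score` over `base`, in percent. */
static double toy_gain_percent(double base, double score)
{
	return (score - base) / base * 100.0;
}
```

For the multi-threaded Geekbench scores, toy_gain_percent(2175.8, 2280.8) is about 4.83 and toy_gain_percent(2280.8, 2356.8) is about 3.3.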
o Compared to the RFCv4 version [1], I have dropped all power optimization
related patches. Those patches are important for power saving, but
they contain a lot of hard-coded logic and are not general enough. So
I'd like to split them into a separate patch set.
[1] https://lists.linaro.org/pipermail/eas-dev/2016-September/000543.html
Leo Yan (7):
sched/fair: kick nohz idle balance for misfit task
sched/fair: replace capacity_of by capacity_orig_of
sched/fair: fall back to traditional wakeup migration when system is
busy
sched/fair: fix build error for schedtune_task_margin
sched/fair: force load balance when busiest group is overloaded
Documentation: use sysfs for EAS performance tunning
sched/fair: consider CPU overutilized only when it is not idle
Documentation/scheduler/sched-energy.txt | 24 ++++++++++++++
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++-----
2 files changed, 72 insertions(+), 9 deletions(-)
--
1.9.1
o This patch series optimizes power. Power optimization has to
address two aspects: on the one hand it should find a way to save
power and avoid unnecessary task migrations to the big cores; on the
other hand it must not degrade performance.
So this patch series is based on the performance optimization patch
series [1] and does further work for power saving, with the target
of optimizing power without degrading performance.
RFCv3 introduced power optimization related patches, but those
patches are not general enough. E.g., RFCv3 defines the criterion
for a small task as: task_util(p) < 1/4 * cpu_capacity(cpu).
This criterion is very hard to apply across all SoCs. This
patch series tries to figure out a more general method.
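To illustrate why the fixed 1/4 threshold does not transfer well between SoCs, here is a small sketch of the RFCv3 criterion. The function name and the capacity values are hypothetical, chosen only for the example; they are not taken from any particular platform.

```python
def is_small_task(task_util, cpu_capacity):
    """RFCv3-style criterion: task_util(p) < 1/4 * cpu_capacity(cpu).

    Written with integer arithmetic (util * 4 < capacity) to avoid
    rounding; both values are on the same capacity scale.
    """
    return task_util * 4 < cpu_capacity


if __name__ == "__main__":
    # The same task is "small" or not depending only on the CPU capacity,
    # which differs from SoC to SoC (values here are hypothetical).
    for capacity in (1024, 512):
        print(capacity, is_small_task(200, capacity))
```

The classification of one and the same task flips with the capacity of the CPU it is compared against, which is the portability problem described above.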
o Below is background info for the power optimization:
As the first step of power optimization, we should make sure the tasks
in a cluster can spread out. This has two benefits: one is that it
tries to decrease the frequency of every cluster; the other is that,
after spreading tasks within the cluster, it uses as much of the CPU
capacity as possible and avoids CPUs becoming overutilized, which in
turn avoids migrating tasks to big cores. This is done by patch 0001.
If there are big tasks that really need to migrate onto a big
core, we should ensure that the big tasks are migrated to the
big core first, rather than small tasks. So patch 0002 introduces an
rb tree to track the biggest task on the RQ, and patch 0003 uses the
rb tree to migrate the biggest tasks to a higher capacity CPU.
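The idea of tracking the biggest task per runqueue can be sketched as follows. This is not the patch's code: a sorted Python list stands in for the kernel rb tree, and the class and method names are invented for illustration.

```python
import bisect


class BiggestTaskTracker:
    """Keeps runnable tasks ordered by utilization (toy model of one rq).

    A sorted list replaces the rb tree used in the patch; the interface
    (enqueue/dequeue/biggest) is hypothetical.
    """

    def __init__(self):
        self._entries = []  # sorted list of (util, pid) tuples

    def enqueue(self, pid, util):
        bisect.insort(self._entries, (util, pid))

    def dequeue(self, pid, util):
        idx = bisect.bisect_left(self._entries, (util, pid))
        if idx < len(self._entries) and self._entries[idx] == (util, pid):
            del self._entries[idx]

    def biggest(self):
        """Return (util, pid) of the highest-utilization task, or None."""
        return self._entries[-1] if self._entries else None
```

With such a structure, a migration towards a higher capacity CPU can pick biggest() first instead of whichever task happens to be at the head of the queue.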
Patch 0004 has the most effect on power saving: it checks whether a
waking task can run on a low capacity CPU. If so, it forces the energy
aware scheduling path even when the system is over the tipping point.
The criterion for a waking task to run on a low capacity CPU is that
some CPU's spare bandwidth can meet the task's requirement; this ensures
that even when the task keeps running on a low capacity CPU, performance
is not sacrificed.
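The wakeup criterion described for patch 0004 can be written down as a small predicate. This is a sketch under assumptions: "spare bandwidth" is modelled as capacity minus current utilization, and the function name and the tuple layout are illustrative rather than taken from the patch.

```python
def can_run_on_low_capacity_cpu(task_util, low_capacity_cpus):
    """Return True if any low capacity CPU has enough spare bandwidth.

    low_capacity_cpus is an iterable of (capacity, current_util) pairs.
    Spare bandwidth is assumed to be capacity - current_util.
    """
    return any(capacity - util >= task_util
               for capacity, util in low_capacity_cpus)


if __name__ == "__main__":
    littles = [(446, 400), (446, 150)]  # hypothetical capacity/util values
    print(can_run_on_low_capacity_cpu(200, littles))
    print(can_run_on_low_capacity_cpu(350, littles))
```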
o Test result:
First I applied the patch series "EASv5.2+: Performance Optimization And
Fixes" and tested power and performance; then, on top of that code base,
I also applied this power saving patch series. Finally I compared the
power data and performance data.
For the power comparison the test case is video playback (1080p); below
are the results on the Juno board:

Items            | LITTLE Nrg | big Nrg   | Nrg
----------------------------------------------------------------
Perf opt         | 11.0520992 | 9.7118762 | 20.7639754
Perf + Power opt | 11.4157602 | 8.7319138 | 20.147674
Comparison       | +3.29%     | -10.09%   | -2.97%

[1] https://lists.linaro.org/pipermail/eas-dev/2016-October/000610.html
Leo Yan (4):
sched/fair: select lowest capacity CPU with packing tasks
sched/fair: support to track biggest task util on rq
sched/fair: migrate highest utilization task to higher capacity CPU
sched/fair: check if wakeup task can run low capacity CPU
include/linux/sched.h | 1 +
kernel/sched/fair.c | 213 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 4 +
3 files changed, 200 insertions(+), 18 deletions(-)
--
1.9.1
Hi Patrick,
This patch series mainly has two purposes.
The first purpose is to adjust the range of the capacity index so that
the capacity index and energy index have similar ranges.
This helps tasks fall into a more reasonable PE filter region. This
is done by patch 0001.
The second purpose is to support negative boosting values in the PE
filter, so that the schedTune algorithm is complete and supports both
positive and negative boosting values. As we know, if we set the boost
value to a positive value, the PE filter region rotates to the right
side, which gives more chance to the (PB) region and less chance to the
(PC) region, so finally we get the filter region below:
              ^
    (O)       |    /   (PB)
              |   /
              |  /
              | /  `-> cut
              |/
 -------------------------->
             /|
            / |
           /  |
          /   |
         /    |
    (PC)      |        (SO)
On the other hand, if the boost is set to a negative value, the PE filter
region should rotate to the left side, so we get the filter region
below. This is done by patches 0002~0006.
              ^
    (O)  \    |        (PB)
          \   |
           \  |
            \ |
             \|
 -------------------------->
              |\
              | \
              |  \
              |   \
              |    \
    (PC)      |     \  (SO)
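For readers less familiar with the four regions in the diagrams, the sketch below classifies a candidate by the signs of its energy and capacity deltas. It assumes the horizontal axis is the energy variation and the vertical axis is the capacity (performance) variation, which the diagrams do not label explicitly. The function name is illustrative, and the handling of zero deltas is an arbitrary choice for the example. The boost-dependent cut line itself is not modelled here.

```python
def pe_region(nrg_delta, cap_delta):
    """Map (energy delta, capacity delta) to a P-E space region name.

    Assumed axes: x = energy variation, y = capacity variation.
      O  : more capacity, less energy    (top-left)
      PB : more capacity, more energy    (top-right)
      PC : less capacity, less energy    (bottom-left)
      SO : less capacity, more energy    (bottom-right)
    """
    if cap_delta >= 0:
        return "O" if nrg_delta <= 0 else "PB"
    return "PC" if nrg_delta <= 0 else "SO"
```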
Patch 0007 is used to verify the PE filter table with LISA. I did some
testing on Hikey with TraceAnalysis::plotEDiffSpace() for PE filtering
and TraceAnalysis::plotTasks() for boosting signals; these tests
passed.
v1 -> v2:
* Refine patch 0001 to discount cap_delta in function energy_diff();
* Fix bug and typo in patch 0003;
* Refine patch 0004 to open the optimal and sub-optimal regions for
checking when CONFIG_CGROUP_SCHEDTUNE is disabled;
* Add patch 0006 to support negative values for sysctl_sched_cfs_boost;
* Add patch 0007 to trace energy_diff properly.
Leo Yan (7):
sched/fair: discount capacity index for PE filter
sched/tune: minor fix for gain table
sched/tune: polish for PE gain table index
sched/tune: open optimal and sub-optimal regions for checking
sched/tune: add PE filter support for negative boosting
sched/tune: let sysctl_sched_cfs_boost support negative value
DEBUG: sched/tune: move energy_diff trace point
include/linux/sched/sysctl.h | 6 +--
kernel/sched/fair.c | 29 +++++++---
kernel/sched/tune.c | 124 +++++++++++++++++--------------------------
kernel/sysctl.c | 5 +-
4 files changed, 76 insertions(+), 88 deletions(-)
--
1.9.1
Subject: Re: [Eas-dev] [RFC PATCH v1 0/3] sched: Introduce Window Assisted
Load Tracking
Reply-To:
In-Reply-To: <7a94b493-178a-e2ed-a39d-66a7105f566a(a)arm.com>
On 16-Sep 19:09, Dietmar Eggemann wrote:
> On 03/09/16 00:27, markivx(a)codeaurora.org wrote:
> > This patch series implements an alternative window assisted load tracking
> > mechanism in lieu of PELT based cpu utilization tracking. Testing has
> > shown that a window based non-decaying metric such as WALT guiding cpu
> > frequency and task placement decisions can improve performance/power
> > especially when running workloads more commonly found on mobile devices.
> > The aim of this series is to incorporate WALT accounting into the
> > scheduler and feed WALT statistics to schedutil in order to guide cpu
> > frequency selection. The implementation is detailed in the commit text
> > of Patch 1. The eventual goal is to also guide placement decisions
> > based on WALT statistics.
> >
> > WALT has existed in out-of-tree kernels for ARM/ARM64 commercialized
> > devices for a few years. This is an effort to bring WALT to mainline
> > as well as to test on multiple architectures and with varied workloads.
> >
> > This RFC version is mainly to preview what the code will look like on
> > mainline. Future RFC revisions will include a theoretical discussion and
> > benchmark results.
> >
> > Tested on an Intel x86_64 machine (on top of 4.7-rc6). (Benchmark
> > results will be sent out separately and as part of this message in the
> > next RFC version).
> >
> > Patch 1: Adds WALT tracking to the scheduler
> >
> > Patches 2-3: Temporary patches to bring in EAS/sched-freq like capacity
> > table and to use Intel PMC counters for more accurate
> > frequency invariant load tracking on X86. Included for
> > completeness but not meant for merging.
> >
> > include/linux/sched.h | 35 ++++++++++
> > include/linux/sched/sysctl.h | 2 +
> > include/trace/events/sched.h | 76 +++++++++++++++++++++
> > init/Kconfig | 9 +++
> > kernel/sched/Makefile | 1 +
> > kernel/sched/core.c | 29 ++++++++-
> > kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++-
> > kernel/sched/cputime.c | 11 +++-
> > kernel/sched/debug.c | 10 +++
> > kernel/sched/fair.c | 7 +-
> > kernel/sched/sched.h | 13 ++++
> > kernel/sched/walt.c | 580 ++++++++++++++++++++++++++++++++++
> > kernel/sched/walt.h | 75 +++++++++++++++++++++
> > kernel/sysctl.c | 18 +++++
> > 14 files changed, 904 insertions(+), 6 deletions(-)
> >
>
> I caught a WALT related hard lockup on a v4.7 kernel with only patch 1 on top. Fairly easy to reproduce by watching a video in firefox browser on Ubuntu 16.04.
>
> $ addr2line -e vmlinux ffffffff810d835e
> kernel/sched/sched.h:1542
>
> $ addr2line -e vmlinux ffffffff810d29b0
> kernel/sched/sched.h:1538
>
> 1531 static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
> 1532 __releases(this_rq->lock)
> 1533 __acquires(busiest->lock)
> 1534 __acquires(this_rq->lock)
> 1535 {
> 1536 int ret = 0;
> 1537
> 1538 if (unlikely(!raw_spin_trylock(&busiest->lock))) {
> 1539 if (busiest < this_rq) {
> 1540 raw_spin_unlock(&this_rq->lock);
> 1541 raw_spin_lock(&busiest->lock);
> 1542 raw_spin_lock_nested(&this_rq->lock,
> 1543 SINGLE_DEPTH_NESTING);
> 1544 ret = 1;
> 1545 } else
> 1546 raw_spin_lock_nested(&busiest->lock,
> 1547 SINGLE_DEPTH_NESTING);
> 1548 }
> 1549 return ret;
> 1550 }
To me this issue seems related to the one fixed by this patch
from Todd:
https://android.googlesource.com/kernel/common/+/ab1b90f03a063f4ef9899835e9…
We noticed an issue while working on AOSP v3.18 but it is potentially
still present in mainline kernels since the implementation of the
locking functions has not been updated.
Here is how Todd described a possible race condition:
Thanks for the review. I've convinced myself that getting to
move_queued_task() with the two cpus being the same is possible (but
probably rare) since there are races between normal scheduler
migration and the forced migration via the cpu_migration_thread. If
the thread migrates naturally from the src to the dest and does it
after the last check in __migrate_task, we get into this case. This
can happen since we drop the rq lock during double_lock_balance
allowing a migration behind our back while we are re-acquiring the rq
lock.
And here is a summary of the analysis we did:
1. The double_(un)lock_balance() calls are mainly used by the rt/deadline
code, where there are proper checks that the two RQs are not the
same.
They are never used by core/fair, where the double_rq_(un)lock()
calls are preferred.
2. All the usages of double_(un)lock_balance() in core/walt are
introduced by WALT related patches.
However, the invariant "RQs must not be the same" is not always
guaranteed in these paths.
3. The implementation of double_(un)lock_balance() in both the Android
kernel and mainline is "asymmetric". In the CONFIG_PREEMPT case at
least, the locking call is implemented using double_rq_lock(),
which provides the proper check on the RQs being different,
while this check is not present in the unlocking function.
Juri has also got confirmation from PeterZ that the double_(un)lock_balance()
functions are not to be used when we cannot guarantee that the RQs are different.
However, the asymmetry is still there, and thus this code deserves a patch
in mainline as well, namely the one Todd added to AOSP v3.18.
Perhaps a better solution for WALT would be to use the double_rq_(un)lock()
primitives instead of the double_(un)lock_balance() ones, which would also
make the code more aligned with the locking APIs already used in the core
scheduler.
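The difference between the two locking styles can be modelled in user space. The sketch below is a toy model, not kernel code: threading.Lock stands in for the rq spinlock, the CPU id replaces the address ordering the kernel uses, and the helper naive_second_lock_would_block() is invented to show the failure mode without actually hanging.

```python
import threading


class RunQueue:
    """Toy stand-in for struct rq: a CPU id plus a non-recursive lock."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()  # like a raw spinlock, not re-entrant


def naive_second_lock_would_block(this_rq, busiest):
    """Caller holds this_rq.lock. Report whether taking busiest.lock
    unconditionally would block forever because it is the same lock."""
    got_it = busiest.lock.acquire(blocking=False)  # models a trylock
    if got_it:
        busiest.lock.release()
        return False
    return busiest is this_rq


def double_rq_lock(rq1, rq2):
    """Lock one or two runqueues; safe when rq1 is rq2."""
    if rq1 is rq2:
        rq1.lock.acquire()
        return
    # The kernel orders by rq address; the CPU id is used here instead.
    first, second = (rq1, rq2) if rq1.cpu < rq2.cpu else (rq2, rq1)
    first.lock.acquire()
    second.lock.acquire()


def double_rq_unlock(rq1, rq2):
    """Release what double_rq_lock() took, again handling rq1 is rq2."""
    rq1.lock.release()
    if rq1 is not rq2:
        rq2.lock.release()
```

When both arguments are the same runqueue, the unconditional second acquisition never succeeds, which matches the "possible recursive locking" report in the quoted log; the double_rq_lock() style avoids it by checking for identity first.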
Cheers Patrick
> [ 118.795603] =============================================
> [ 118.795606] [ INFO: possible recursive locking detected ]
> [ 118.795609] 4.7.0-walt-v4 #3 Not tainted
> [ 118.795612] ---------------------------------------------
> [ 118.795615] rtkit-daemon/3133 is trying to acquire lock:
> [ 118.795619] (&rq->lock){-.-.-.}, at: [<ffffffff810d835e>] walt_fixup_busy_time+0x1ee/0x300
> [ 118.795635]
> [ 118.795635] but task is already holding lock:
> [ 118.795638] (&rq->lock){-.-.-.}, at: [<ffffffff810d29b0>] push_rt_task.part.39+0xb0/0x2a0
> [ 118.795650]
> [ 118.795650] other info that might help us debug this:
> [ 118.795653] Possible unsafe locking scenario:
> [ 118.795653]
> [ 118.795656] CPU0
> [ 118.795659] ----
> [ 118.795661] lock(&rq->lock);
> [ 118.795667] lock(&rq->lock);
> [ 118.795673]
> [ 118.795673] *** DEADLOCK ***
> [ 118.795673]
> [ 118.795676] May be due to missing lock nesting notation
> [ 118.795676]
> [ 118.795680] 1 lock held by rtkit-daemon/3133:
> [ 118.795682] #0: (&rq->lock){-.-.-.}, at: [<ffffffff810d29b0>] push_rt_task.part.39+0xb0/0x2a0
> [ 118.795692]
> [ 118.795692] stack backtrace:
> [ 118.795697] CPU: 1 PID: 3133 Comm: rtkit-daemon Not tainted 4.7.0-walt-v4 #3
> [ 118.795700] Hardware name: LENOVO 2537Z5F/2537Z5F, BIOS 6IET74WW (1.34 ) 10/25/2010
> [ 118.795703] 0000000000000000 ffff8800ad7e77a8 ffffffff8143001c ffffffff829e8fc0
> [ 118.795711] ffffffff829e8fc0 ffff8800ad7e7848 ffffffff810e5eab ffff880000000000
> [ 118.795722] 000000000003e01f ffffffff8235f800 ffff8800af3ccd40 000000000000032f
> [ 118.795729] Call Trace:
> [ 118.795735] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110
> [ 118.795736] in_atomic(): 1, irqs_disabled(): 1, pid: 3133, name: rtkit-daemon
> [ 118.795736] INFO: lockdep is turned off.
> [ 118.795737] irq event stamp: 330
> [ 118.795741] hardirqs last enabled at (329): [<ffffffff818b675c>] _raw_spin_unlock_irq+0x2c/0x40
> [ 118.795743] hardirqs last disabled at (330): [<ffffffff818b6eeb>] _raw_spin_lock_irqsave+0x2b/0x90
> [ 118.795747] softirqs last enabled at (0): [<ffffffff810827b1>] copy_process.part.30+0x5c1/0x1e60
> [ 118.795749] softirqs last disabled at (0): [< (null)>] (null)
> [ 118.795750] CPU: 1 PID: 3133 Comm: rtkit-daemon Not tainted 4.7.0-walt-v4 #3
> [ 118.795751] Hardware name: LENOVO 2537Z5F/2537Z5F, BIOS 6IET74WW (1.34 ) 10/25/2010
> [ 118.795753] 0000000000000001 ffff8800ad7e7390 ffffffff8143001c ffff8800af3ccd40
> [ 118.795754] ffffffff81ca0267 ffff8800ad7e73b8 ffffffff810b3490 ffffffff81ca0267
> [ 118.795756] 000000000000006e 0000000000000000 ffff8800ad7e73e0 ffffffff810b3599
> [ 118.795756] Call Trace:
> [ 118.795763] [<ffffffff8143001c>] dump_stack+0x85/0xc9
> [ 118.795766] [<ffffffff810b3490>] ___might_sleep+0x180/0x240
> [ 118.795768] [<ffffffff810b3599>] __might_sleep+0x49/0x80
> [ 118.795771] [<ffffffff810fc838>] synchronize_irq+0x38/0xa0
> [ 118.795772] [<ffffffff810fbdfe>] ? __irq_put_desc_unlock+0x1e/0x40
> [ 118.795774] [<ffffffff810fcae9>] ? __disable_irq_nosync+0x49/0x70
> [ 118.795775] [<ffffffff810fcb3c>] disable_irq+0x1c/0x30
> [ 118.795787] [<ffffffffc0172a02>] e1000_netpoll+0xf2/0x120 [e1000e]
> [ 118.795791] [<ffffffff817a3518>] netpoll_poll_dev+0x78/0x2c0
> [ 118.795793] [<ffffffff817a3900>] netpoll_send_skb_on_dev+0x1a0/0x290
> [ 118.795795] [<ffffffff817a3ccf>] netpoll_send_udp+0x2df/0x470
> [ 118.795798] [<ffffffffc012ab32>] write_msg+0xb2/0xf0 [netconsole]
> [ 118.795800] [<ffffffff810f9489>] call_console_drivers.constprop.23+0x149/0x1e0
> [ 118.795802] [<ffffffff810fa334>] console_unlock+0x4e4/0x5b0
> [ 118.795803] [<ffffffff810fa7ae>] vprintk_emit+0x3ae/0x5d0
> [ 118.795805] [<ffffffff810fab29>] vprintk_default+0x29/0x40
> [ 118.795808] [<ffffffff811b7be2>] printk+0x4d/0x4f
> [ 118.795812] [<ffffffff810372c2>] show_trace_log_lvl+0x32/0x60
> [ 118.795814] [<ffffffff8103681f>] show_stack_log_lvl+0xff/0x180
> [ 118.795816] [<ffffffff81037335>] show_stack+0x25/0x50
> [ 118.795818] [<ffffffff8143001c>] dump_stack+0x85/0xc9
> [ 118.795821] [<ffffffff810e5eab>] __lock_acquire+0x193b/0x1940
> [ 118.795823] [<ffffffff810deab4>] ? cpuacct_charge+0xd4/0x1d0
> [ 118.795825] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.795826] [<ffffffff810d1dda>] ? update_curr_rt+0x15a/0x300
> [ 118.795828] [<ffffffff810e6533>] lock_acquire+0xd3/0x220
> [ 118.795830] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300
> [ 118.795831] [<ffffffff818b638d>] _raw_spin_lock+0x3d/0x80
> [ 118.795833] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300
> [ 118.795834] [<ffffffff810d835e>] walt_fixup_busy_time+0x1ee/0x300
> [ 118.795836] [<ffffffff810b783c>] set_task_cpu+0xac/0x2e0
> [ 118.795837] [<ffffffff810d2a53>] push_rt_task.part.39+0x153/0x2a0
> [ 118.795839] [<ffffffff810d2cb7>] push_rt_tasks+0x17/0x30
> [ 118.795841] [<ffffffff811b6d3b>] __balance_callback+0x45/0x5c
> [ 118.795844] [<ffffffff818b0d96>] __schedule+0xaf6/0xbb0
> [ 118.795846] [<ffffffff818b0e8c>] schedule+0x3c/0x90
> [ 118.795847] [<ffffffff818b6053>] schedule_hrtimeout_range_clock+0xe3/0x140
> [ 118.795850] [<ffffffff811125c0>] ? hrtimer_init+0x230/0x230
> [ 118.795852] [<ffffffff818b6047>] ? schedule_hrtimeout_range_clock+0xd7/0x140
> [ 118.795853] [<ffffffff818b60c3>] schedule_hrtimeout_range+0x13/0x20
> [ 118.795858] [<ffffffff81261604>] poll_schedule_timeout+0x54/0x80
> [ 118.795859] [<ffffffff81262e67>] do_sys_poll+0x3a7/0x510
> [ 118.795861] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.795864] [<ffffffff8129ac30>] ? ep_poll_callback+0x120/0x360
> [ 118.795866] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.795867] [<ffffffff810d60c0>] ? __wake_up_sync_key+0x50/0x60
> [ 118.795869] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150
> [ 118.795871] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150
> [ 118.795873] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.795875] [<ffffffff811eeced>] ? __might_fault+0x4d/0xa0
> [ 118.795877] [<ffffffff810e497d>] ? __lock_acquire+0x40d/0x1940
> [ 118.795879] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.795880] [<ffffffff81261a2a>] ? poll_select_set_timeout+0x5a/0x90
> [ 118.795883] [<ffffffff8111a3a4>] ? ktime_get_ts64+0x84/0x180
> [ 118.795885] [<ffffffff810e424d>] ? trace_hardirqs_on+0xd/0x10
> [ 118.795886] [<ffffffff8111a3d6>] ? ktime_get_ts64+0xb6/0x180
> [ 118.795888] [<ffffffff81261a2a>] ? poll_select_set_timeout+0x5a/0x90
> [ 118.795889] [<ffffffff81263095>] SyS_poll+0x65/0xf0
> [ 118.795891] [<ffffffff818b7080>] entry_SYSCALL_64_fastpath+0x23/0xc1
> [ 118.796149] [<ffffffff8143001c>] dump_stack+0x85/0xc9
> [ 118.796153] [<ffffffff810e5eab>] __lock_acquire+0x193b/0x1940
> [ 118.796156] [<ffffffff810deab4>] ? cpuacct_charge+0xd4/0x1d0
> [ 118.796159] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.796163] [<ffffffff810d1dda>] ? update_curr_rt+0x15a/0x300
> [ 118.796166] [<ffffffff810e6533>] lock_acquire+0xd3/0x220
> [ 118.796169] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300
> [ 118.796173] [<ffffffff818b638d>] _raw_spin_lock+0x3d/0x80
> [ 118.796176] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300
> [ 118.796179] [<ffffffff810d835e>] walt_fixup_busy_time+0x1ee/0x300
> [ 118.796183] [<ffffffff810b783c>] set_task_cpu+0xac/0x2e0
> [ 118.796187] [<ffffffff810d2a53>] push_rt_task.part.39+0x153/0x2a0
> [ 118.796190] [<ffffffff810d2cb7>] push_rt_tasks+0x17/0x30
> [ 118.796194] [<ffffffff811b6d3b>] __balance_callback+0x45/0x5c
> [ 118.796198] [<ffffffff818b0d96>] __schedule+0xaf6/0xbb0
> [ 118.796201] [<ffffffff818b0e8c>] schedule+0x3c/0x90
> [ 118.796204] [<ffffffff818b6053>] schedule_hrtimeout_range_clock+0xe3/0x140
> [ 118.796207] [<ffffffff811125c0>] ? hrtimer_init+0x230/0x230
> [ 118.796211] [<ffffffff818b6047>] ? schedule_hrtimeout_range_clock+0xd7/0x140
> [ 118.796215] [<ffffffff818b60c3>] schedule_hrtimeout_range+0x13/0x20
> [ 118.796218] [<ffffffff81261604>] poll_schedule_timeout+0x54/0x80
> [ 118.796221] [<ffffffff81262e67>] do_sys_poll+0x3a7/0x510
> [ 118.796225] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.796228] [<ffffffff8129ac30>] ? ep_poll_callback+0x120/0x360
> [ 118.796232] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.796235] [<ffffffff810d60c0>] ? __wake_up_sync_key+0x50/0x60
> [ 118.796239] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150
> [ 118.796242] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150
> [ 118.796246] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.796249] [<ffffffff811eeced>] ? __might_fault+0x4d/0xa0
> [ 118.796253] [<ffffffff810e497d>] ? __lock_acquire+0x40d/0x1940
> [ 118.796257] [<ffffffff8103d319>] ? sched_clock+0x9/0x10
> [ 118.796260] [<ffffffff81261a2a>] ? poll_select_set_timeout+0x5a/0x90
> [ 118.796264] [<ffffffff8111a3a4>] ? ktime_get_ts64+0x84/0x180
> [ 118.796268] [<ffffffff810e424d>] ? trace_hardirqs_on+0xd/0x10
> [ 118.796271] [<ffffffff8111a3d6>] ? ktime_get_ts64+0xb6/0x180
> [ 118.796275] [<ffffffff81261a2a>] ? poll_select_set_timeout+0x5a/0x90
> [ 118.796278] [<ffffffff81263095>] SyS_poll+0x65/0xf0
> [ 118.796281] [<ffffffff818b7080>] entry_SYSCALL_64_fastpath+0x23/0xc1
> [ 128.972478] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0dModules linked in: intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel arc4 aesni_intel iwldvm aes_x86_64 lrw gf128mul mac80211 glue_helper ablk_helper cryptd joydev iwlwifi snd_hda_codec_hdmi serio_raw snd_hda_codec_conexant snd_hda_codec_generic snd_hda_intel intel_ips snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi thinkpad_acpi snd_seq lpc_ich snd_seq_device mei_me snd_timer nvram mei snd soundcore mac_hid shpchp netconsole configfs parport_pc ppdev lp parport autofs4 hid_generic nouveau mxm_wmi i2c_algo_bit ttm drm_kms_helper syscopyarea firewire_ohci sysfillrect usbhid sysimgblt ahci fb_sys_fops e1000e psmouse hid libahci sdhci_pci firewire_core crc_itu_t sdhci drm ptp pps_core video wmi
> [ 128.972479] irq event stamp: 1026850
> [ 128.972480] hardirqs last enabled at (1026849): [<ffffffff811243a6>] tick_nohz_idle_enter+0x46/0x80
> [ 128.972481] hardirqs last disabled at (1026850): [<ffffffff810d6a7d>] cpu_startup_entry+0xcd/0x450
> [ 128.972481] softirqs last enabled at (1026834): [<ffffffff8108b181>] _local_bh_enable+0x21/0x50
> [ 128.972482] softirqs last disabled at (1026833): [<ffffffff8108c2b2>] irq_enter+0x72/0xa0
--
#include <best/regards.h>
Patrick Bellasi
This patch series implements an alternative window assisted load tracking
mechanism in lieu of PELT based cpu utilization tracking. Testing has
shown that a window based non-decaying metric such as WALT guiding cpu
frequency and task placement decisions can improve performance/power
especially when running workloads more commonly found on mobile devices.
The aim of this series is to incorporate WALT accounting into the
scheduler and feed WALT statistics to schedutil in order to guide cpu
frequency selection. The implementation is detailed in the commit text
of Patch 1. The eventual goal is to also guide placement decisions
based on WALT statistics.
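To make the contrast between a decaying metric and a window based non-decaying one concrete, here is a toy comparison. It is not the WALT implementation from Patch 1: the window length, the history depth of 5 windows and the max policy are assumptions for the example, and the decayed average only mimics PELT's 32-period half-life without its fixed-point arithmetic.

```python
def decayed_util(busy_fractions, half_life=32):
    """Geometrically decayed average of per-period busy fractions (0..1).

    The contribution of a period halves every half_life periods.
    """
    y = 0.5 ** (1.0 / half_life)
    util = 0.0
    for busy in busy_fractions:
        util = util * y + busy * (1.0 - y)
    return util


def windowed_util(busy_fractions, window_len, history=5, policy=max):
    """Window based estimate: busy fraction per full window, then a policy
    (max by default) over the most recent history windows."""
    windows = []
    for start in range(0, len(busy_fractions) - window_len + 1, window_len):
        chunk = busy_fractions[start:start + window_len]
        windows.append(sum(chunk) / window_len)
    recent = windows[-history:]
    return policy(recent) if recent else 0.0


if __name__ == "__main__":
    # 32 fully busy periods followed by 32 idle periods.
    trace = [1.0] * 32 + [0.0] * 32
    print("decayed :", decayed_util(trace))
    print("windowed:", windowed_util(trace, window_len=16))
```

In this model the decayed value starts dropping as soon as the task goes idle, while the windowed value holds its level until the busy windows fall out of the history, which is the "non-decaying" property the cover letter refers to.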
WALT has existed in out-of-tree kernels for ARM/ARM64 commercialized
devices for a few years. This is an effort to bring WALT to mainline
as well as to test on multiple architectures and with varied workloads.
This RFC version is mainly to preview what the code will look like on
mainline. Future RFC revisions will include a theoretical discussion and
benchmark results.
Tested on an Intel x86_64 machine (on top of 4.7-rc6). (Benchmark
results will be sent out separately and as part of this message in the
next RFC version).
Patch 1: Adds WALT tracking to the scheduler
Patches 2-3: Temporary patches to bring in EAS/sched-freq like capacity
table and to use Intel PMC counters for more accurate
frequency invariant load tracking on X86. Included for
completeness but not meant for merging.
include/linux/sched.h | 35 ++++++++++
include/linux/sched/sysctl.h | 2 +
include/trace/events/sched.h | 76 +++++++++++++++++++++
init/Kconfig | 9 +++
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 29 ++++++++-
kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++-
kernel/sched/cputime.c | 11 +++-
kernel/sched/debug.c | 10 +++
kernel/sched/fair.c | 7 +-
kernel/sched/sched.h | 13 ++++
kernel/sched/walt.c | 580 ++++++++++++++++++++++++++++++++++
kernel/sched/walt.h | 75 +++++++++++++++++++++
kernel/sysctl.c | 18 +++++
14 files changed, 904 insertions(+), 6 deletions(-)
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Hi,
In preparation for the Connect LAS16 hacking sessions, I've been trying to
compare schedfreq with schedutil with EAS on Hikey running Android 4.4.
In the document below you can view some preliminary results of what I
found. Please excuse the brevity; I'm sure way more comments would be
needed to make the document more understandable, but I hope it's still
useful and I wanted to share it as soon as possible. Please don't
hesitate to ask for more information or clarifications (best is by
commenting directly on the document, I think).
https://docs.google.com/document/d/1tMb9yfJgaZmVANbbhTWwjLQLA2v5tj-_aTLXDeS…
tl;dr;
- merging pain wasn't so bad
- schedutil relatively close to schedfreq and interactive (even if
high percentiles seem to be quite off): perf is lower, but it saves
some energy
- schedutil driven by WALT generally improves figures (but not that
much)
- it remains to be seen how much work is required to put schedtune on top
of schedutil
Best,
- Juri
This patch series is essentially based on Morten's patch "sched/fair:
Compute task/cpu utilization at wake-up more correctly"; the goal is to
achieve a more accurate estimation of CPU utilization and to choose
the proper CPU whenever possible.
Previously we had two main issues with CPU utilization:
- Without Morten's patch, the CPU the task previously ran on still
carries a stale utilization contribution from the task; so after the
task is woken up, if we add the previous CPU's utilization and the
task's utilization, part of the task's utilization is actually counted
twice. As a result, the previous CPU has less chance to be chosen for
the task.
So patch "sched/fair: use cpu_util_wake() for energy awared path"
builds on Morten's patch to correct the previous CPU's utilization
value if the task has run on it.
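The double counting on the previous CPU can be shown with a few numbers. The sketch below is a simplified model of the correction described above, not the kernel function itself: the argument list is invented for the example, and the utilization values are hypothetical.

```python
def cpu_util(cfs_util_avg, capacity_orig):
    """CPU utilization clamped to the CPU's original capacity."""
    return min(cfs_util_avg, capacity_orig)


def cpu_util_wake(cpu, task_cpu, cfs_util_avg, task_util, capacity_orig):
    """Utilization of cpu as seen by a waking task.

    If the task last ran on this CPU, its (stale) contribution is removed
    so that it is not counted twice when task_util is added back later.
    """
    if cpu != task_cpu:
        return cpu_util(cfs_util_avg, capacity_orig)
    util = max(cfs_util_avg - task_util, 0)
    return min(util, capacity_orig)


if __name__ == "__main__":
    # Hypothetical values: prev CPU shows 600, of which 200 is the task.
    prev_util, task_util, capacity = 600, 200, 1024
    naive = cpu_util(prev_util, capacity) + task_util
    fixed = cpu_util_wake(0, 0, prev_util, task_util, capacity) + task_util
    print("naive estimate:", naive, "corrected estimate:", fixed)
```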
- Another well known issue is that an idle CPU's utilization keeps
an old value after the CPU enters an idle state. So the idle CPU's
utilization does not change until it is woken up again. This is
misleading when selecting the target CPU.
In the kernel, function update_blocked_averages() can be called
directly to update an idle CPU's utilization value. But this function
acquires the CPU's rq lock, so it introduces contention on that lock
between CPUs. This is the main concern, as it may introduce a potential
performance issue, so the update is only done when the CPU is idle and
its utilization value has not yet decayed to 0.
Leo Yan (3):
sched/fair: use cpu_util_wake() for energy awared path
sched/fair: add trace point for sched_new_util
sched/fair: update idle CPUs utilization when wake task
Morten Rasmussen (1):
sched/fair: Compute task/cpu utilization at wake-up more correctly
include/trace/events/sched.h | 25 ++++++++++++++
kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 104 insertions(+), 1 deletion(-)
--
1.9.1
Hi Patrick,
This patch series refines and enhances schedTune.
It has two main purposes. One is to adjust the range of the
capacity index so that the capacity index and energy index have similar
ranges. This helps tasks fall into a more
reasonable PE filter region. This is done by patch 1.
The other is to support negative boosting values in the PE filter, so
that the schedTune algorithm is complete and supports both
positive and negative boosting values. This is done by patches 2~5.
Please note, this patch set is mainly meant for discussion. I have _NOT_
done any testing on my side.
Leo Yan (5):
sched/fair: discount capacity index for PE filter
sched/tune: minor fix for gain table
sched/tune: polish for PE gain table index
sched/tune: open optimal and sub-optimal regions for checking
sched/tune: add PE filter support for negative boosting
kernel/sched/fair.c | 10 +++++
kernel/sched/tune.c | 111 +++++++++++++++++++++++-----------------------------
2 files changed, 58 insertions(+), 63 deletions(-)
--
1.9.1