Currently the energy calculation in EAS does not consider RT pressure, so
it is quite possible to select a CPU for CFS tasks which already has high
RT pressure and accumulate utilization on it; as a result, the other CPUs
with low RT pressure lose the chance to run CFS tasks and to reduce
contention between CFS and RT tasks. From a performance view this is not
optimal; furthermore it also harms power, since packing RT and CFS tasks
on a single CPU more easily triggers a CPU frequency increase.
We can measure the summed CPU utilization and calculate the standard
deviation of the per-CPU utilization to gauge whether tasks are spread
well within the same cluster for a middle workload case. Below is the
comparison for video playback on Hikey960 before and after applying this
patch set (using the schedutil CPUFreq governor):
          Without Patch Set:              |        With Patch Set:
CPU   Min(Util) Mean(Util) Max(Util)      |  Min(Util) Mean(Util) Max(Util)
0         7         67       205          |      8         52       170
1         4         53       227          |      9         47       188
2         4         57       191          |      8         38       192
3         4         35       165          |     16         47       146
s.d.    1.5       13.3      25.9          |    3.9       5.83      20.9
4         0         35       160          |     10         34       129
5         0         24       129          |      0         30       115
6         0         18       123          |      0         18        95
7         0         12        84          |      0         21        73
s.d.      0        9.8      31.2          |      5        7.5      24.4
The standard deviation of the mean CPU utilization decreases after
applying this patch set (little cluster: 13.3 vs 5.83; big cluster:
9.8 vs 7.5). This is also confirmed by the average CPU frequency:
                Without Patch Set:    With Patch Set:
                Average Frequency   | Average Frequency
LITTLE cluster       737MHz         |      646MHz
big cluster          916MHz         |      922MHz
Leo Yan (4):
sched/fair: Select maximum spare capacity for idle candidate CPUs
sched: Introduce cpu_util_sum()/__cpu_util_sum() functions
sched/fair: Consider RT pressure for find_best_target()
sched/fair: Consider RT/DL pressure for energy calculation
kernel/sched/fair.c | 22 +++++++++++++++++++---
kernel/sched/sched.h | 29 +++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 3 deletions(-)
--
1.9.1
Hi all,
First of all, this patch set is an optimization of the energy comparison.
The performance of the energy comparison matters if we want to add more
candidate CPUs when picking the best CPU.
Another meaningful point of this patch set is to make the energy
calculation task oriented. The current algorithm calculates CPU energy;
this patch set changes the concept so that we know how much energy is
introduced by the woken task.
With this patch set, below are the measured energy calculation durations;
the duration measurement relies on patch [1]. The statistics use the mean
duration (unit: ns) and show the performance improvement with this
patch set:
wl: workload runtime percentage with period = 5ms
           Without Patches    With Patches    Opt %
wl:  1%        11858              8457        28.7%
wl:  5%        13028              9534        26.8%
wl: 10%         9361              7831        16.3%
wl: 20%        10736              7999        25.5%
wl: 30%         8216              7210        12.2%
wl: 40%        15222              9538        37.3%
You can check the detailed testing results with the LISA scripts [2][3].
This follows up on some discussion we had at SFO17 Connect; could you
review this patch set and let me know whether it is good to commit on
Gerrit for the Android common kernel?
[1] https://git.linaro.org/people/leo.yan/linux-eas-opt.git/commit/?h=android-h…
[2] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/exampl…
[3] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/exampl…
Leo Yan (3):
sched/fair: Optimize energy calculation with task oriented
sched/fair: Use per cpu data to maintain energy environment
sched/fair: Record energy and capacity data for every CPU
kernel/sched/fair.c | 364 +++++++++++++++++++++++++++++-----------------------
1 file changed, 204 insertions(+), 160 deletions(-)
--
1.9.1
A CPU is active when it has running tasks, and the CPUFreq governor can
select different operating points (OPPs) according to the workload; we
use 'pstate' to denote the CPU state of having running tasks at one
specific OPP. On the other hand, a CPU is idle when only the idle task is
on it, and the CPUIdle governor can select a specific idle state to power
off hardware logic; we use 'cstate' to denote the CPU idle state.
Based on the trace events 'cpu_idle' and 'cpu_frequency' we can
accomplish duration statistics for every state. Every time a CPU enters
or exits an idle state, the trace event 'cpu_idle' is recorded; the trace
event 'cpu_frequency' records CPU OPP changes, so it is easy to know how
long the CPU stays at a specific OPP, during which the CPU must not be
in any idle state.
This patch utilizes the mentioned trace events for pstate and cstate
statistics. To achieve more accurate profiling data, the program uses the
sequence below to ensure CPU running/idle time is not missed:
- Before profiling, the user space program wakes up all CPUs once, so as
  to avoid missing accounting time for CPUs that stay in an idle state
  for a long time; the program forcibly sets 'scaling_max_freq' to the
  lowest frequency and then restores 'scaling_max_freq' to the highest
  frequency. This ensures the frequency is set to the lowest value, so
  that once the workload starts running the frequency can easily be
  raised;
- The user space program reads the map data and updates the statistics
  every 5s, the same as other sample bpf programs, to avoid big overhead
  introduced by the bpf program itself;
- When a signal is sent to terminate the program, the signal handler
  wakes up all CPUs, sets the lowest frequency and restores the highest
  frequency to 'scaling_max_freq'; this is exactly the same as the first
  step, so it avoids missing accounting of CPU pstate and cstate time
  during the last stage. Finally it reports the latest statistics.
The program has been tested on Hikey board with octa CA53 CPUs, below
is the example for statistics result:
CPU 0
State : Duration(ms) Distribution
cstate 0 : 47555 |********************************* |
cstate 1 : 0 | |
cstate 2 : 0 | |
pstate 0 : 15239 |********* |
pstate 1 : 1521 | |
pstate 2 : 3188 |* |
pstate 3 : 1836 | |
pstate 4 : 94 | |
CPU 1
State : Duration(ms) Distribution
cstate 0 : 87 | |
cstate 1 : 16264 |********** |
cstate 2 : 50458 |*********************************** |
pstate 0 : 832 | |
pstate 1 : 131 | |
pstate 2 : 825 | |
pstate 3 : 787 | |
pstate 4 : 4 | |
CPU 2
State : Duration(ms) Distribution
cstate 0 : 177 | |
cstate 1 : 9363 |***** |
cstate 2 : 55835 |*************************************** |
pstate 0 : 1468 | |
pstate 1 : 350 | |
pstate 2 : 1062 | |
pstate 3 : 1164 | |
pstate 4 : 7 | |
CPU 3
State : Duration(ms) Distribution
cstate 0 : 89 | |
cstate 1 : 14546 |********* |
cstate 2 : 51591 |*********************************** |
pstate 0 : 907 | |
pstate 1 : 231 | |
pstate 2 : 894 | |
pstate 3 : 1154 | |
pstate 4 : 17 | |
CPU 4
State : Duration(ms) Distribution
cstate 0 : 101 | |
cstate 1 : 16904 |*********** |
cstate 2 : 49544 |********************************** |
pstate 0 : 678 | |
pstate 1 : 230 | |
pstate 2 : 770 | |
pstate 3 : 1065 | |
pstate 4 : 8 | |
CPU 5
State : Duration(ms) Distribution
cstate 0 : 95 | |
cstate 1 : 18377 |************ |
cstate 2 : 47609 |********************************* |
pstate 0 : 1165 | |
pstate 1 : 243 | |
pstate 2 : 818 | |
pstate 3 : 1007 | |
pstate 4 : 9 | |
CPU 6
State : Duration(ms) Distribution
cstate 0 : 102 | |
cstate 1 : 16629 |********** |
cstate 2 : 49335 |********************************** |
pstate 0 : 836 | |
pstate 1 : 253 | |
pstate 2 : 895 | |
pstate 3 : 1275 | |
pstate 4 : 6 | |
CPU 7
State : Duration(ms) Distribution
cstate 0 : 88 | |
cstate 1 : 16070 |********** |
cstate 2 : 50279 |*********************************** |
pstate 0 : 948 | |
pstate 1 : 214 | |
pstate 2 : 873 | |
pstate 3 : 952 | |
pstate 4 : 0 | |
Cc: Daniel Lezcano <daniel.lezcano(a)linaro.org>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
samples/bpf/Makefile | 4 +
samples/bpf/cpustat_kern.c | 281 +++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/cpustat_user.c | 234 +++++++++++++++++++++++++++++++++++++
3 files changed, 519 insertions(+)
create mode 100644 samples/bpf/cpustat_kern.c
create mode 100644 samples/bpf/cpustat_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index adeaa13..e5d747f 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ hostprogs-y += xdp_redirect_map
hostprogs-y += xdp_redirect_cpu
hostprogs-y += xdp_monitor
hostprogs-y += syscall_tp
+hostprogs-y += cpustat
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -89,6 +90,7 @@ xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -137,6 +139,7 @@ always += xdp_redirect_map_kern.o
always += xdp_redirect_cpu_kern.o
always += xdp_monitor_kern.o
always += syscall_tp_kern.o
+always += cpustat_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTLOADLIBES_xdp_redirect_map += -lelf
HOSTLOADLIBES_xdp_redirect_cpu += -lelf
HOSTLOADLIBES_xdp_monitor += -lelf
HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_cpustat += -lelf
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
new file mode 100644
index 0000000..68c84da
--- /dev/null
+++ b/samples/bpf/cpustat_kern.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/version.h>
+#include <linux/ptrace.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/*
+ * The CPU number, cstate number and pstate number are based
+ * on 96boards Hikey with octa CA53 CPUs.
+ *
+ * Every CPU have three idle states for cstate:
+ * WFI, CPU_OFF, CLUSTER_OFF
+ *
+ * Every CPU have 5 operating points:
+ * 208MHz, 432MHz, 729MHz, 960MHz, 1200MHz
+ *
+ * This code is based on these assumption and other platforms
+ * need to adjust these definitions.
+ */
+#define MAX_CPU 8
+#define MAX_PSTATE_ENTRIES 5
+#define MAX_CSTATE_ENTRIES 3
+
+static int cpu_opps[] = { 208000, 432000, 729000, 960000, 1200000 };
+
+/*
+ * my_map structure is used to record cstate and pstate index and
+ * timestamp (Idx, Ts), when new event incoming we need to update
+ * combination for new state index and timestamp (Idx`, Ts`).
+ *
+ * Based on (Idx, Ts) and (Idx`, Ts`) we can calculate the time
+ * interval for the previous state: Duration(Idx) = Ts` - Ts.
+ *
+ * Every CPU has one below array for recording state index and
+ * timestamp, and record for cstate and pstate saperately:
+ *
+ * +--------------------------+
+ * | cstate timestamp |
+ * +--------------------------+
+ * | cstate index |
+ * +--------------------------+
+ * | pstate timestamp |
+ * +--------------------------+
+ * | pstate index |
+ * +--------------------------+
+ */
+#define MAP_OFF_CSTATE_TIME 0
+#define MAP_OFF_CSTATE_IDX 1
+#define MAP_OFF_PSTATE_TIME 2
+#define MAP_OFF_PSTATE_IDX 3
+#define MAP_OFF_NUM 4
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(u64),
+ .max_entries = MAX_CPU * MAP_OFF_NUM,
+};
+
+/* cstate_duration records duration time for every idle state per CPU */
+struct bpf_map_def SEC("maps") cstate_duration = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(u64),
+ .max_entries = MAX_CPU * MAX_CSTATE_ENTRIES,
+};
+
+/* pstate_duration records duration time for every operating point per CPU */
+struct bpf_map_def SEC("maps") pstate_duration = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(u64),
+ .max_entries = MAX_CPU * MAX_PSTATE_ENTRIES,
+};
+
+/*
+ * The trace events for cpu_idle and cpu_frequency are taken from:
+ * /sys/kernel/debug/tracing/events/power/cpu_idle/format
+ * /sys/kernel/debug/tracing/events/power/cpu_frequency/format
+ *
+ * These two events have same format, so define one common structure.
+ */
+struct cpu_args {
+ u64 pad;
+ u32 state;
+ u32 cpu_id;
+};
+
+/* calculate pstate index, returns MAX_PSTATE_ENTRIES for failure */
+static u32 find_cpu_pstate_idx(u32 frequency)
+{
+ u32 i;
+
+ for (i = 0; i < sizeof(cpu_opps) / sizeof(u32); i++) {
+ if (frequency == cpu_opps[i])
+ return i;
+ }
+
+ return i;
+}
+
+SEC("tracepoint/power/cpu_idle")
+int bpf_prog1(struct cpu_args *ctx)
+{
+ u64 *cts, *pts, *cstate, *pstate, prev_state, cur_ts, delta;
+ u32 key, cpu, pstate_idx;
+ u64 *val;
+
+	if (ctx->cpu_id >= MAX_CPU)
+ return 0;
+
+ cpu = ctx->cpu_id;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_CSTATE_TIME;
+ cts = bpf_map_lookup_elem(&my_map, &key);
+ if (!cts)
+ return 0;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_CSTATE_IDX;
+ cstate = bpf_map_lookup_elem(&my_map, &key);
+ if (!cstate)
+ return 0;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_PSTATE_TIME;
+ pts = bpf_map_lookup_elem(&my_map, &key);
+ if (!pts)
+ return 0;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_PSTATE_IDX;
+ pstate = bpf_map_lookup_elem(&my_map, &key);
+ if (!pstate)
+ return 0;
+
+ prev_state = *cstate;
+ *cstate = ctx->state;
+
+ if (!*cts) {
+ *cts = bpf_ktime_get_ns();
+ return 0;
+ }
+
+ cur_ts = bpf_ktime_get_ns();
+ delta = cur_ts - *cts;
+ *cts = cur_ts;
+
+ /*
+	 * When the state is not (u32)-1, the cpu will enter
+	 * an idle state; in this case we need to record the
+	 * interval for the pstate.
+ *
+ * OPP2
+ * +---------------------+
+ * OPP1 | |
+ * ---------+ |
+ * | Idle state
+ * +---------------
+ *
+ * |<- pstate duration ->|
+ * ^ ^
+ * pts cur_ts
+ */
+ if (ctx->state != (u32)-1) {
+
+ /* record pstate after have first cpu_frequency event */
+ if (!*pts)
+ return 0;
+
+ delta = cur_ts - *pts;
+
+ pstate_idx = find_cpu_pstate_idx(*pstate);
+ if (pstate_idx >= MAX_PSTATE_ENTRIES)
+ return 0;
+
+ key = cpu * MAX_PSTATE_ENTRIES + pstate_idx;
+ val = bpf_map_lookup_elem(&pstate_duration, &key);
+ if (val)
+ __sync_fetch_and_add((long *)val, delta);
+
+ /*
+	 * When the state equals (u32)-1, the cpu just exits from a
+	 * specific idle state; in this case we need to record the
+	 * interval for the cstate.
+ *
+ * OPP2
+ * -----------+
+ * | OPP1
+ * | +-----------
+ * | Idle state |
+ * +---------------------+
+ *
+ * |<- cstate duration ->|
+ * ^ ^
+ * cts cur_ts
+ */
+ } else {
+
+ key = cpu * MAX_CSTATE_ENTRIES + prev_state;
+ val = bpf_map_lookup_elem(&cstate_duration, &key);
+ if (val)
+ __sync_fetch_and_add((long *)val, delta);
+ }
+
+ /* Update timestamp for pstate as new start time */
+ if (*pts)
+ *pts = cur_ts;
+
+ return 0;
+}
+
+SEC("tracepoint/power/cpu_frequency")
+int bpf_prog2(struct cpu_args *ctx)
+{
+ u64 *pts, *cstate, *pstate, prev_state, cur_ts, delta;
+ u32 key, cpu, pstate_idx;
+ u64 *val;
+
+ cpu = ctx->cpu_id;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_PSTATE_TIME;
+ pts = bpf_map_lookup_elem(&my_map, &key);
+ if (!pts)
+ return 0;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_PSTATE_IDX;
+ pstate = bpf_map_lookup_elem(&my_map, &key);
+ if (!pstate)
+ return 0;
+
+ key = cpu * MAP_OFF_NUM + MAP_OFF_CSTATE_IDX;
+ cstate = bpf_map_lookup_elem(&my_map, &key);
+ if (!cstate)
+ return 0;
+
+ prev_state = *pstate;
+ *pstate = ctx->state;
+
+ if (!*pts) {
+ *pts = bpf_ktime_get_ns();
+ return 0;
+ }
+
+ cur_ts = bpf_ktime_get_ns();
+ delta = cur_ts - *pts;
+ *pts = cur_ts;
+
+ /* When CPU is in idle, bail out to skip pstate statistics */
+ if (*cstate != (u32)(-1))
+ return 0;
+
+ /*
+	 * The cpu changes to a different OPP (in the diagram below,
+	 * from OPP3 to OPP1); we need to record the interval for the
+	 * previous frequency OPP3 and update the timestamp as the
+	 * start time for the new frequency OPP1.
+ *
+ * OPP3
+ * +---------------------+
+ * OPP2 | |
+ * ---------+ |
+ * | OPP1
+ * +---------------
+ *
+ * |<- pstate duration ->|
+ * ^ ^
+ * pts cur_ts
+ */
+ pstate_idx = find_cpu_pstate_idx(*pstate);
+ if (pstate_idx >= MAX_PSTATE_ENTRIES)
+ return 0;
+
+ key = cpu * MAX_PSTATE_ENTRIES + pstate_idx;
+ val = bpf_map_lookup_elem(&pstate_duration, &key);
+ if (val)
+ __sync_fetch_and_add((long *)val, delta);
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
new file mode 100644
index 0000000..e497f85
--- /dev/null
+++ b/samples/bpf/cpustat_user.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sched.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <linux/bpf.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/wait.h>
+
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_CPU 8
+#define MAX_PSTATE_ENTRIES 5
+#define MAX_CSTATE_ENTRIES 3
+#define MAX_STARS 40
+
+#define CPUFREQ_MAX_SYSFS_PATH "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"
+#define CPUFREQ_LOWEST_FREQ "208000"
+#define CPUFREQ_HIGHEST_FREQ "1200000"
+
+struct cpu_hist {
+ unsigned long cstate[MAX_CSTATE_ENTRIES];
+ unsigned long pstate[MAX_PSTATE_ENTRIES];
+};
+
+static struct cpu_hist cpu_hist[MAX_CPU];
+static unsigned long max_data;
+
+static void stars(char *str, long val, long max, int width)
+{
+ int i;
+
+ for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+ str[i] = '*';
+ if (val > max)
+ str[i - 1] = '+';
+ str[i] = '\0';
+}
+
+static void print_hist(void)
+{
+ char starstr[MAX_STARS];
+ struct cpu_hist *hist;
+ int i, j;
+
+ /* ignore without data */
+ if (max_data == 0)
+ return;
+
+ /* clear screen */
+ printf("\033[2J");
+
+ for (j = 0; j < MAX_CPU; j++) {
+ hist = &cpu_hist[j];
+
+ printf("CPU %d\n", j);
+ printf("State : Duration(ms) Distribution\n");
+ for (i = 0; i < MAX_CSTATE_ENTRIES; i++) {
+ stars(starstr, hist->cstate[i], max_data, MAX_STARS);
+ printf("cstate %d : %-8ld |%-*s|\n", i,
+ hist->cstate[i] / 1000000, MAX_STARS, starstr);
+ }
+
+ for (i = 0; i < MAX_PSTATE_ENTRIES; i++) {
+ stars(starstr, hist->pstate[i], max_data, MAX_STARS);
+ printf("pstate %d : %-8ld |%-*s|\n", i,
+ hist->pstate[i] / 1000000, MAX_STARS, starstr);
+ }
+
+ printf("\n");
+ }
+}
+
+static void get_data(int cstate_fd, int pstate_fd)
+{
+ unsigned long key, value;
+ int c, i;
+
+ max_data = 0;
+
+ for (c = 0; c < MAX_CPU; c++) {
+ for (i = 0; i < MAX_CSTATE_ENTRIES; i++) {
+ key = c * MAX_CSTATE_ENTRIES + i;
+ bpf_map_lookup_elem(cstate_fd, &key, &value);
+ cpu_hist[c].cstate[i] = value;
+
+ if (value > max_data)
+ max_data = value;
+ }
+
+ for (i = 0; i < MAX_PSTATE_ENTRIES; i++) {
+ key = c * MAX_PSTATE_ENTRIES + i;
+ bpf_map_lookup_elem(pstate_fd, &key, &value);
+ cpu_hist[c].pstate[i] = value;
+
+ if (value > max_data)
+ max_data = value;
+ }
+ }
+}
+
+/*
+ * This function is copied from idlestat_wake_all() in
+ * idlestate.c; it sets the task's affinity to each CPU
+ * one by one so each CPU wakes up to handle scheduling;
+ * as a result all CPUs are woken up once and produce the
+ * trace event 'cpu_idle'.
+ */
+static int cpu_stat_inject_cpu_idle_event(void)
+{
+ int rcpu, i, ret;
+ cpu_set_t cpumask;
+ cpu_set_t original_cpumask;
+
+ ret = sysconf(_SC_NPROCESSORS_CONF);
+ if (ret < 0)
+ return -1;
+
+ rcpu = sched_getcpu();
+ if (rcpu < 0)
+ return -1;
+
+ /* Keep track of the CPUs we will run on */
+ sched_getaffinity(0, sizeof(original_cpumask), &original_cpumask);
+
+ for (i = 0; i < ret; i++) {
+
+ /* Pointless to wake up ourself */
+ if (i == rcpu)
+ continue;
+
+ /* Pointless to wake CPUs we will not run on */
+ if (!CPU_ISSET(i, &original_cpumask))
+ continue;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(i, &cpumask);
+
+ sched_setaffinity(0, sizeof(cpumask), &cpumask);
+ }
+
+ /* Enable all the CPUs of the original mask */
+ sched_setaffinity(0, sizeof(original_cpumask), &original_cpumask);
+ return 0;
+}
+
+/*
+ * It's possible that there is no frequency change for a long
+ * time, so no 'cpu_frequency' trace event is generated; this
+ * can introduce a big deviation in the pstate statistics.
+ *
+ * To solve this issue, we force a write to 'scaling_max_freq'
+ * to trigger the 'cpu_frequency' trace event and then restore
+ * the maximum frequency value. For this purpose, the code
+ * below first sets the maximum frequency to 208MHz and then
+ * restores it to 1200MHz.
+ */
+static int cpu_stat_inject_cpu_frequency_event(void)
+{
+ int len, fd;
+
+ fd = open(CPUFREQ_MAX_SYSFS_PATH, O_WRONLY);
+ if (fd < 0) {
+ printf("failed to open scaling_max_freq, errno=%d\n", errno);
+ return fd;
+ }
+
+ len = write(fd, CPUFREQ_LOWEST_FREQ, strlen(CPUFREQ_LOWEST_FREQ));
+ if (len < 0) {
+		printf("failed to write scaling_max_freq, errno=%d\n", errno);
+ goto err;
+ }
+
+ len = write(fd, CPUFREQ_HIGHEST_FREQ, strlen(CPUFREQ_HIGHEST_FREQ));
+ if (len < 0) {
+		printf("failed to write scaling_max_freq, errno=%d\n", errno);
+ goto err;
+ }
+
+err:
+ close(fd);
+ return len;
+}
+
+static void int_exit(int sig)
+{
+ cpu_stat_inject_cpu_idle_event();
+ cpu_stat_inject_cpu_frequency_event();
+ get_data(map_fd[1], map_fd[2]);
+ print_hist();
+ exit(0);
+}
+
+int main(int argc, char **argv)
+{
+ char filename[256];
+ int ret;
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ ret = cpu_stat_inject_cpu_idle_event();
+ if (ret < 0)
+ return 1;
+
+ ret = cpu_stat_inject_cpu_frequency_event();
+ if (ret < 0)
+ return 1;
+
+ signal(SIGINT, int_exit);
+ signal(SIGTERM, int_exit);
+
+ while (1) {
+ get_data(map_fd[1], map_fd[2]);
+ print_hist();
+ sleep(5);
+ }
+
+ return 0;
+}
--
2.7.4
From: Erin Yang <erin.yang(a)mstarsemi.com>
The two conditions, “target_capacity” and “target_max_spare_cap”,
in find_best_target() are not enough to result in a good load
balance within a cluster.
For example, there are 2 little (say core 0 & 1) and 2 big (say core 2 & 3) cores.
The capacity of little and big cores are 500 and 1024, respectively.
Step 1: A task with task_util 100 and boost value 50 comes.
So, it starts from big cores. Finally, CPU 3 is selected.
Step 2: A task with task_util 10 and boost value 50 comes.
So, it starts from big cores. Finally, CPU 3 is selected again.
If we add an extra condition, "target_max_free_util",
CPU 2 can be selected instead at Step 2.
Assume the current CPU utilization is as follows:
        Capacity_orig  cpu_util
CPU 0        500          100
CPU 1        500          100
CPU 2       1024          100
CPU 3       1024          100
Step 2. A task with task_util 100 and boost value 50 comes, so it starts from the big cores. Finally, CPU 3 is selected.
        Capacity_orig  cpu_util  New_util  target_max_spare_cap  target_max_free_util
CPU 0        500          100       562     (skipped: new_util > capacity_orig)
CPU 1        500          100       562     (skipped: new_util > capacity_orig)
CPU 2       1024          100       562            462                  924
CPU 3       1024          100       562            462                  924   (CPU 3 is selected.)
(New_util = 100 + (1024 - 100) * 50% = 562)
Step 3. A task with task_util 10 and boost value 50 comes, so it starts from the big cores.
        Capacity_orig  cpu_util  New_util  target_max_spare_cap  target_max_free_util
CPU 0        500          100       517     (skipped: new_util > capacity_orig)
CPU 1        500          100       517     (skipped: new_util > capacity_orig)
CPU 2       1024          100       517            507                  924
CPU 3       1024          200       517            507                  824
(New_util = 10 + (1024 - 10) * 50% = 517)
To test it, this LISA notebook can create small boosted tasks with rt-app:
https://github.com/realmz/lisa/blob/hikey960_v3/ipynb/tests/max_free_util_t…
Change-Id: I0ef662e584e1e750381039a9a3941e43c37c221f
Signed-off-by: Erin Yang <erin.yang(a)mstarsemi.com>
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a146ac4..5abf50c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6697,6 +6697,7 @@
unsigned long target_max_spare_cap = 0;
unsigned long target_util = ULONG_MAX;
unsigned long best_active_util = ULONG_MAX;
+ unsigned long target_max_free_util = 0;
int best_idle_cstate = INT_MAX;
struct sched_domain *sd;
struct sched_group *sg;
@@ -6910,8 +6911,8 @@
* that CPU at an higher OPP.
*
* Thus, this case keep track of the CPU with the
- * smallest maximum capacity and highest spare maximum
- * capacity.
+ * smallest maximum capacity, highest spare maximum
+ * capacity and highest free cpu utility.
*/
/* Favor CPUs with smaller capacity */
@@ -6922,8 +6923,13 @@
if ((capacity_orig - new_util) < target_max_spare_cap)
continue;
+ /* Favor CPUs with maximum free utilization */
+ if ((capacity_orig - cpu_util(i)) < target_max_free_util)
+ continue;
+
target_max_spare_cap = capacity_orig - new_util;
target_capacity = capacity_orig;
+ target_max_free_util = capacity_orig - cpu_util(i);
target_util = new_util;
target_cpu = i;
}
Hello,
I did some comparisons of PELT and WALT and have some very interesting
performance results that I wanted to share with all of you. I haven't
got any power numbers as I don't have a setup for that.
Key points:
- All the tests were done on Hikey960, with a 5V Fan placed over the
SoC to cool it down.
- HDMI port was disconnected while running tests.
- CONFIG_SCHED_TUNE was configured out to keep things simple.
- Only the PCmark bench was tested, with help of workload automation.
- Below number shows the average out of 3 runs, performed during a
single kernel boot cycle.
- Pelt 8/16/32 are the half-life periods.
- While testing Pelt, CONFIG_WALT was disabled.
+------------------+----------+------------+------------+-----------+
| | | | | |
| Test name | WALT | Pelt 8 ms | Pelt 16 ms | Pelt 32 ms|
+------------------+----------+------------+------------+-----------+
| | | | | |
| DataManipulation | 5341 | 5561 | 5453 | 5400 |
| | | | | |
| PhotoEditingV2 | 9015 | 8577 | 7911 | 6043 |
| | | | | |
| VideoEditing | 0 | 4291 | 3746 | 3755 |
| | | | | |
| WebV2 | 6202 | 6448 | 5465 | 4648 |
| | | | | |
| Workv2 | 0 | 5697 | 5069 | 4517 |
| | | | | |
| WritingV2 | 4302 | 4549 | 3811 | 3306 |
+------------------+----------+------------+------------+-----------+
As you can see in the results, Pelt 8 is now very much comparable to the
Walt results. Hurray? :)
A detailed report is present here with some more useful numbers:
https://goo.gl/eCx4Pk
How to replicate setup:
- Android kernel tree:
https://git.linaro.org/people/vireshk/mylinux.git android-4.9-hikey
This has several patches over latest 4.9-hikey aosp tree.
- Some patches to reduce disturbances, which Vincent shared earlier
with a document.
- "thermal: Add debugfs support for cooling devices" and "cpufreq:
stats: New sysfs attribute for clearing statistics" are used to
read some more data from userspace after tests are done which can
be used to build conclusions on working of pelt/walt and how they
are behaving differently.
For example, we can know the amount of time we spent on individual
cpu frequencies while the test was running. And also the time for
which cpu-cooling and devfreq (ddr) has throttled some
frequencies.
- Pelt 16 and pelt 8 patches.
The below changes are required to capture the extra data that I have
captured in my sheet above.
I have attached pelt_walt.sh script, which you need to push to /data:
$ adb push pelt_walt.sh /data
And I have updated the pcmark plugin file to run the script and
collect data. That is attached as well.
Happy testing !!
I heard from Vincent earlier that ARM did similar testing in the past
but never found anything significant. Why? I may have an answer to that,
though I'm not sure.
I found a patch from Juri which someone is using:
https://android.googlesource.com/kernel/msm/+/b52bb1f248e4cef65edaece54a68c…
and one of the problems here is that the patch doesn't update the
__accumulated_sum_N32 array, but only runnable_avg_yN_inv and
runnable_avg_yN_sum.
That's pretty much it. Thanks for reading.
--
viresh
The comment inside cpu_util_wake() clearly says that the task_util()
isn't subtracted from cpu_util() as WALT doesn't decay idle tasks like
PELT does. That probably works fine in most of the cases, but there is
at least one case (find_best_target()) where we will account for
task_util() twice.
This has significant side effects; the most visible one is that it makes
tasks move to a big CPU instead of the LITTLE ones, and thus provides
better results in various benchmarks, like PCMark.
Fix that.
Reported-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
Hi,
I don't have in-depth knowledge of either WALT or PELT, but we noticed
something that looks incorrect and wanted to check with others whether
the finding is correct. If others agree that this is indeed the right
fix, then I will send it to the Android Gerrit.
This was tested as part of my PELT vs WALT work, and I have noticed a
significant performance difference with and without this patch. With
this patch, the amount of time spent by the big cluster at the highest
OPP is reduced significantly, and thus we get somewhat lower numbers
with PCMark, for example.
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f5925cc541f..06246e02ea09 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6812,6 +6812,14 @@ static inline int find_best_target(struct task_struct *p, int *backup_cpu,
* accounting. However, the blocked utilization may be zero.
*/
wake_util = cpu_util_wake(i, p);
+
+#ifdef CONFIG_SCHED_WALT
+ if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
+ i == task_cpu(p)) {
+ wake_util -= task_util(p);
+ }
+#endif
+
new_util = wake_util + task_util(p);
/*
--
2.15.0.194.g9af6a3dea062
Hi Joonwoo, Chris,
When porting EAS 1.4 to our platform, which is SMP (4*A7, k4.4), we
encountered frequent kernel panics after applying the following patches:
* | 9e293db sched: EAS: upmigrate misfit current task
* | dc626b2 sched: avoid pushing tasks to an offline CPU
* | 2da014c sched: Extend active balance to accept 'push_task' argument
After applying these three patches, leaving EAS disabled and running a
stability test which includes some random CPU hotplug (plug-in/plug-out),
a kernel panic sometimes happened, always with the same stack as below:
[ 214.742695] c1 ------------[ cut here ]------------
[ 214.742709] c1 kernel BUG at
/space/builder/repo/sprdroid8.1_trunk/kernel/kernel/smpboot.c:136!
[ 214.742718] c1 Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 214.748750] c0 Modules linked in: mtty marlin2_fm mali(O)
[ 214.748785] c1 CPU: 1 PID: 18 Comm: migration/2 Tainted: G W
O 4.4.83-00912-g370f62c #1
[ 214.748795] c1 Hardware name: Generic DT based system
[ 214.748805] c1 task: ef2d9680 task.stack: ee862000
[ 214.748821] c1 PC is at smpboot_thread_fn+0x168/0x270
[ 214.748832] c1 LR is at smpboot_thread_fn+0xe4/0x270
[ 214.748843] c1 pc : [<c014d71c>] lr : [<c014d698>] psr: 200e0113
sp : ee863f38 ip : ee863f38 fp : ee863f5c
[ 214.748854] c1 r10: 00000000 r9 : 00000000 r8 : 00000000
[ 214.748862] c1 r7 : 00000001 r6 : c111a814 r5 : ee846140 r4 : ee862000
[ 214.748871] c1 r3 : 00000001 r2 : ee863f28 r1 : 00000000 r0 : 00000002
[ 214.748881] c1 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment none
[ 214.748890] c1 Control: 10c5387d Table: 9b9e406a DAC: 00000051
...
[ 214.821339] c1 [<c014d71c>] (smpboot_thread_fn) from [<c0149ee4>]
(kthread+0x118/0x12c)
[ 214.821363] c1 [<c0149ee4>] (kthread) from [<c0108310>]
(ret_from_fork+0x14/0x24)
[ 214.821378] c1 Code: e5950000 e5943010 e1500003 0a000000 (e7f001f2)
kernel/kernel/smpboot.c:136:
BUG_ON(td->cpu != smp_processor_id());
It seems the oops was caused by migration/2 actually running on cpu1.
Do you have any suggestions about this? Thanks in advance.