On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Srivatsa Vaddagiri vatsa@codeaurora.org
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per-task and per-CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source, and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time (window size determined by walt_ravg_window) is used to determine CPU utilization.

What are these 'registered utilization callback listeners'?
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Corresponding to the aggregate statistics, WALT also maintains the following stats per task:
curr_window - represents the cpu utilization of the task in its most recently tracked window
prev_window - represents the cpu utilization of the task in the window prior to the one being tracked by curr_window
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
Execution time in an IRQ handler is accounted in a CPU's curr_runnable_sum statistic, provided that the CPU was also executing the idle task for the duration of the interrupt handler.
Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU rq has tasks blocked on IO, idle-task execution is accounted in per-task and per-CPU counters. Setting walt_io_is_busy will also cause interrupt handlers in the idle task to update counters as if the idle task was executing (instead of just the interrupt handler execution time).
The major tunable provided by WALT is walt_ravg_window, which represents the window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy (described above) is set to 0 by default.
Potential upcoming changes/improvements include: the use of sched_clock instead of ktime_get as a time source, support for an unsynchronized (across CPUs) time source, and integration with mainlined CPU efficiency APIs.
Signed-off-by: Srivatsa Vaddagiri vatsa@codeaurora.org
Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
 include/linux/sched.h            |  35 +++
 include/linux/sched/sysctl.h     |   1 +
 include/trace/events/sched.h     |  76 ++++++
 init/Kconfig                     |   9 +
 kernel/sched/Makefile            |   1 +
 kernel/sched/core.c              |  28 +-
 kernel/sched/cpufreq_schedutil.c |   7 +-
 kernel/sched/cputime.c           |  11 +-
 kernel/sched/debug.c             |  10 +
 kernel/sched/fair.c              |   7 +-
 kernel/sched/sched.h             |  10 +
 kernel/sched/walt.c              | 540 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/walt.h              |  73 ++++++
 kernel/sysctl.c                  |   9 +
 14 files changed, 812 insertions(+), 5 deletions(-)
 create mode 100644 kernel/sched/walt.c
 create mode 100644 kernel/sched/walt.h
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f..56e708f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -314,6 +314,17 @@ extern char ___assert_task_state[1 - 2*!!(
 /* Task command name length */
 #define TASK_COMM_LEN 16
+enum task_event {
+	PUT_PREV_TASK	= 0,
+	PICK_NEXT_TASK	= 1,
+	TASK_WAKE	= 2,
+	TASK_MIGRATE	= 3,
+	TASK_UPDATE	= 4,
+	IRQ_UPDATE	= 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running. Is there a benefit, or are they used in account_cpu_busy_time() essentially as running/not running?
[...]
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..2adf245 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -562,6 +562,82 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu) );
+TRACE_EVENT(sched_walt_util,
+
+	TP_PROTO(int cpu, unsigned long cfs_util_avg, unsigned long walt_util),
+
+	TP_ARGS(cpu, cfs_util_avg, walt_util),
+
+	TP_STRUCT__entry(
+		__field( int, cpu )
+		__field( unsigned long, cfs_util_avg )
+		__field( unsigned long, walt_util )
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->cfs_util_avg = cfs_util_avg;
+		__entry->walt_util = walt_util;
+	),
+
+	TP_printk("cpu %d cfs_util_avg %lu walt_util %lu",
+		  __entry->cpu, __entry->cfs_util_avg, __entry->walt_util)
+);
+struct rq;
+TRACE_EVENT(sched_walt_update_task_ravg,
+
+	TP_PROTO(struct task_struct *p, struct rq *rq, enum task_event evt,
+		 u64 wallclock, u64 irqtime),
+
+	TP_ARGS(p, rq, evt, wallclock, irqtime),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN )
+		__field( pid_t, pid )
+		__field( pid_t, cur_pid )
+		__field( unsigned int, cpu )
+		__field( unsigned int, cur_freq )
+		__field( u64, wallclock )
+		__field( u64, mark_start )
+		__field( u64, win_start )
+		__field( u64, irqtime )
+		__field( enum task_event, evt )
+		__field( u64, rq_cs )
+		__field( u64, rq_ps )
+		__field( u32, curr_window )
+		__field( u32, prev_window )
+	),
+
+	TP_fast_assign(
+		__entry->wallclock = wallclock;
+		__entry->win_start = rq->window_start;
+		__entry->evt = evt;
+		__entry->cpu = rq->cpu;
+		__entry->cur_pid = rq->curr->pid;
+		__entry->cur_freq = rq->cur_freq;
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid = p->pid;
+		__entry->mark_start = p->ravg.mark_start;
+		__entry->irqtime = irqtime;
+		__entry->rq_cs = rq->curr_runnable_sum;
+		__entry->rq_ps = rq->prev_runnable_sum;
+		__entry->curr_window = p->ravg.curr_window;
+		__entry->prev_window = p->ravg.prev_window;
+	),
+
+	TP_printk("wc %llu ws %llu event %s cpu %d cur_freq %u cur_pid %d task %d (%s) ms %llu irqtime %llu rq_cs %llu rq_ps %llu cur_window %u prev_window %u",
+		  __entry->wallclock, __entry->win_start,
+		  task_event_names[__entry->evt], __entry->cpu,
+		  __entry->cur_freq, __entry->cur_pid,
+		  __entry->pid, __entry->comm, __entry->mark_start,
+		  __entry->irqtime, __entry->rq_cs, __entry->rq_ps,
+		  __entry->curr_window, __entry->prev_window)
+);
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
IMHO, trace_events can go into another patch.
If you try to compile w/o CONFIG_SCHED_WALT they break the build.
[...]
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_acquire(!p->on_cpu);

+	raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
[...]
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c
new file mode 100644
index 0000000..203e02d
--- /dev/null
+++ b/kernel/sched/walt.c
@@ -0,0 +1,540 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Window Assisted Load Tracking (WALT) implementation credits:
+ * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
+ * Pavan Kumar Kondeti, Olav Haugan
+ *
+ * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
+ *             and Todd Kjos
+ * 2016-08-31: Integration with mainline by Srivatsa Vaddagiri
+ *             and Vikram Mulukutla
+ */
+#include <linux/syscore_ops.h>
+#include <linux/cpufreq.h>
+#include <trace/events/sched.h>
+#include "sched.h"
+#include "walt.h"
+char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK",
+			    "TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE",
+			    "IRQ_UPDATE"};
+__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1;
+static __read_mostly unsigned int walt_freq_account_wait_time;
+static __read_mostly unsigned int walt_io_is_busy;
+
+/* 1 -> use PELT based load stats, 0 -> use window-based load stats */
+static unsigned int __read_mostly walt_disabled;
+static unsigned int max_possible_efficiency = 1024;
+/*
+ * Maximum possible frequency across all cpus. Task demand and cpu
+ * capacity (cpu_power) metrics are scaled in reference to it.
+ */
+static unsigned int max_possible_freq = 1;
+
+/* Window size (in ns) */
+__read_mostly unsigned int walt_ravg_window = 20000000;
+
+/* Min window size (in ns) = 10ms */
+#define MIN_SCHED_RAVG_WINDOW 10000000
+
+/* Max window size (in ns) = 1s */
+#define MAX_SCHED_RAVG_WINDOW 1000000000
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
[...]
+unsigned long __weak arch_get_cpu_efficiency(int cpu)
+{
+	return 1024;
+}
So this should be connected to arch_scale_cpu_capacity(NULL, cpu) instead. You mentioned this already somewhere.
+void walt_init_cpu_efficiency(void)
+{
+	int i, efficiency;
+	unsigned int max = 0;
+
+	for_each_possible_cpu(i) {
+		efficiency = arch_get_cpu_efficiency(i);
+		cpu_rq(i)->efficiency = efficiency;
+		if (efficiency > max)
+			max = efficiency;
+	}
+
+	if (max)
+		max_possible_efficiency = max;
+}
[...]
+void walt_account_irqtime(int cpu, struct task_struct *curr,
+			  u64 delta, u64 wallclock)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	/*
+	 * cputime (wallclock) uses sched_clock so use the same here for
+	 * consistency.
+	 */
+	delta += sched_clock_cpu(cpu) - wallclock;
+
+	walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta);
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+int fast_switching = 1;
An assumption that you're on x86. Should be mentioned somewhere in the patch-header.
+void walt_freq_transition(int cpu, unsigned long new_freq)
+{
+	int i;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	for_each_cpu(i, &cpu_rq(cpu)->freq_domain_cpumask) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!fast_switching ||
+		    (fast_switching && smp_processor_id() != i))
+			raw_spin_lock(&rq->lock);
+		walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
+				      walt_ktime_clock(), 0);
+		rq->cur_freq = new_freq;
+		if (!fast_switching ||
+		    (fast_switching && smp_processor_id() != i))
+			raw_spin_unlock(&rq->lock);
+
+		if (!rq->window_start)
+			walt_set_window_start(rq);
+	}
+
+	local_irq_restore(flags);
+}
+static int cpufreq_notifier_trans(struct notifier_block *nb,
+				  unsigned long val, void *data)
+{
+	struct cpufreq_freqs *freq = (struct cpufreq_freqs *)data;
+	unsigned int cpu = freq->cpu, new_freq = freq->new;
+
+	if (val != CPUFREQ_POSTCHANGE)
+		return 0;
+
+	BUG_ON(!new_freq);
+
+	if (cpu_rq(cpu)->cur_freq == new_freq)
+		return 0;
+
+	walt_freq_transition(cpu, new_freq);
+
+	return 0;
+}
+static struct notifier_block notifier_policy_block = {
+	.notifier_call = cpufreq_notifier_policy
+};
+
+static struct notifier_block notifier_trans_block = {
+	.notifier_call = cpufreq_notifier_trans
+};
+static int register_sched_callback(void)
+{
+	int ret;
+
+	ret = cpufreq_register_notifier(&notifier_policy_block,
+					CPUFREQ_POLICY_NOTIFIER);
+
+	if (!fast_switching)
+		ret = cpufreq_register_notifier(&notifier_trans_block,
+						CPUFREQ_TRANSITION_NOTIFIER);
+
+	return 0;
+}
+/*
+ * cpufreq callbacks can be registered at core_initcall or later time.
+ * Any registration done prior to that is "forgotten" by cpufreq. See
+ * initialization of variable init_cpufreq_transition_notifier_list_called
+ * for further information.
+ */
+core_initcall(register_sched_callback);
Why not use the mainline frequency-invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency-invariance needs? I.e., implement the FIE (Frequency Invariance Engine) in the arch code and link it to its users via '#define arch_scale_freq_capacity foo_scale_freq_capacity'.
[...]