This patch series implements an alternative window-assisted load tracking (WALT) mechanism in lieu of PELT-based cpu utilization tracking. Testing has shown that a window-based, non-decaying metric such as WALT guiding cpu frequency and task placement decisions can improve performance/power, especially when running workloads more commonly found on mobile devices. The aim of this series is to incorporate WALT accounting into the scheduler and feed WALT statistics to schedutil in order to guide cpu frequency selection. The implementation is detailed in the commit text of Patch 1. The eventual goal is to also guide placement decisions based on WALT statistics.
WALT has existed in out-of-tree kernels for commercial ARM/ARM64 devices for a few years. This is an effort to bring WALT to mainline as well as to test it on multiple architectures and with varied workloads.
This RFC version is mainly to preview what the code will look like on mainline. Future RFC revisions will include a theoretical discussion and benchmark results.
Tested on an Intel x86_64 machine (on top of 4.7-rc6). Benchmark results will be sent out separately for now, and will be included as part of this message in the next RFC revision.
Patch 1: Adds WALT tracking to the scheduler
Patches 2-3: Temporary patches to bring in EAS/sched-freq like capacity table and to use Intel PMC counters for more accurate frequency invariant load tracking on X86. Included for completeness but not meant for merging.
include/linux/sched.h | 35 ++++++++++ include/linux/sched/sysctl.h | 2 + include/trace/events/sched.h | 76 +++++++++++++++++++++ init/Kconfig | 9 +++ kernel/sched/Makefile | 1 + kernel/sched/core.c | 29 ++++++++- kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++- kernel/sched/cputime.c | 11 +++- kernel/sched/debug.c | 10 +++ kernel/sched/fair.c | 7 +- kernel/sched/sched.h | 13 ++++ kernel/sched/walt.c | 580 ++++++++++++++++++++++++++++++++++ kernel/sched/walt.h | 75 +++++++++++++++++++++ kernel/sysctl.c | 18 +++++ 14 files changed, 904 insertions(+), 6 deletions(-)
-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
From: Srivatsa Vaddagiri vatsa@codeaurora.org
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per-task and per-CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source, and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time (window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
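The normalization mentioned above is applied to each accounted execution delta; a minimal sketch, simplified from scale_exec_time() in this patch (rounding and 64-bit division helpers omitted):

	/* Scale raw ns of execution by the CPU's current frequency and its
	 * efficiency (IPC), both relative to the system-wide maxima. */
	delta = (delta * rq->cur_freq) / max_possible_freq;
	delta = (delta * rq->efficiency) / max_possible_efficiency;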
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this statistic; the assumption is that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., apply a margin) before deciding the next P-state.
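As a rough illustration of how a consumer turns this into a frequency request, cpu_walt_util() in this patch scales prev_runnable_sum by the CPU's original capacity over the window size, and schedutil then applies its usual ~25% margin (sketch only, simplified from the patch):

	/* busy ns in the last complete window -> utilization in [0..capacity] */
	util = (rq->prev_runnable_sum * rq->cpu_capacity_orig) / walt_ravg_window;

	/* schedutil's get_next_freq() adds its headroom as usual */
	next_freq = (max_freq + (max_freq >> 2)) * min(util, max) / max;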
Corresponding to the aggregate statistics, WALT also maintains the following stats per task:
curr_window - represents the cpu utilization of the task in its most recently tracked window
prev_window - represents the cpu utilization of the task in the window prior to the one being tracked by curr_window
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), on task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters. Execution time in an IRQ handler is accounted in a CPU's curr_runnable_sum statistic, provided that the CPU was also executing the idle task for the duration of the interrupt handler.
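A condensed view of that migration fixup, simplified from walt_fixup_busy_time() below (locking, the uninitialized-destination-window case and underflow checks omitted):

	/* Move the task's window contributions from source to destination. */
	src_rq->curr_runnable_sum  -= p->ravg.curr_window;
	src_rq->prev_runnable_sum  -= p->ravg.prev_window;
	dest_rq->curr_runnable_sum += p->ravg.curr_window;
	dest_rq->prev_runnable_sum += p->ravg.prev_window;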
Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU rq has tasks blocked on IO, idle-task execution is accounted in per-task and per-CPU counters. Setting walt_io_is_busy will also cause interrupt handlers in the idle task to update counters as if the idle task was executing (instead of just the interrupt handler execution time).
The major tunable provided by WALT is walt_ravg_window, which represents the window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy (described above) is set to 0 by default.
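For reference, with this patch the window size is set via the walt_ravg_window boot parameter and the WALT/PELT selection is a runtime sysctl; for example:

	# boot with a 10ms window instead of the 20ms default
	walt_ravg_window=10000000

	# fall back to PELT's util_avg at runtime
	echo 0 > /proc/sys/kernel/sched_use_walt_metrics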
Potential upcoming changes/improvements include: the use of sched_clock instead of ktime_get as a time source, support for an unsynchronized (across CPUs) time source, and integration with mainlined CPU efficiency APIs.
Signed-off-by: Srivatsa Vaddagiri vatsa@codeaurora.org Signed-off-by: Vikram Mulukutla markivx@codeaurora.org --- include/linux/sched.h | 35 +++ include/linux/sched/sysctl.h | 1 + include/trace/events/sched.h | 76 ++++++ init/Kconfig | 9 + kernel/sched/Makefile | 1 + kernel/sched/core.c | 28 +- kernel/sched/cpufreq_schedutil.c | 7 +- kernel/sched/cputime.c | 11 +- kernel/sched/debug.c | 10 + kernel/sched/fair.c | 7 +- kernel/sched/sched.h | 10 + kernel/sched/walt.c | 540 +++++++++++++++++++++++++++++++++++++++ kernel/sched/walt.h | 73 ++++++ kernel/sysctl.c | 9 + 14 files changed, 812 insertions(+), 5 deletions(-) create mode 100644 kernel/sched/walt.c create mode 100644 kernel/sched/walt.h
diff --git a/include/linux/sched.h b/include/linux/sched.h index 253538f..56e708f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -314,6 +314,17 @@ extern char ___assert_task_state[1 - 2*!!( /* Task command name length */ #define TASK_COMM_LEN 16
+enum task_event { + PUT_PREV_TASK = 0, + PICK_NEXT_TASK = 1, + TASK_WAKE = 2, + TASK_MIGRATE = 3, + TASK_UPDATE = 4, + IRQ_UPDATE = 5, +}; + +extern char *task_event_names[]; + #include <linux/spinlock.h>
/* @@ -1318,6 +1329,25 @@ struct sched_statistics { }; #endif
+#ifdef CONFIG_SCHED_WALT + +/* ravg represents capacity scaled cpu-usage of tasks */ +struct ravg { + /* + * 'mark_start' marks the most recent event for a task + * + * 'curr_window' represents task's cpu usage in its most recent + * window + * + * 'prev_window' represents task's cpu usage in the window prior + * to the one represented by 'curr_window' + */ + u64 mark_start; + u32 curr_window, prev_window; +}; +#endif + + struct sched_entity { struct load_weight load; /* for load-balancing */ struct rb_node run_node; @@ -1478,6 +1508,11 @@ struct task_struct { const struct sched_class *sched_class; struct sched_entity se; struct sched_rt_entity rt; + +#ifdef CONFIG_SCHED_WALT + struct ravg ravg; +#endif + #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; #endif diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 22db1e6..7007815 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -31,6 +31,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; +extern unsigned int sysctl_sched_use_walt_metrics;
#ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 9b90c57..2adf245 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -562,6 +562,82 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu) ); + +TRACE_EVENT(sched_walt_util, + + TP_PROTO(int cpu, unsigned long cfs_util_avg, unsigned long walt_util), + + TP_ARGS(cpu, cfs_util_avg, walt_util), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( unsigned long, cfs_util_avg ) + __field( unsigned long, walt_util ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->cfs_util_avg = cfs_util_avg; + __entry->walt_util = walt_util; + ), + + TP_printk("cpu %d cfs_util_avg %lu walt_util %lu", __entry->cpu, __entry->cfs_util_avg, __entry->walt_util) +); + +struct rq; + +TRACE_EVENT(sched_walt_update_task_ravg, + + TP_PROTO(struct task_struct *p, struct rq *rq, enum task_event evt, u64 wallclock, u64 irqtime), + + TP_ARGS(p, rq, evt, wallclock, irqtime), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( pid_t, cur_pid ) + __field(unsigned int, cpu ) + __field(unsigned int, cur_freq ) + __field( u64, wallclock ) + __field( u64, mark_start ) + __field( u64, win_start ) + __field( u64, irqtime ) + __field(enum task_event, evt ) + __field( u64, rq_cs ) + __field( u64, rq_ps ) + __field( u32, curr_window ) + __field( u32, prev_window ) + ), + + TP_fast_assign( + __entry->wallclock = wallclock; + __entry->win_start = rq->window_start; + __entry->evt = evt; + __entry->cpu = rq->cpu; + __entry->cur_pid = rq->curr->pid; + __entry->cur_freq = rq->cur_freq; + memcpy(__entry->comm, p->comm, TASK_COMM_LEN); + __entry->pid = p->pid; + __entry->mark_start = p->ravg.mark_start; + __entry->irqtime = irqtime; + __entry->rq_cs = rq->curr_runnable_sum; + __entry->rq_ps = rq->prev_runnable_sum; + __entry->curr_window = p->ravg.curr_window; + __entry->prev_window = p->ravg.prev_window; + ), + + TP_printk("wc %llu ws %llu event %s cpu %d cur_freq %u cur_pid %d task %d (%s) ms %llu irqtime %llu rq_cs %llu rq_ps %llu cur_window %u prev_window %u" + , __entry->wallclock, __entry->win_start, + task_event_names[__entry->evt], __entry->cpu, + __entry->cur_freq, __entry->cur_pid, + __entry->pid, __entry->comm, __entry->mark_start, + __entry->irqtime, + __entry->rq_cs, __entry->rq_ps, + __entry->curr_window, __entry->prev_window + ) +); + + #endif /* _TRACE_SCHED_H */
/* This part must be outside protection */ diff --git a/init/Kconfig b/init/Kconfig index f755a60..e259273 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -388,6 +388,15 @@ config IRQ_TIME_ACCOUNTING
endchoice
+config SCHED_WALT + bool "Support window based load tracking" + depends on SMP + help + This feature will allow the scheduler to maintain a tunable window + based set of metrics for tasks and runqueues. These metrics can be + used to guide task placement as well as task frequency requirements + for cpufreq governors. + config BSD_PROCESS_ACCT bool "BSD Process Accounting" depends on MULTIUSER diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 5e59b83..41ada04 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,6 +19,7 @@ obj-y += core.o loadavg.o clock.o cputime.o obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o obj-y += wait.o swait.o completion.o idle.o obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o +obj-$(CONFIG_SCHED_WALT) += walt.o obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 51d7105..068bde8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -90,6 +90,8 @@ #define CREATE_TRACE_POINTS #include <trace/events/sched.h>
+#include "walt.h" + DEFINE_MUTEX(sched_domains_mutex); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -1241,6 +1243,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) p->sched_class->migrate_task_rq(p); p->se.nr_migrations++; perf_event_task_migrate(p); + + walt_fixup_busy_time(p, new_cpu); }
__set_task_cpu(p, new_cpu); @@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ smp_cond_acquire(!p->on_cpu);
+ raw_spin_lock(&task_rq(p)->lock); + walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0); + raw_spin_unlock(&task_rq(p)->lock); + p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING;
@@ -2106,8 +2114,10 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie
trace_sched_waking(p);
- if (!task_on_rq_queued(p)) + if (!task_on_rq_queued(p)) { + walt_update_task_ravg(p, rq, TASK_WAKE, walt_ktime_clock(), 0); ttwu_activate(rq, p, ENQUEUE_WAKEUP); + }
ttwu_do_wakeup(rq, p, 0, cookie); if (schedstat_enabled()) @@ -2173,6 +2183,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->se.nr_migrations = 0; p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node); + walt_init_new_task_load(p);
#ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq = NULL; @@ -2540,6 +2551,8 @@ void wake_up_new_task(struct task_struct *p) rq = __task_rq_lock(p, &rf); post_init_entity_util_avg(&p->se);
+ walt_mark_task_starting(p); + activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); @@ -3023,6 +3036,8 @@ void scheduler_tick(void) update_rq_clock(rq); curr->sched_class->task_tick(rq, curr, 0); cpu_load_update_active(rq); + walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, + walt_ktime_clock(), 0); calc_global_load_tick(rq); raw_spin_unlock(&rq->lock);
@@ -3271,6 +3286,7 @@ static void __sched notrace __schedule(bool preempt) struct pin_cookie cookie; struct rq *rq; int cpu; + u64 wallclock;
cpu = smp_processor_id(); rq = cpu_rq(cpu); @@ -3334,6 +3350,9 @@ static void __sched notrace __schedule(bool preempt) update_rq_clock(rq);
next = pick_next_task(rq, prev, cookie); + wallclock = walt_ktime_clock(); + walt_update_task_ravg(prev, rq, PUT_PREV_TASK, wallclock, 0); + walt_update_task_ravg(next, rq, PICK_NEXT_TASK, wallclock, 0); clear_tsk_need_resched(prev); clear_preempt_need_resched(); rq->clock_skip_update = 0; @@ -7229,6 +7248,9 @@ int sched_cpu_deactivate(unsigned int cpu) static void sched_rq_cpu_starting(unsigned int cpu) { struct rq *rq = cpu_rq(cpu); + + if (!rq->window_start) + walt_set_window_start(rq);
rq->calc_load_update = calc_load_update; account_reset_rq(rq); @@ -7251,6 +7273,9 @@ int sched_cpu_dying(unsigned int cpu) /* Handle pending wakeups and then migrate everything off */ sched_ttwu_pending(); raw_spin_lock_irqsave(&rq->lock, flags); + + walt_migrate_sync_cpu(cpu); + if (rq->rd) { BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span)); set_rq_offline(rq); @@ -7270,6 +7295,7 @@ void __init sched_init_smp(void) { cpumask_var_t non_isolated_cpus;
+ walt_init_cpu_efficiency(); alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL); alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 14c4aa2..2eef34d 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -15,8 +15,10 @@ #include <linux/module.h> #include <linux/slab.h> #include <trace/events/power.h> +#include <trace/events/sched.h>
#include "sched.h" +#include "walt.h"
struct sugov_tunables { struct gov_attr_set attr_set; @@ -97,6 +99,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
policy->cur = next_freq; trace_cpu_frequency(next_freq, smp_processor_id()); + walt_freq_transition(smp_processor_id(), next_freq); } else if (sg_policy->next_freq != next_freq) { sg_policy->next_freq = next_freq; sg_policy->work_in_progress = true; @@ -125,7 +128,9 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, static unsigned int get_next_freq(struct cpufreq_policy *policy, unsigned long util, unsigned long max) { - unsigned int freq = arch_scale_freq_invariant() ? + int invariant = (sysctl_sched_use_walt_metrics || + arch_scale_freq_invariant()); + unsigned int freq = invariant ? policy->cpuinfo.max_freq : policy->cur;
return (freq + (freq >> 2)) * util / max; diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 75f98c5..af9cf3e 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -52,6 +52,8 @@ void irqtime_account_irq(struct task_struct *curr) unsigned long flags; s64 delta; int cpu; + u64 wallclock; + bool account = true;
if (!sched_clock_irqtime) return; @@ -59,7 +61,8 @@ void irqtime_account_irq(struct task_struct *curr) local_irq_save(flags);
cpu = smp_processor_id(); - delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time); + wallclock = sched_clock_cpu(cpu); + delta = wallclock - __this_cpu_read(irq_start_time); __this_cpu_add(irq_start_time, delta);
irq_time_write_begin(); @@ -73,8 +76,14 @@ void irqtime_account_irq(struct task_struct *curr) __this_cpu_add(cpu_hardirq_time, delta); else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) __this_cpu_add(cpu_softirq_time, delta); + else + account = false;
irq_time_write_end(); + + if (account && is_idle_task(curr)) + walt_account_irqtime(cpu, curr, delta, wallclock); + local_irq_restore(flags); } EXPORT_SYMBOL_GPL(irqtime_account_irq); diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 0368c39..3fe8b89 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -607,6 +607,16 @@ do { \ P(nr_switches); P(nr_load_updates); P(nr_uninterruptible); +#ifdef CONFIG_SMP + P(cpu_capacity_orig); + P(cpu_capacity); +#ifdef CONFIG_SCHED_WALT + P(window_start); + P(cur_freq); + P(curr_runnable_sum); + P(prev_runnable_sum); +#endif +#endif PN(next_balance); SEQ_printf(m, " .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr))); PN(clock); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bdcbeea..8724299 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -29,11 +29,13 @@ #include <linux/interrupt.h> #include <linux/mempolicy.h> #include <linux/migrate.h> +#include <linux/module.h> #include <linux/task_work.h>
#include <trace/events/sched.h>
#include "sched.h" +#include "walt.h"
/* * Targeted preemption latency for CPU-bound tasks: @@ -2882,6 +2884,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) { unsigned long max = rq->cpu_capacity_orig; + unsigned long util = cpu_walt_util(rq);
/* * There are a few boundary cases this might miss but it should @@ -2899,8 +2902,8 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) * * See cpu_util(). */ - cpufreq_update_util(rq_clock(rq), - min(cfs_rq->avg.util_avg, max), max); + + cpufreq_update_util(rq_clock(rq), min(util, max), max); } }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7cbeb92..52a0ac5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -663,6 +663,16 @@ struct rq { u64 max_idle_balance_cost; #endif
+#ifdef CONFIG_SCHED_WALT + unsigned int cur_freq; + struct cpumask freq_domain_cpumask; + + int efficiency; /* Differentiate cpus with different IPC capability */ + u64 window_start; + u64 curr_runnable_sum; + u64 prev_runnable_sum; +#endif /* CONFIG_SCHED_WALT */ + #ifdef CONFIG_IRQ_TIME_ACCOUNTING u64 prev_irq_time; #endif diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c new file mode 100644 index 0000000..203e02d --- /dev/null +++ b/kernel/sched/walt.c @@ -0,0 +1,540 @@ +/* + * Copyright (c) 2016, The Linux Foundation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 and + * only version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + * Window Assisted Load Tracking (WALT) implementation credits: + * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park, + * Pavan Kumar Kondeti, Olav Haugan + * + * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla + * and Todd Kjos + * 2016-08-31: Integration with mainline by Srivatsa Vaddagiri + * and Vikram Mulukutla + */ + +#include <linux/syscore_ops.h> +#include <linux/cpufreq.h> +#include <trace/events/sched.h> +#include "sched.h" +#include "walt.h" + + +char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK", + "TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE", + "IRQ_UPDATE"}; + +__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1; + +static __read_mostly unsigned int walt_freq_account_wait_time; +static __read_mostly unsigned int walt_io_is_busy; + +/* 1 -> use PELT based load stats, 0 -> use window-based load stats */ +static unsigned int __read_mostly walt_disabled; + +static unsigned int max_possible_efficiency = 1024; + +/* + * Maximum possible frequency across all cpus. Task demand and cpu + * capacity (cpu_power) metrics are scaled in reference to it. 
+ */ +static unsigned int max_possible_freq = 1; + +/* Window size (in ns) */ +__read_mostly unsigned int walt_ravg_window = 20000000; + +/* Min window size (in ns) = 10ms */ +#define MIN_SCHED_RAVG_WINDOW 10000000 + +/* Max window size (in ns) = 1s */ +#define MAX_SCHED_RAVG_WINDOW 1000000000 + +static unsigned int sync_cpu; +static ktime_t ktime_last; +static bool walt_ktime_suspended; + +u64 walt_ktime_clock(void) +{ + if (unlikely(walt_ktime_suspended)) + return ktime_to_ns(ktime_last); + return ktime_get_ns(); +} + +static void walt_resume(void) +{ + walt_ktime_suspended = false; +} + +static int walt_suspend(void) +{ + ktime_last = ktime_get(); + walt_ktime_suspended = true; + return 0; +} + +static struct syscore_ops walt_syscore_ops = { + .resume = walt_resume, + .suspend = walt_suspend +}; + +static int __init walt_init_ops(void) +{ + register_syscore_ops(&walt_syscore_ops); + return 0; +} +late_initcall(walt_init_ops); + +static int __init set_walt_ravg_window(char *str) +{ + get_option(&str, &walt_ravg_window); + + walt_disabled = (walt_ravg_window < MIN_SCHED_RAVG_WINDOW || + walt_ravg_window > MAX_SCHED_RAVG_WINDOW); + return 0; +} + +early_param("walt_ravg_window", set_walt_ravg_window); + +static void +update_window_start(struct rq *rq, u64 wallclock) +{ + s64 delta; + int nr_windows; + u64 prev_sum = 0; + + delta = wallclock - rq->window_start; + BUG_ON(delta < 0); + if (delta < walt_ravg_window) + return; + + nr_windows = div64_u64(delta, walt_ravg_window); + if (nr_windows == 1) + prev_sum = rq->curr_runnable_sum; + + rq->prev_runnable_sum = prev_sum; + rq->curr_runnable_sum = 0; + + rq->window_start += (u64)nr_windows * (u64)walt_ravg_window; +} + +static u64 scale_exec_time(u64 delta, struct rq *rq) +{ + unsigned int cur_freq = rq->cur_freq; + int sf; + + /* round up div64 */ + delta = div64_u64(delta * cur_freq + max_possible_freq - 1, + max_possible_freq); + + sf = DIV_ROUND_UP(rq->efficiency * 1024, max_possible_efficiency); + + delta *= sf; + delta >>= 10; + + return delta; +} + +static int cpu_is_waiting_on_io(struct rq *rq) +{ + if (!walt_io_is_busy) + return 0; + + return atomic_read(&rq->nr_iowait); +} + +static int account_cpu_busy_time(struct rq *rq, struct task_struct *p, + u64 irqtime, int event) +{ + if (is_idle_task(p)) { + /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! 
*/ + if (event == PICK_NEXT_TASK) + return 0; + + /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */ + return irqtime || cpu_is_waiting_on_io(rq); + } + + if (event == TASK_WAKE) + return 0; + + if (event == PUT_PREV_TASK || event == IRQ_UPDATE || + event == TASK_UPDATE) + return 1; + + /* Only TASK_MIGRATE && PICK_NEXT_TASK left */ + return walt_freq_account_wait_time; +} + +/* + * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum) + */ +static void update_cpu_busy_time(struct task_struct *p, struct rq *rq, + int event, u64 wallclock, u64 irqtime) +{ + int new_window, nr_full_windows = 0; + u64 mark_start = p->ravg.mark_start; + u64 window_start = rq->window_start; + u32 window_size = walt_ravg_window; + u64 delta; + + new_window = mark_start < window_start; + if (new_window) + nr_full_windows = div64_u64((window_start - mark_start), + window_size); + + /* Handle window rollover */ + if (new_window) { + if (!is_idle_task(p)) { + u32 curr_window = 0; + + if (!nr_full_windows) + curr_window = p->ravg.curr_window; + + p->ravg.prev_window = curr_window; + p->ravg.curr_window = 0; + } + } + + if (!account_cpu_busy_time(rq, p, irqtime, event)) + return; + + if (!new_window) { + if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) + delta = wallclock - mark_start; + else + delta = irqtime; + delta = scale_exec_time(delta, rq); + rq->curr_runnable_sum += delta; + if (!is_idle_task(p)) + p->ravg.curr_window += delta; + + return; + } + + if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) { + if (!nr_full_windows) { + /* A full window hasn't elapsed, account partial + * contribution to previous completed window. */ + delta = scale_exec_time(window_start - mark_start, rq); + p->ravg.prev_window += delta; + } else { + /* Since at least one full window has elapsed, + * the contribution to the previous window is the + * full window (window_size). */ + delta = scale_exec_time(window_size, rq); + p->ravg.prev_window = delta; + } + rq->prev_runnable_sum += delta; + + /* Account piece of busy time in the current window. */ + delta = scale_exec_time(wallclock - window_start, rq); + rq->curr_runnable_sum += delta; + p->ravg.curr_window = delta; + + return; + } + + if (irqtime) { + /* IRQ busy time start = wallclock - irqtime */ + mark_start = wallclock - irqtime; + + if (mark_start > window_start) { + rq->curr_runnable_sum += scale_exec_time(irqtime, rq); + return; + } + + /* + * IRQ busy time spanned multiple windows. Process the + * busy time preceding the current window first + */ + delta = window_start - mark_start; + if (delta > window_size) + delta = window_size; + delta = scale_exec_time(delta, rq); + rq->prev_runnable_sum += delta; + + /* Process the remaining IRQ busy time in the current window. 
*/ + delta = wallclock - window_start; + rq->curr_runnable_sum += scale_exec_time(delta, rq); + + return; + } + + BUG(); +} + +/* Reflect task activity on its demand and cpu's busy time statistics */ +void walt_update_task_ravg(struct task_struct *p, struct rq *rq, + enum task_event event, u64 wallclock, u64 irqtime) +{ + if (walt_disabled || !rq->window_start) + return; + + lockdep_assert_held(&rq->lock); + + update_window_start(rq, wallclock); + + if (!p->ravg.mark_start) + goto done; + + update_cpu_busy_time(p, rq, event, wallclock, irqtime); + +done: + trace_sched_walt_update_task_ravg(p, rq, event, wallclock, irqtime); + + p->ravg.mark_start = wallclock; +} + +unsigned long __weak arch_get_cpu_efficiency(int cpu) +{ + return 1024; +} + +void walt_init_cpu_efficiency(void) +{ + int i, efficiency; + unsigned int max = 0; + + for_each_possible_cpu(i) { + efficiency = arch_get_cpu_efficiency(i); + cpu_rq(i)->efficiency = efficiency; + + if (efficiency > max) + max = efficiency; + } + + if (max) + max_possible_efficiency = max; +} + +void walt_mark_task_starting(struct task_struct *p) +{ + u64 wallclock; + struct rq *rq = task_rq(p); + + if (!rq->window_start) + return; + + wallclock = walt_ktime_clock(); + p->ravg.mark_start = wallclock; +} + +void walt_set_window_start(struct rq *rq) +{ + int cpu = cpu_of(rq); + struct rq *sync_rq = cpu_rq(sync_cpu); + unsigned long flags; + + if (!rq->cur_freq || rq->window_start || + walt_ktime_clock() < walt_ravg_window) + return; + + if (cpu == sync_cpu) { + raw_spin_lock_irqsave(&rq->lock, flags); + rq->window_start = walt_ktime_clock(); + rq->curr_runnable_sum = rq->prev_runnable_sum = 0; + raw_spin_unlock_irqrestore(&rq->lock, flags); + } else { + local_irq_save(flags); + double_rq_lock(rq, sync_rq); + rq->window_start = cpu_rq(sync_cpu)->window_start; + rq->curr_runnable_sum = rq->prev_runnable_sum = 0; + double_rq_unlock(rq, sync_rq); + local_irq_restore(flags); + } +} + +void walt_migrate_sync_cpu(int cpu) +{ + if (cpu == sync_cpu) + sync_cpu = smp_processor_id(); +} + +void walt_fixup_busy_time(struct task_struct *p, int new_cpu) +{ + struct rq *src_rq = task_rq(p); + struct rq *dest_rq = cpu_rq(new_cpu); + u64 wallclock; + + if (!p->on_rq && p->state != TASK_WAKING) + return; + + if (p->state == TASK_WAKING) + double_rq_lock(src_rq, dest_rq); + else + double_lock_balance(src_rq, dest_rq); + + /* Note that same wallclock reference is used for all 3 events below */ + wallclock = walt_ktime_clock(); + + /* Update counters on both cpus first */ + walt_update_task_ravg(task_rq(p)->curr, task_rq(p), + TASK_UPDATE, wallclock, 0); + walt_update_task_ravg(dest_rq->curr, dest_rq, + TASK_UPDATE, wallclock, 0); + + /* Update task's counters */ + walt_update_task_ravg(p, task_rq(p), TASK_MIGRATE, wallclock, 0); + + /* Fixup busy time */ + if (p->ravg.curr_window) { + src_rq->curr_runnable_sum -= p->ravg.curr_window; + if (!dest_rq->window_start) { + p->ravg.curr_window = 0; + p->ravg.mark_start = 0; + } + dest_rq->curr_runnable_sum += p->ravg.curr_window; + } + + if (p->ravg.prev_window) { + src_rq->prev_runnable_sum -= p->ravg.prev_window; + if (!dest_rq->window_start) + p->ravg.prev_window = 0; + dest_rq->prev_runnable_sum += p->ravg.prev_window; + } + + if ((s64)src_rq->prev_runnable_sum < 0) { + src_rq->prev_runnable_sum = 0; + WARN_ON(1); + } + if ((s64)src_rq->curr_runnable_sum < 0) { + src_rq->curr_runnable_sum = 0; + WARN_ON(1); + } + + if (p->state == TASK_WAKING) + double_rq_unlock(src_rq, dest_rq); + else + double_unlock_balance(src_rq, 
dest_rq); +} + +static int cpufreq_notifier_policy(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_policy *policy = (struct cpufreq_policy *)data; + int i; + + if (val != CPUFREQ_NOTIFY) + return 0; + + for_each_cpu(i, policy->related_cpus) { + cpumask_copy(&cpu_rq(i)->freq_domain_cpumask, + policy->related_cpus); + cpu_rq(i)->cur_freq = policy->cur; + if (!cpu_rq(i)->window_start) + walt_set_window_start(cpu_rq(i)); + } + + max_possible_freq = max(max_possible_freq, policy->cpuinfo.max_freq); + + return 0; +} + +void walt_account_irqtime(int cpu, struct task_struct *curr, + u64 delta, u64 wallclock) +{ + struct rq *rq = cpu_rq(cpu); + unsigned long flags; + + raw_spin_lock_irqsave(&rq->lock, flags); + + /* + * cputime (wallclock) uses sched_clock so use the same here for + * consistency. + */ + delta += sched_clock_cpu(cpu) - wallclock; + + walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta); + + raw_spin_unlock_irqrestore(&rq->lock, flags); +} + +int fast_switching = 1; + +void walt_freq_transition(int cpu, unsigned long new_freq) +{ + int i; + unsigned long flags; + + local_irq_save(flags); + for_each_cpu(i, &cpu_rq(cpu)->freq_domain_cpumask) { + struct rq *rq = cpu_rq(i); + + if (!fast_switching || + (fast_switching && smp_processor_id() != i)) + raw_spin_lock(&rq->lock); + + walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, + walt_ktime_clock(), 0); + rq->cur_freq = new_freq; + + if (!fast_switching || + (fast_switching && smp_processor_id() != i)) + raw_spin_unlock(&rq->lock); + if (!rq->window_start) + walt_set_window_start(rq); + } + local_irq_restore(flags); +} + +static int cpufreq_notifier_trans(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_freqs *freq = (struct cpufreq_freqs *)data; + unsigned int cpu = freq->cpu, new_freq = freq->new; + + if (val != CPUFREQ_POSTCHANGE) + return 0; + + BUG_ON(!new_freq); + + if (cpu_rq(cpu)->cur_freq == new_freq) + return 0; + + walt_freq_transition(cpu, new_freq); + + return 0; +} + +static struct notifier_block notifier_policy_block = { + .notifier_call = cpufreq_notifier_policy +}; + +static struct notifier_block notifier_trans_block = { + .notifier_call = cpufreq_notifier_trans +}; + +static int register_sched_callback(void) +{ + int ret; + + ret = cpufreq_register_notifier(¬ifier_policy_block, + CPUFREQ_POLICY_NOTIFIER); + + if (!fast_switching) + ret = cpufreq_register_notifier(¬ifier_trans_block, + CPUFREQ_TRANSITION_NOTIFIER); + + return 0; +} + +/* + * cpufreq callbacks can be registered at core_initcall or later time. + * Any registration done prior to that is "forgotten" by cpufreq. See + * initialization of variable init_cpufreq_transition_notifier_list_called + * for further information. + */ +core_initcall(register_sched_callback); + +void walt_init_new_task_load(struct task_struct *p) +{ + memset(&p->ravg, 0, sizeof(struct ravg)); +} diff --git a/kernel/sched/walt.h b/kernel/sched/walt.h new file mode 100644 index 0000000..5b03995 --- /dev/null +++ b/kernel/sched/walt.h @@ -0,0 +1,73 @@ +/* + * Copyright (c) 2016, The Linux Foundation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 and + * only version 2 as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#ifndef __WALT_H +#define __WALT_H + +#ifdef CONFIG_SCHED_WALT + +void walt_update_task_ravg(struct task_struct *p, struct rq *rq, enum task_event event, + u64 wallclock, u64 irqtime); +void walt_fixup_busy_time(struct task_struct *p, int new_cpu); +void walt_init_new_task_load(struct task_struct *p); +void walt_mark_task_starting(struct task_struct *p); +void walt_set_window_start(struct rq *rq); +void walt_migrate_sync_cpu(int cpu); +void walt_init_cpu_efficiency(void); +u64 walt_ktime_clock(void); +void walt_account_irqtime(int cpu, struct task_struct *curr, u64 delta, + u64 wallclock); +void walt_freq_transition(int cpu, unsigned long new_freq); + +extern unsigned int sysctl_sched_use_walt_metrics; +extern unsigned int walt_ravg_window; + +/* Fold into cpu_util */ +static inline unsigned long cpu_walt_util(struct rq *rq) +{ + trace_sched_walt_util(cpu_of(rq), rq->cfs.avg.util_avg, rq->prev_runnable_sum * + rq->cpu_capacity_orig / walt_ravg_window); + + if (!sysctl_sched_use_walt_metrics) + return rq->cfs.avg.util_avg; + + return (rq->prev_runnable_sum * rq->cpu_capacity_orig) / + walt_ravg_window; +} + +#else /* CONFIG_SCHED_WALT */ + +static inline void walt_update_task_ravg(struct task_struct *p, struct rq *rq, + int event, u64 wallclock, u64 irqtime) { } +static inline void walt_fixup_busy_time(struct task_struct *p, int new_cpu) { } +static inline void walt_init_new_task_load(struct task_struct *p) { } +static inline void walt_mark_task_starting(struct task_struct *p) { } +static inline void walt_set_window_start(struct rq *rq) { } +static inline void walt_migrate_sync_cpu(int cpu) { } +static inline void walt_init_cpu_efficiency(void) { } +static inline u64 walt_ktime_clock(void) { return 0; } +static inline void walt_account_irqtime(int cpu, struct task_struct *curr, + u64 delta, u64 wallclock) { } + +static inline unsigned long cpu_walt_util(struct rq *rq) +{ + return rq->cfs.avg.util_avg; +} + +static inline void walt_freq_transition(int cpu, unsigned long new_freq) { }; + +#endif /* CONFIG_SCHED_WALT */ + +extern unsigned int walt_ravg_window; + +#endif diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 87b2fc3..4669a34 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -439,6 +439,15 @@ static struct ctl_table kern_table[] = { .extra2 = &one, }, #endif + { + .procname = "sched_use_walt_metrics", + .data = &sysctl_sched_use_walt_metrics, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, #ifdef CONFIG_CFS_BANDWIDTH { .procname = "sched_cfs_bandwidth_slice_us", -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Srivatsa Vaddagiri vatsa@codeaurora.org
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Corresponding to the aggregate statistics, WALT also maintains the following stats per task:
curr_window - represents the cpu utilization of the task in its most recently tracked window prev_window - represents cpu utilization of task in the window prior to the one being tracked by curr_window
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), on task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
Execution time in an IRQ handler is accounted in a CPU's curr_runnable_sum statistic, provided that the CPU was also executing the idle task for the duration of the interrupt handler.
Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU rq has tasks blocked on IO, idle-task execution is accounted in per-task and per-CPU counters. Setting walt_io_is_busy will also cause interrupt handlers in the idle task to update counters as if the idle task was executing (instead of just the interrupt handler execution time).
The major tunable provided by WALT is walt_ravg_window, which represents window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy (described above) is set to 0 by default.
Potential upcoming changes/improvements include: the use of sched_clock instead of ktime_get as a time source, support for an unsynchronized (across CPUs) time source, and integration with mainlined CPU efficiency APIs.
Signed-off-by: Srivatsa Vaddagiri vatsa@codeaurora.org Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
include/linux/sched.h | 35 +++ include/linux/sched/sysctl.h | 1 + include/trace/events/sched.h | 76 ++++++ init/Kconfig | 9 + kernel/sched/Makefile | 1 + kernel/sched/core.c | 28 +- kernel/sched/cpufreq_schedutil.c | 7 +- kernel/sched/cputime.c | 11 +- kernel/sched/debug.c | 10 + kernel/sched/fair.c | 7 +- kernel/sched/sched.h | 10 + kernel/sched/walt.c | 540 +++++++++++++++++++++++++++++++++++++++ kernel/sched/walt.h | 73 ++++++ kernel/sysctl.c | 9 + 14 files changed, 812 insertions(+), 5 deletions(-) create mode 100644 kernel/sched/walt.c create mode 100644 kernel/sched/walt.h
diff --git a/include/linux/sched.h b/include/linux/sched.h index 253538f..56e708f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -314,6 +314,17 @@ extern char ___assert_task_state[1 - 2*!!( /* Task command name length */ #define TASK_COMM_LEN 16
+enum task_event {
- PUT_PREV_TASK = 0,
- PICK_NEXT_TASK = 1,
- TASK_WAKE = 2,
- TASK_MIGRATE = 3,
- TASK_UPDATE = 4,
- IRQ_UPDATE = 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running. Is there a benefit, or are they used in account_cpu_busy_time() essentially as running/not running?
[...]
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 9b90c57..2adf245 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -562,6 +562,82 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu) );
+TRACE_EVENT(sched_walt_util,
- TP_PROTO(int cpu, unsigned long cfs_util_avg, unsigned long walt_util),
- TP_ARGS(cpu, cfs_util_avg, walt_util),
- TP_STRUCT__entry(
__field( int, cpu )
__field( unsigned long, cfs_util_avg )
__field( unsigned long, walt_util )
- ),
- TP_fast_assign(
__entry->cpu = cpu;
__entry->cfs_util_avg = cfs_util_avg;
__entry->walt_util = walt_util;
- ),
- TP_printk("cpu %d cfs_util_avg %lu walt_util %lu", __entry->cpu, __entry->cfs_util_avg, __entry->walt_util)
+);
+struct rq;
+TRACE_EVENT(sched_walt_update_task_ravg,
- TP_PROTO(struct task_struct *p, struct rq *rq, enum task_event evt, u64 wallclock, u64 irqtime),
- TP_ARGS(p, rq, evt, wallclock, irqtime),
- TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( pid_t, cur_pid )
__field(unsigned int, cpu )
__field(unsigned int, cur_freq )
__field( u64, wallclock )
__field( u64, mark_start )
__field( u64, win_start )
__field( u64, irqtime )
__field(enum task_event, evt )
__field( u64, rq_cs )
__field( u64, rq_ps )
__field( u32, curr_window )
__field( u32, prev_window )
- ),
- TP_fast_assign(
__entry->wallclock = wallclock;
__entry->win_start = rq->window_start;
__entry->evt = evt;
__entry->cpu = rq->cpu;
__entry->cur_pid = rq->curr->pid;
__entry->cur_freq = rq->cur_freq;
memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
__entry->pid = p->pid;
__entry->mark_start = p->ravg.mark_start;
__entry->irqtime = irqtime;
__entry->rq_cs = rq->curr_runnable_sum;
__entry->rq_ps = rq->prev_runnable_sum;
__entry->curr_window = p->ravg.curr_window;
__entry->prev_window = p->ravg.prev_window;
- ),
- TP_printk("wc %llu ws %llu event %s cpu %d cur_freq %u cur_pid %d task %d (%s) ms %llu irqtime %llu rq_cs %llu rq_ps %llu cur_window %u prev_window %u"
, __entry->wallclock, __entry->win_start,
task_event_names[__entry->evt], __entry->cpu,
__entry->cur_freq, __entry->cur_pid,
__entry->pid, __entry->comm, __entry->mark_start,
__entry->irqtime,
__entry->rq_cs, __entry->rq_ps,
__entry->curr_window, __entry->prev_window
)
+);
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
IMHO, trace_events can go into another patch.
If you try to compile w/o CONFIG_SCHED_WALT they break the build.
[...]
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ smp_cond_acquire(!p->on_cpu);
- raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
[...]
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c new file mode 100644 index 0000000..203e02d --- /dev/null +++ b/kernel/sched/walt.c @@ -0,0 +1,540 @@ +/*
- Copyright (c) 2016, The Linux Foundation. All rights reserved.
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 and
- only version 2 as published by the Free Software Foundation.
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
- Window Assisted Load Tracking (WALT) implementation credits:
- Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
- Pavan Kumar Kondeti, Olav Haugan
- 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
and Todd Kjos
- 2016-08-31: Integration with mainline by Srivatsa Vaddagiri
and Vikram Mulukutla
- */
+#include <linux/syscore_ops.h> +#include <linux/cpufreq.h> +#include <trace/events/sched.h> +#include "sched.h" +#include "walt.h"
+char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK",
"TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE",
"IRQ_UPDATE"};
+__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1;
+static __read_mostly unsigned int walt_freq_account_wait_time; +static __read_mostly unsigned int walt_io_is_busy;
+/* 1 -> use PELT based load stats, 0 -> use window-based load stats */ +static unsigned int __read_mostly walt_disabled;
+static unsigned int max_possible_efficiency = 1024;
+/*
- Maximum possible frequency across all cpus. Task demand and cpu
- capacity (cpu_power) metrics are scaled in reference to it.
- */
+static unsigned int max_possible_freq = 1;
+/* Window size (in ns) */ +__read_mostly unsigned int walt_ravg_window = 20000000;
+/* Min window size (in ns) = 10ms */ +#define MIN_SCHED_RAVG_WINDOW 10000000
+/* Max window size (in ns) = 1s */ +#define MAX_SCHED_RAVG_WINDOW 1000000000
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
[...]
+unsigned long __weak arch_get_cpu_efficiency(int cpu) +{
- return 1024;
+}
So this should be connected to arch_scale_cpu_capacity(NULL, cpu) instead. You mentioned this already somewhere.
+void walt_init_cpu_efficiency(void) +{
- int i, efficiency;
- unsigned int max = 0;
- for_each_possible_cpu(i) {
efficiency = arch_get_cpu_efficiency(i);
cpu_rq(i)->efficiency = efficiency;
if (efficiency > max)
max = efficiency;
- }
- if (max)
max_possible_efficiency = max;
+}
[...]
+void walt_account_irqtime(int cpu, struct task_struct *curr,
u64 delta, u64 wallclock)
+{
- struct rq *rq = cpu_rq(cpu);
- unsigned long flags;
- raw_spin_lock_irqsave(&rq->lock, flags);
- /*
* cputime (wallclock) uses sched_clock so use the same here for
* consistency.
*/
- delta += sched_clock_cpu(cpu) - wallclock;
- walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+int fast_switching = 1;
An assumption that you're on x86. Should be mentioned somewhere in the patch-header.
+void walt_freq_transition(int cpu, unsigned long new_freq) +{
- int i;
- unsigned long flags;
- local_irq_save(flags);
- for_each_cpu(i, &cpu_rq(cpu)->freq_domain_cpumask) {
struct rq *rq = cpu_rq(i);
if (!fast_switching ||
(fast_switching && smp_processor_id() != i))
raw_spin_lock(&rq->lock);
walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
walt_ktime_clock(), 0);
rq->cur_freq = new_freq;
if (!fast_switching ||
(fast_switching && smp_processor_id() != i))
raw_spin_unlock(&rq->lock);
if (!rq->window_start)
walt_set_window_start(rq);
- }
- local_irq_restore(flags);
+}
+static int cpufreq_notifier_trans(struct notifier_block *nb,
unsigned long val, void *data)
+{
- struct cpufreq_freqs *freq = (struct cpufreq_freqs *)data;
- unsigned int cpu = freq->cpu, new_freq = freq->new;
- if (val != CPUFREQ_POSTCHANGE)
return 0;
- BUG_ON(!new_freq);
- if (cpu_rq(cpu)->cur_freq == new_freq)
return 0;
- walt_freq_transition(cpu, new_freq);
- return 0;
+}
+static struct notifier_block notifier_policy_block = {
- .notifier_call = cpufreq_notifier_policy
+};
+static struct notifier_block notifier_trans_block = {
- .notifier_call = cpufreq_notifier_trans
+};
+static int register_sched_callback(void) +{
- int ret;
- ret = cpufreq_register_notifier(¬ifier_policy_block,
CPUFREQ_POLICY_NOTIFIER);
- if (!fast_switching)
ret = cpufreq_register_notifier(¬ifier_trans_block,
CPUFREQ_TRANSITION_NOTIFIER);
- return 0;
+}
+/*
- cpufreq callbacks can be registered at core_initcall or later time.
- Any registration done prior to that is "forgotten" by cpufreq. See
- initialization of variable init_cpufreq_transition_notifier_list_called
- for further information.
- */
+core_initcall(register_sched_callback);
Why not use the mainline frequency invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency invariance needs, i.e. implement the FIE (Frequency Invariance Engine) in the arch and link it to its users by '#define arch_scale_freq_capacity foo_scale_freq_capacity'?
[...]
* Dietmar Eggemann dietmar.eggemann@arm.com [2016-09-13 12:50:15]:
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
Anyone that has an interest in WALT data, mainly cpufreq governors. Also, a minor correction here - I don't think we envisioned any registration mechanism here to share WALT data. In our production kernel, the cpufreq (interactive) governor pulls the data whenever it needs it by calling an exported function of the scheduler.
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Not sure I follow you here. Which "former" paragraph is being referred to here?
To add some clarity on "policy" stuff Vikram is referring to here, prev_runnable_sum refers to actual busy time incurred in previous window. How that is used to decide on next frequency involves consideration of desired headroom or idle time. For example, a CPU that was busy for 99% of the previous window when it was running at some frequency f1, may or may not result in a frequency increase for the next window, depending on the "idle time" goals set by user (which is the policy aspect involved here).
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e, in context_switch), task wakeup and during task migration. Migration simply involves removing a tasks's curr_window and prev_window from the source CPU's curr_runnable sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
I think the main point of difference between PELT and WALT is the speed at which the "forecasted" demand of a task can be updated. For example, consider a task that was using 1% of a cpu for a long time and was hence considered a little task, fit to be run on a little cpu. It then experiences a sudden burst of work, which makes it consume 100% of a little cpu (say at the little cpu's max frequency). The speed with which we can classify the task as needing a big cpu is faster with WALT than with PELT afaics. With WALT, it can be classified as a 'big' task as soon as one window completes (typically 10-20ms latency), whereas with PELT it will take longer. It could however be argued that the task may not exhibit the same high demand in the next window (and hence the conservative approach of PELT is better), but from our experience with mobile workloads (especially those that involve graphics processing, which is largely window/frame-based) the window-based approach to predicting the immediate cpu/frequency needs of a task has worked better. I believe Google has also done a fair amount of validation on the usefulness of WALT (for mobile workloads), so that we have now come to the point of discussing the approach more widely.
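A back-of-the-envelope comparison of the two ramp-up times, assuming PELT's standard 32ms half-life and ignoring the util_avg divider details:

	pelt_util(t) ~= 1 - 2^(-t / 32ms)             (task running continuously)
	t needed to reach ~80% of capacity ~= 32ms * log2(5) ~= 74ms

	WALT: the full busy window is visible after one walt_ravg_window,
	      i.e. 20ms with the default window size.

The exact PELT numbers depend on implementation details, so treat these as orders of magnitude only.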
+enum task_event {
- PUT_PREV_TASK = 0,
- PICK_NEXT_TASK = 1,
- TASK_WAKE = 2,
- TASK_MIGRATE = 3,
- TASK_UPDATE = 4,
- IRQ_UPDATE = 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running? Is there a benefit or are they used in account_cpu_busy_time() essentially as running/not running ?
I think we wanted some ability to discount wait-time from reflecting in a task's cpu demand/usage. In the PICK_NEXT_TASK event, for example, we can ignore the task's wait time (despite it being in running status) and not add that as cpu/task busy time. It appears possible to achieve that goal by depending on just the on_rq and on_cpu fields of a task (and do away with the various event types that we currently have). We will explore that optimization for the next iteration.
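A rough sketch of that idea (hypothetical helper, not part of this patch): p->on_cpu says the task is executing right now, while p->on_rq only says it is queued, so something like

	/* Hypothetical: count busy time only while the task actually runs,
	 * discounting runnable-but-waiting time (cf. PICK_NEXT_TASK above). */
	static inline bool walt_counts_as_busy(struct task_struct *p)
	{
		return p->on_cpu;
	}

could replace the explicit event types for the running/not-running distinction.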
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ smp_cond_acquire(!p->on_cpu);
- raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
rq->lock needs to be held before calling walt_update_task_ravg() and so this locking is unavoidable.
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
WALT, as implemented currently, depends on maintaining windows that are synchronized across all cpus. This helps adjust cpu busy counters when tasks migrate. This requires that we have a hardware clock synchronized across cpus (not a hard requirement, but a nice-to-have one, easily met on ARM) and also that one cpu be nominated as the "reference" for the window_start of the currently active window. cpu0 is the default sync_cpu and its window_start is initialized during bootup, based on the hardware clock (sched_clock()) value seen at that point. CPU0's window_start value is advanced periodically to reflect the expiration of windows of fixed size. Other cpus' window_start values are initialized later during bootup in reference to CPU0's window_start value. Once a secondary cpu's window_start has been 'synchronized' with CPU0's window_start, there is no further synchronization required later.
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
Why not use the mainline frequency invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency invariance needs, i.e. implement the FIE (Frequency Invariance Engine) in the arch and link it to its users by '#define arch_scale_freq_capacity foo_scale_freq_capacity'?
Yes that's on our todo-list (to adopt upstream available mechanisms for scaling cpu stats).
- vatsa
Hi,
On 15/09/16 20:13, Srivatsa Vaddagiri wrote:
- Dietmar Eggemann dietmar.eggemann@arm.com [2016-09-13 12:50:15]:
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
Anyone that has an interest in WALT data, mainly cpufreq governors. Also, a minor correction here - I don't think we envisioned any registration mechanism here to share WALT data. In our production kernel, the cpufreq (interactive) governor pulls the data whenever it needs it by calling an exported function of the scheduler.
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Not sure I follow you here. Which "former" paragraph is being referred to here?
To add some clarity on "policy" stuff Vikram is referring to here, prev_runnable_sum refers to actual busy time incurred in previous window. How that is used to decide on next frequency involves consideration of desired headroom or idle time. For example, a CPU that was busy for 99% of the previous window when it was running at some frequency f1, may or may not result in a frequency increase for the next window, depending on the "idle time" goals set by user (which is the policy aspect involved here).
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
I think the main point of difference between PELT and WALT is the speed at which the "forecasted" demand of a task can be updated. For example, consider a task that was using 1% of cpu for a long time and hence was considered a little task, fit to be run on a little cpu. It experiences a sudden burst of work, which makes it consume 100% of a little cpu (say at the little cpu's max_frequency). The speed with which we can classify the task as needing a big cpu is faster with WALT than PELT afaics. With WALT, it can be classified as a 'big' task as soon as one window completes (typically 10-20ms latency), whereas with PELT it will take longer. It could however be argued that the task may not exhibit the same high demand in the next window (and hence the conservative approach of PELT is better), but from our experience with mobile workloads (especially those involving graphics processing, which is largely window/frame-based) the window-based approach for predicting the immediate cpu/frequency needs of a task has worked better.
I think they are both noticeable differences and should be mentioned in the 'why' part of the cover letter. Your paragraph above is quite clear, and should be saved for the cover letter as well, IMHO.
I believe Google has also done a fair amount of validation of the usefulness of WALT (for mobile workloads), which is why we have now come to the point of discussing the approach more widely.
+enum task_event {
+	PUT_PREV_TASK	= 0,
+	PICK_NEXT_TASK	= 1,
+	TASK_WAKE	= 2,
+	TASK_MIGRATE	= 3,
+	TASK_UPDATE	= 4,
+	IRQ_UPDATE	= 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running? Is there a benefit or are they used in account_cpu_busy_time() essentially as running/not running ?
I think we wanted some ability to discount wait-time from reflecting in a task's cpu demand/usage. In the PICK_NEXT_TASK event, for example, we can ignore the task's wait time (despite it being in running status) and not add that as cpu/task busy time. It appears possible to achieve that goal by depending just on the on_rq and on_cpu fields of a task (and do away with the various event types that we currently have). We will explore that optimization for the next iteration.
Right, that should be doable. You should also be able to rely on enqueue/dequeue flags for this matter.
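As an illustration of the point above, the event types can collapse to a single "did the task actually execute during this delta?" decision. The helper below is a sketch of that idea only, not code from the patch; it treats deltas ending with the task leaving the CPU as busy time and discounts runqueue wait time.

/*
 * Sketch only: map WALT's task_event values to a busy/not-busy decision
 * for the time delta that precedes the event, so that wait time on the
 * runqueue (e.g. the delta ending in PICK_NEXT_TASK) is not charged as
 * task/cpu busy time.
 */
static bool busy_for_task(enum task_event event)
{
	switch (event) {
	case PUT_PREV_TASK:
	case TASK_UPDATE:
		return true;	/* the task was executing on the CPU */
	case PICK_NEXT_TASK:
		return false;	/* delta was time spent waiting on the rq */
	case TASK_WAKE:
	case TASK_MIGRATE:
	case IRQ_UPDATE:
	default:
		return false;	/* no execution time to account here */
	}
}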
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
	 */
	smp_cond_acquire(!p->on_cpu);
+	raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
rq->lock needs to be held before calling walt_update_task_ravg() and so this locking is unavoidable.
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
WALT, as implemented currently, depends on maintaining windows that are synchronized across all cpus. This helps adjust cpu busy counters when tasks migrate. This requires a hardware clock synchronized across cpus (not a hard requirement, but a nice-to-have one, easily met on ARM) and also one cpu nominated as the "reference" for the window_start of the currently active window. cpu0 is the default sync_cpu and its window_start is initialized during bootup, based on the hardware clock (sched_clock()) value seen at that point. CPU0's window_start value is advanced periodically to reflect the expiration of fixed-size windows. Other cpus' window_start values are initialized later during bootup in reference to CPU0's window_start value. Once a secondary cpu's window_start has been 'synchronized' with CPU0's window_start, no further synchronization is required later.
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
Do you think it might be actually possible to relax the synchronization constraint and implement WALT with un-synchronized windows?
Best,
- Juri
* Juri Lelli juri.lelli@arm.com [2016-09-16 10:22:52]:
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
Do you think it might be actually possible to relax the synchronization constraint and implement WALT with un-synchronized windows?
The main difficulty would be to adjust busy counters when tasks migrate. Synchronized windows would make this pretty trivial. We subtract the task's current window contribution from src_cpu and add that to dst_cpu.
In our early version of WALT, we did not have synchronized windows across CPUs. Windows applied just to tasks and not cpus. Each task tracked its own window_start and cpus did not even track windows. The cited benefit of WALT (rapid reclassification of tasks) can still be had from such a scheme.
The additional advantage we get from synchronized windows and busy-time adjustment upon migration is related to frequency. Let's say a task is migrating between the little cpu cluster and the big cpu cluster at the end of a window (because it got classified as a big task towards the end of the window). Synchronized windows allow us to migrate the task's busy time away from its little cpu to the big cpu. The load reported for the little cpu after this adjustment ensures that the little cpu's frequency for the next window does not include a representation of the migrated task's needs. Vice versa for the big cpu.
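A minimal sketch of that fixup, assuming the per-task and per-cpu counter names from the commit text (p->ravg.curr_window/prev_window, rq->curr_runnable_sum/prev_runnable_sum); the helper name and the assumption that both runqueue locks are already held are illustrative, not part of the patch.

/*
 * Sketch of the busy-time adjustment on migration: with synchronized
 * windows the task's window contributions can simply be moved from the
 * source runqueue's aggregates to the destination's.  Caller is assumed
 * to hold both rq locks.
 */
static void move_task_busy_time(struct task_struct *p,
				struct rq *src_rq, struct rq *dst_rq)
{
	src_rq->curr_runnable_sum -= p->ravg.curr_window;
	src_rq->prev_runnable_sum -= p->ravg.prev_window;

	dst_rq->curr_runnable_sum += p->ravg.curr_window;
	dst_rq->prev_runnable_sum += p->ravg.prev_window;
}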
I haven't given much thought to how difficult this would be to achieve on other architectures. Does anyone foresee this being a show-stopper on any architecture?
Regarding overheads associated with synchronization, there is only a small overhead during bootup when secondary cpus need to sync up on 'window_start' for the first time. After that they roll on their own (provided there is some constant offset between the hardware clocks of the various cpus). The other subtle overhead related to synchronization is that we require both the src_rq and dst_rq locks to be held during migration (so that we can fix up busy times). I need to think some more and see if we may be able to relax that requirement.
- vatsa
-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
On 16/09/16 19:42, Srivatsa Vaddagiri wrote:
- Juri Lelli juri.lelli@arm.com [2016-09-16 10:22:52]:
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
Do you think it might be actually possible to relax the synchronization constraint and implement WALT with un-synchronized windows?
The main difficulty would be to adjust busy counters when tasks migrate. Synchronized windows would make this pretty trivial. We subtract the task's current window contribution from src_cpu and add that to dst_cpu.
Right. PELT does the removed_{load,utilization} atomic dance to solve this problem. But, of course, it is not as immediate as an atomic src/dst update like you do. Signals are still an approximation of real execution though, so it remains to be seen how much the added locking pays off.
In our early version of WALT, we did not have synchronized windows across CPUs. Windows applied just to tasks and not cpus. Each task tracked its own window_start and cpus did not even track windows. The cited benefit of WALT (rapid reclassification of tasks) can still be had from such a scheme.
Yes.
The additional advantage we get from synchronized windows and busy-time adjustment upon migration is related to frequency. Let's say a task is migrating between the little cpu cluster and the big cpu cluster at the end of a window (because it got classified as a big task towards the end of the window). Synchronized windows allow us to migrate the task's busy time away from its little cpu to the big cpu. The load reported for the little cpu after this adjustment ensures that the little cpu's frequency for the next window does not include a representation of the migrated task's needs. Vice versa for the big cpu.
Yeah. I remember we had that problem on the product codeline before switching to WALT, and the way it was fixed involved kicking a utilization update on the src runqueue (the little cpu in your example) and then a frequency re-evaluation for the src runqueue's cluster. The big CPU update already happens as a consequence of the enqueue there.
I haven't given much thought to how difficult this would be to achieve on other architectures. Does anyone foresee this being a show-stopper on any architecture?
I'm mostly afraid of the fact that you basically reintroduce double locking on migration after it has been removed (for CFS) a couple of years ago with commit 163122b7fcfa "sched/fair: Remove double_lock_balance() from load_balance()".
Regarding overheads associated with synchronization, there is only a small overhead during bootup when secondary cpus need to sync up on 'window_start' for the first time. After that they roll on their own (provided there is some constant offset between the hardware clocks of the various cpus).
Not sure we can assume this for all architectures.
The other subtle overhead related to synchronization is that we require both the src_rq and dst_rq locks to be held during migration (so that we can fix up busy times). I need to think some more and see if we may be able to relax that requirement.
- vatsa
-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
* Juri Lelli juri.lelli@arm.com [2016-09-16 16:14:34]:
I haven't given much thought to how difficult this would be to achieve on other architectures. Does anyone foresee this being a show-stopper on any architecture?
I'm mostly afraid of the fact that you basically reintroduce double locking on migration after it has been removed (for CFS) a couple of years ago with commit 163122b7fcfa "sched/fair: Remove double_lock_balance() from load_balance()".
I think it should be possible to adjust counters without relying on double-locking. We will include that change in next version.
Regarding overheads associated with synchronization, there is only a small overhead during bootup when secondary cpus need to sync up on 'window_start' for the first time. After that they roll on their own (provided there is some constant offset between the hardware clocks of the various cpus).
Not sure we can assume this for all architectures.
I guess we can rely on ktime as a fall-back option on architectures where the constant-offset prerequisite can't be met. We can discuss that further on lkml.
- vatsa
-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
On 15/09/16 15:43, Srivatsa Vaddagiri wrote:
- Dietmar Eggemann dietmar.eggemann@arm.com [2016-09-13 12:50:15]:
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
Anyone that has an interest in WALT data, mainly cpufreq governors. Also, a minor correction here - I don't think we envisioned any registration mechanism here to share WALT data. In our production kernel, the cpufreq (interactive) governor pulls the data whenever it needs it by calling an exported function of the scheduler.
Ok, understood ... just stumbled upon this registration thing ...
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Not sure I follow you here. Which "former" paragraph is being referred here?
I was referring to the text above.
To add some clarity on "policy" stuff Vikram is referring to here, prev_runnable_sum refers to actual busy time incurred in previous window. How that is used to decide on next frequency involves consideration of desired headroom or idle time. For example, a CPU that was busy for 99% of the previous window when it was running at some frequency f1, may or may not result in a frequency increase for the next window, depending on the "idle time" goals set by user (which is the policy aspect involved here).
Understood, will have another look into the code trying to grasp it.
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
I think the main point of difference between PELT and WALT is the speed at which the "forecasted" demand of a task can be updated. For example, consider a task that was using 1% of cpu for a long time and hence was considered a little task, fit to be run on a little cpu. It experiences a sudden burst of work, which makes it consume 100% of a little cpu (say at the little cpu's max_frequency). The speed with which we can classify the task as needing a big cpu is faster with WALT than PELT afaics. With WALT, it can be classified as a 'big' task as soon as one window completes (typically 10-20ms latency), whereas with PELT it will take longer. It could however be argued that the task may not exhibit the same high demand in the next window (and hence the conservative approach of PELT is better), but from our experience with mobile workloads (especially those involving graphics processing, which is largely window/frame-based) the window-based approach for predicting the immediate cpu/frequency needs of a task has worked better. I believe Google has also done a fair amount of validation of the usefulness of WALT (for mobile workloads), which is why we have now come to the point of discussing the approach more widely.
Makes sense.
+enum task_event {
+	PUT_PREV_TASK	= 0,
+	PICK_NEXT_TASK	= 1,
+	TASK_WAKE	= 2,
+	TASK_MIGRATE	= 3,
+	TASK_UPDATE	= 4,
+	IRQ_UPDATE	= 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running? Is there a benefit or are they used in account_cpu_busy_time() essentially as running/not running ?
I think we wanted some ability to discount wait-time from reflecting in a task's cpu demand/usage. In the PICK_NEXT_TASK event, for example, we can ignore the task's wait time (despite it being in running status) and not add that as cpu/task busy time. It appears possible to achieve that goal by depending just on the on_rq and on_cpu fields of a task (and do away with the various event types that we currently have). We will explore that optimization for the next iteration.
Clearer now.
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
	 */
	smp_cond_acquire(!p->on_cpu);
+	raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
rq->lock needs to be held before calling walt_update_task_ravg() and so this locking is unavoidable.
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
WALT, as implemented currently, depends on maintaining windows that are synchronized across all cpus. This helps adjust cpu busy counters when tasks migrate. This requires a hardware clock synchronized across cpus (not a hard requirement, but a nice-to-have one, easily met on ARM) and also one cpu nominated as the "reference" for the window_start of the currently active window. cpu0 is the default sync_cpu and its window_start is initialized during bootup, based on the hardware clock (sched_clock()) value seen at that point. CPU0's window_start value is advanced periodically to reflect the expiration of fixed-size windows. Other cpus' window_start values are initialized later during bootup in reference to CPU0's window_start value. Once a secondary cpu's window_start has been 'synchronized' with CPU0's window_start, no further synchronization is required later.
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
Thanks for the explanation.
Why not use the mainline frequency invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency invariance needs, i.e. implement the FIE (Frequency Invariance Engine) in the arch and link it to its users via '#define arch_scale_freq_capacity foo_scale_freq_capacity'?
Yes that's on our todo-list (to adopt upstream available mechanisms for scaling cpu stats).
Cool.
-- Dietmar
- vatsa
On 2016-09-16 10:58, Dietmar Eggemann wrote:
On 15/09/16 15:43, Srivatsa Vaddagiri wrote:
- Dietmar Eggemann dietmar.eggemann@arm.com [2016-09-13 12:50:15]:
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
Anyone that has an interest in WALT data, mainly cpufreq governors. Also, a minor correction here - I don't think we envisioned any registration mechanism here to share WALT data. In our production kernel, the cpufreq (interactive) governor pulls the data whenever it needs it by calling an exported function of the scheduler.
Ok, understood ... just stumbled upon this registration thing ...
sched/cpufreq.c exposes the registration API for governors. It currently allows just one callback function, but I suppose that may change in the future. I'll disambiguate this a bit.
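For reference, a sketch of that hook as it exists in v4.7: a governor attaches a per-CPU callback through cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) and the scheduler invokes it with a (util, max) pair, which with this series reflects WALT's prev_runnable_sum. The governor-side names below are made up for illustration.

#include <linux/sched.h>	/* struct update_util_data and hook helpers in v4.7 */
#include <linux/percpu-defs.h>

static DEFINE_PER_CPU(struct update_util_data, demo_update_util);

/* Called by the scheduler on utilization updates for this CPU. */
static void demo_update(struct update_util_data *data, u64 time,
			unsigned long util, unsigned long max)
{
	/* map util/max to a frequency request here */
}

static void demo_gov_start(int cpu)
{
	cpufreq_add_update_util_hook(cpu, &per_cpu(demo_update_util, cpu),
				     demo_update);
}

static void demo_gov_stop(int cpu)
{
	cpufreq_remove_update_util_hook(cpu);
}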
[...]
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that
executed during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that
executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Not sure I follow you here. Which "former" paragraph is being referred here?
I was referring to the text above.
To add some clarity on "policy" stuff Vikram is referring to here, prev_runnable_sum refers to actual busy time incurred in previous window. How that is used to decide on next frequency involves consideration of desired headroom or idle time. For example, a CPU that was busy for 99% of the previous window when it was running at some frequency f1, may or may not result in a frequency increase for the next window, depending on the "idle time" goals set by user (which is the policy aspect involved here).
Understood, will have another look into the code trying to grasp it.
This was more of a standalone intro to schedutil+WALT. I really wanted to say that we don't do anything different from schedutil+PELT in terms of enforcing policy, and that the reported number is true frequency/capacity-invariant utilization.
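To illustrate the policy point: the scheduler only reports raw busy time, and the governor decides how much idle headroom it wants before mapping that onto an OPP. A rough sketch follows; the window size and the 25% headroom are assumed example values, not anything mandated by WALT or schedutil (schedutil's own formula is visible in get_next_freq() in the next patch).

#include <linux/math64.h>

/*
 * Sketch only: turn a prev_runnable_sum style busy time into a frequency
 * request with a fixed idle-headroom target.
 */
static unsigned long busy_time_to_freq(u64 busy_ns, u64 window_ns,
				       unsigned long max_freq_khz)
{
	/* utilization of the previous window, scaled to 0..1024 */
	u64 util = div64_u64(busy_ns << 10, window_ns);

	/* ask for 25% idle headroom on top of what was measured */
	util += util >> 2;
	if (util > 1024)
		util = 1024;

	return (unsigned long)div64_u64((u64)max_freq_khz * util, 1024);
}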
Thanks, Vikram
From: Vikram Mulukutla markivx@codeaurora.org
Translating utilization to frequency may not be a simple operation since on some architectures, certain frequencies represent "boost" frequencies that may allow hardware to boost frequency to beyond what is represented in software. For example, Intel x86 machines have a max frequency that is only 1MHz greater than the next highest frequency in cpufreq tables, but can provide 200MHz more capacity depending on the number of non-idle CPUs.
This is a temporary/hack patch to use a translation table in cpufreq_schedutil to translate scheduler utilization to the next_freq value in get_next_freq. The capacity values in the table are calculated by running appropriate workloads (like sysbench) at each P-state.
Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
---
 include/linux/sched/sysctl.h     |  1 +
 kernel/sched/cpufreq_schedutil.c | 37 ++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c                  |  9 +++++++++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 7007815..3b2dac1 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -32,6 +32,7 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_sched_use_walt_metrics;
+extern unsigned int sysctl_sched_use_cap_table;

 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2eef34d..ef688216 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -107,6 +107,27 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
 	}
 }

+struct util_freq {
+	unsigned int util;
+	unsigned long freq;
+};
+
+static struct util_freq cap_table[] = {
+	{589, 3401000},
+	{526, 3400000},
+	{494, 3200000},
+	{463, 3000000},
+	{433, 2800000},
+	{401, 2600000},
+	{362, 2400000},
+	{339, 2200000},
+	{308, 2000000},
+	{276, 1800000},
+	{245, 1600000},
+};
+
+unsigned int sysctl_sched_use_cap_table;
+
 /**
  * get_next_freq - Compute a new frequency for a given cpufreq policy.
  * @policy: cpufreq policy object to compute the new frequency for.
@@ -132,8 +153,21 @@ static unsigned int get_next_freq(struct cpufreq_policy *policy,
 					 arch_scale_freq_invariant());
 	unsigned int freq = invariant ? policy->cpuinfo.max_freq :
 				policy->cur;
+	unsigned int next_freq;
+	int j = 1;
+
+	if (sysctl_sched_use_cap_table) {
+		if (!invariant)
+			util = (util * policy->cur) / policy->cpuinfo.max_freq;
+		util += util >> 2;
+		while ((j < ARRAY_SIZE(cap_table)) && (util < cap_table[j].util))
+			j++;
+		next_freq = cap_table[j-1].freq;
+	} else {
+		next_freq = (freq + (freq >> 2)) * util / max;
+	}

-	return (freq + (freq >> 2)) * util / max;
+	return next_freq;
 }

 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -150,6 +184,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
 			get_next_freq(policy, util, max);
 	sugov_update_commit(sg_policy, time, next_f);
+
 }

 static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4669a34..2605758 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -448,6 +448,15 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "sched_use_cap_table",
+		.data		= &sysctl_sched_use_cap_table,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.procname	= "sched_cfs_bandwidth_slice_us",

-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Vikram Mulukutla markivx@codeaurora.org
Translating utilization to frequency may not be a simple operation since on some architectures, certain frequencies represent "boost" frequencies that may allow hardware to boost frequency to beyond what is represented in software. For example, Intel x86 machines have a max frequency that is only 1MHz greater than the next highest frequency in cpufreq tables, but can provide 200MHz more capacity depending on the number of non-idle CPUs.
This is a temporary/hack patch to use a translation table in cpufreq_schedutil to translate scheduler utilization to the next_freq value in get_next_freq. The capacity values in the table are calculated by running appropriate workloads (like sysbench) at each P-state.
Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
include/linux/sched/sysctl.h | 1 + kernel/sched/cpufreq_schedutil.c | 37 ++++++++++++++++++++++++++++++++++++- kernel/sysctl.c | 9 +++++++++ 3 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 7007815..3b2dac1 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -32,6 +32,7 @@ extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_sched_use_walt_metrics; +extern unsigned int sysctl_sched_use_cap_table;
#ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 2eef34d..ef688216 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -107,6 +107,27 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, } }
+struct util_freq {
+	unsigned int util;
+	unsigned long freq;
+};
+static struct util_freq cap_table[] = {
+	{589, 3401000},
Another thing, the util of a single hw-thread on a two threaded core is 589 but cpu_efficiency returns 1024. There is ongoing discussion about this issue on LKML:
https://lkml.org/lkml/2016/8/18/993
Most part of the discussion is about load-balancing though.
+	{526, 3400000},
+	{494, 3200000},
+	{463, 3000000},
+	{433, 2800000},
+	{401, 2600000},
+	{362, 2400000},
+	{339, 2200000},
+	{308, 2000000},
+	{276, 1800000},
+	{245, 1600000},
+};
This probably only makes sense on a certain type of recent x86 platform.
My i7-4750HQ gives me:
2001000 2000000 1900000 1800000 1700000 1600000 1500000 1400000 1300000 1200000 1100000 1000000 900000 800000
I ran with 'intel_pstate=disable', which I guess is worth mentioning as well.
+unsigned int sysctl_sched_use_cap_table;
Requires CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y. Maybe worth mentioning in the patch header since it's not part of make defconfig.
[...]
On 2016-09-13 04:50, Dietmar Eggemann wrote:
On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Vikram Mulukutla markivx@codeaurora.org
Translating utilization to frequency may not be a simple operation since on some architectures, certain frequencies represent "boost" frequencies that may allow hardware to boost frequency to beyond what is represented in software. For example, Intel x86 machines have a max frequency that is only 1MHz greater than the next highest frequency in cpufreq tables, but can provide 200MHz more capacity depending on the number of non-idle CPUs.
This is a temporary/hack patch to use a translation table in cpufreq_schedutil to translate scheduler utilization to the next_freq value in get_next_freq. The capacity values in the table are calculated by running appropriate workloads (like sysbench) at each P-state.
Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
include/linux/sched/sysctl.h | 1 + kernel/sched/cpufreq_schedutil.c | 37 ++++++++++++++++++++++++++++++++++++- kernel/sysctl.c | 9 +++++++++ 3 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 7007815..3b2dac1 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -32,6 +32,7 @@ extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_sched_use_walt_metrics; +extern unsigned int sysctl_sched_use_cap_table;
#ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 2eef34d..ef688216 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -107,6 +107,27 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, } }
+struct util_freq {
+	unsigned int util;
+	unsigned long freq;
+};
+static struct util_freq cap_table[] = {
+	{589, 3401000},
Another thing, the util of a single hw-thread on a two threaded core is 589 but cpu_efficiency returns 1024. There is ongoing discussion about this issue on LKML:
Thank you for that pointer; it seems the problem remains unsolved. I am sort of glad that we don't have to deal with SMT on ARM! I think that on 4.7-rc6 at least the PELT stats are not freq or cap invariant, so I do see that util_avg is scaling to 589 (rq->max_cap_orig) at max.
Most part of the discussion is about load-balancing though.
+	{526, 3400000},
+	{494, 3200000},
+	{463, 3000000},
+	{433, 2800000},
+	{401, 2600000},
+	{362, 2400000},
+	{339, 2200000},
+	{308, 2000000},
+	{276, 1800000},
+	{245, 1600000},
+};
This probably only makes sense on a certain type of recent x86 platform.
My i7-4750HQ gives me:
2001000 2000000 1900000 1800000 1700000 1600000 1500000 1400000 1300000 1200000 1100000 1000000 900000 800000
I ran with 'intel_pstate=disable', which I guess is worth mentioning as well.
Yes, you have to modify the command line. At some point we'll probably have to look at schedutil vs intel_pstate seriously. It seems intel_pstate is enabled by default when available, instead of the software governors.
Thanks, Vikram
From: Vikram Mulukutla markivx@codeaurora.org
Actual frequency on X86 may be a function of the number of non-idle CPUs, the highest frequency amongst the cpufreq policies corresponding to the CPUs, and whether a turbo-boost frequency has been selected. For example, if cpu0 has voted for 2GHz and cpu1 for 1GHz, cpu1 can also run at 2GHz for as long as cpu0 is not idle.
This can result in somewhat inaccurate load tracking when a load statistic needs to be frequency invariant. To address this, calculate the current frequency on supported Intel X86 machines using the PMC counters. As far as I can tell, these counters can't be read by a remote CPU, and IPIs would be too expensive, so try to keep the frequency updated as often as possible on a CPU.
This is a temporary/hack patch included for completeness and not intended for merging. Your machine may crash if the rdpmc instruction is unsupported!
Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
---
 kernel/sched/core.c  |  1 +
 kernel/sched/sched.h |  3 +++
 kernel/sched/walt.c  | 42 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/walt.h  |  2 ++
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 068bde8..250014c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -111,6 +111,7 @@ void update_rq_clock(struct rq *rq)
 		return;
 	rq->clock += delta;
 	update_rq_clock_task(rq, delta);
+	walt_update_cpu_freq(rq);
 }

 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 52a0ac5..e6acd66 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -666,6 +666,9 @@ struct rq {
 #ifdef CONFIG_SCHED_WALT
 	unsigned int cur_freq;
 	struct cpumask freq_domain_cpumask;
+	u64 cycle_update_time;
+	unsigned long cycles;
+	unsigned int cur_freq_hw;

 	int efficiency; /* Differentiate cpus with different IPC capability */
 	u64 window_start;
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c
index 203e02d..ccea3f4 100644
--- a/kernel/sched/walt.c
+++ b/kernel/sched/walt.c
@@ -125,9 +125,48 @@ update_window_start(struct rq *rq, u64 wallclock)
 	rq->window_start += (u64)nr_windows * (u64)walt_ravg_window;
 }

+unsigned long rdpmc_actual_cycles(void)
+{
+	unsigned a, d, c;
+	c = (1<<30)+1;
+	__asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
+	return ((unsigned long)a) | (((unsigned long)d) << 32);;
+}
+
+static int start_hw_freq_update;
+
+inline void walt_update_cpu_freq(struct rq *rq)
+{
+	u64 cur_time;
+	unsigned long cur_cycles;
+
+	if (unlikely(!start_hw_freq_update))
+		return;
+
+	if (cpu_of(rq) != smp_processor_id())
+		return;
+
+	cur_time = sched_clock_cpu(cpu_of(rq));
+	cur_cycles = rdpmc_actual_cycles();
+
+	rq->cur_freq_hw = ((cur_cycles - rq->cycles) * 1000 * 1000)/
+			  (cur_time - rq->cycle_update_time);
+
+	rq->cycle_update_time = cur_time;
+	rq->cycles = cur_cycles;
+}
+
+static int __init walt_init_late(void)
+{
+	start_hw_freq_update = 1;
+	return 0;
+}
+late_initcall(walt_init_late);
+
 static u64 scale_exec_time(u64 delta, struct rq *rq)
 {
-	unsigned int cur_freq = rq->cur_freq;
+	unsigned int cur_freq = rq->cur_freq_hw > rq->cur_freq ? rq->cur_freq_hw
+							       : rq->cur_freq;
 	int sf;

 	/* round up div64 */
@@ -285,6 +324,7 @@ void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
 	if (!p->ravg.mark_start)
 		goto done;

+	walt_update_cpu_freq(rq);
 	update_cpu_busy_time(p, rq, event, wallclock, irqtime);

 done:
diff --git a/kernel/sched/walt.h b/kernel/sched/walt.h
index 5b03995..555192d 100644
--- a/kernel/sched/walt.h
+++ b/kernel/sched/walt.h
@@ -16,6 +16,7 @@

 #ifdef CONFIG_SCHED_WALT

+void walt_update_cpu_freq(struct rq *rq);
 void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
 			   enum task_event event, u64 wallclock, u64 irqtime);
 void walt_fixup_busy_time(struct task_struct *p, int new_cpu);
@@ -47,6 +48,7 @@ static inline unsigned long cpu_walt_util(struct rq *rq)

 #else /* CONFIG_SCHED_WALT */

+static inline void walt_update_cpu_freq(struct rq *rq) { };
 static inline void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
 					 int event, u64 wallclock, u64 irqtime) { }
 static inline void walt_fixup_busy_time(struct task_struct *p, int new_cpu) { }

-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Vikram Mulukutla markivx@codeaurora.org
Actual frequency on X86 may be a function of the number of non-idle CPUs, the highest frequency amongst the cpufreq policies corresponding to the CPUs, and whether a turbo-boost frequency has been selected. For example, if cpu0 has voted for 2GHz and cpu1 for 1GHz, cpu1 can also run at 2GHz for as long as cpu0 is not idle.
This can result in somewhat inaccurate load tracking when a load statistic needs to be frequency invariant. To address this, calculate the current frequency on supported Intel X86 machines using the PMC counters. As far as I can tell, these counters can't be read by a remote CPU, and IPIs would be too expensive, so try to keep the frequency updated as often as possible on a CPU.
This is a temporary/hack patch included for completeness and not intended for merging. Your machine may crash if the rdpmc instruction is unsupported!
There was this aperf/mperf approach on x86 which was already mainlined commit 47fe38fcff05 ("x86: sched: Provide arch implementations using aperf/mperf"). It has been removed since then by commit 61c63e5ed3b9 ("cpufreq: Remove unused APERF/MPERF support") and commit ee08d1284ea9 ("sched/x86: Remove broken power estimation") but people mentioned this in the meantime when it came to frequency invariance on x86.
Any particular reason why you prefer 'rdpmc' instead?
[...]
On 2016-09-13 04:50, Dietmar Eggemann wrote:
On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Vikram Mulukutla markivx@codeaurora.org
Actual frequency on X86 may be a function of the number of non-idle CPUs, the highest frequency amongst the cpufreq policies corresponding to the CPUs, and whether a turbo-boost frequency has been selected. For example, if cpu0 has voted for 2GHz and cpu1 for 1GHz, cpu1 can also run at 2GHz for as long as cpu0 is not idle.
This can result in somewhat inaccurate load tracking when a load statistic needs to be frequency invariant. To address this, calculate the current frequency on supported Intel X86 machines using the PMC counters. As far as I can tell, these counters can't be read by a remote CPU, and IPIs would be too expensive, so try to keep the frequency updated as often as possible on a CPU.
This is a temporary/hack patch included for completeness and not intended for merging. Your machine may crash if the rdpmc instruction is unsupported!
There was this aperf/mperf approach on x86 which was already mainlined commit 47fe38fcff05 ("x86: sched: Provide arch implementations using aperf/mperf"). It has been removed since then by commit 61c63e5ed3b9 ("cpufreq: Remove unused APERF/MPERF support") and commit ee08d1284ea9 ("sched/x86: Remove broken power estimation") but people mentioned this in the meantime when it came to frequency invariance on x86.
Any particular reason why you prefer 'rdpmc' instead?
rdpmc to me seemed to be a more recent addition and the simplest and fastest way to get raw cpu cycles with a single register read. I wanted to trace the actual cycle count too, in order to compare it with what software has voted, C-states etc. The aperf/mperf ratio derived from the MSRs seems less direct in that respect (I'm not sure how turbo-boost would affect the max ratio value, for instance).
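For comparison with the rdpmc approach, a sketch of the APERF/MPERF technique mentioned above: the ratio of the two MSR deltas gives the average frequency relative to the base (MPERF) frequency over the sampling interval. The MSR names come from <asm/msr-index.h>; the per-CPU bookkeeping and scaling here are illustrative only, and the read must run on the local CPU.

#include <linux/math64.h>
#include <linux/percpu-defs.h>
#include <asm/msr.h>

struct amperf_sample {
	u64 aperf;
	u64 mperf;
};

static DEFINE_PER_CPU(struct amperf_sample, amperf_prev);

/* Average frequency since the previous call, relative to the MPERF base
 * frequency, scaled by 1024.  Both counters only advance in C0. */
static unsigned long amperf_freq_ratio(void)
{
	struct amperf_sample *prev = this_cpu_ptr(&amperf_prev);
	u64 aperf, mperf, da, dm;

	rdmsrl(MSR_IA32_APERF, aperf);
	rdmsrl(MSR_IA32_MPERF, mperf);

	da = aperf - prev->aperf;
	dm = mperf - prev->mperf;
	prev->aperf = aperf;
	prev->mperf = mperf;

	return dm ? div64_u64(da << 10, dm) : 1024;
}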
I do need to reduce the overhead associated with this patch, perhaps by reading the counter during tick and context switches and a very few more places.
Thanks, Vikram
Hi Vikram,
I ran the patch set on my x86-box to get familiar with the WALT internals. Unfortunately I'm not part of the EAS product initiative so this is my first encounter with it. However, my humble knowledge of PELT might help creating a WALT patch-set LKML folk might appreciate.
Cheers,
-- Dietmar
On 03/09/16 00:27, markivx@codeaurora.org wrote:
This patch series implements an alternative window assisted load tracking mechanism in lieu of PELT based cpu utilization tracking. Testing has shown that a window based non-decaying metric such as WALT guiding cpu frequency and task placement decisions can improve performance/power especially when running workloads more commonly found on mobile devices. The aim of this series is to incorporate WALT accounting into the scheduler and feed WALT statistics to schedutil in order to guide cpu frequency selection. The implementation is detailed in the commit text of Patch 1. The eventual goal is to also guide placement decisions based on WALT statistics.
By placement you mean EAS/capacity aware wakeup or does it include load-balance?
WALT has existed in out-of-tree kernels for ARM/ARM64 commercialized devices for a few years. This is an effort to bring WALT to mainline as well as to test on multiple architectures and with varied workloads.
This RFC version is mainly to preview what the code will look like on mainline. Future RFC revisions will include a theoretical discussion and benchmark results.
This would be the more interesting part (the 'why' (TM Morten R.)) of replacing PELT util with WALT. And this also in perspective of the recent effort from Vincent G. to fix util for task groups on LKML.
Tested on an Intel x86_64 machine (on top of 4.7-rc6). (Benchmark results will be sent out separately and as part of this message in the next RFC version).
I ran 'perf bench sched messaging -g 20 -l 5000' on my 'Intel Core i7-4750HQ CPU @ 2.00GHz' on v4.7 and v4.7+ and didn't see much difference. So we probably need some more clever workloads to evaluate the overhead of this extra locking.
[...]
Hi Dietmar,
On 2016-09-13 04:50, Dietmar Eggemann wrote:
Hi Vikram,
I ran the patch set on my x86-box to get familiar with the WALT internals. Unfortunately I'm not part of the EAS product initiative so this is my first encounter with it. However, my humble knowledge of PELT might help creating a WALT patch-set LKML folk might appreciate.
Cheers,
-- Dietmar
Thank you for taking the time to actually run the code in addition to reviewing it! Very much appreciated. You and Juri have already brought up good points.
On 03/09/16 00:27, markivx@codeaurora.org wrote:
This patch series implements an alternative window assisted load tracking mechanism in lieu of PELT based cpu utilization tracking. Testing has shown that a window based non-decaying metric such as WALT guiding cpu frequency and task placement decisions can improve performance/power especially when running workloads more commonly found on mobile devices. The aim of this series is to incorporate WALT accounting into the scheduler and feed WALT statistics to schedutil in order to guide cpu frequency selection. The implementation is detailed in the commit text of Patch 1. The eventual goal is to also guide placement decisions based on WALT statistics.
By placement you mean EAS/capacity aware wakeup or does it include load-balance?
Wake-up placement mostly, I should clarify that. WALT statistics in load balancing decisions is something to be investigated. This is also leading to the question of whether we want to completely eliminate PELT stats for a true complete comparison.
WALT has existed in out-of-tree kernels for ARM/ARM64 commercialized devices for a few years. This is an effort to bring WALT to mainline as well as to test on multiple architectures and with varied workloads.
This RFC version is mainly to preview what the code will look like on mainline. Future RFC revisions will include a theoretical discussion and benchmark results.
This would be the more interesting part (the 'why' (TM Morten R.)) of replacing PELT util with WALT. And this also in perspective of the recent effort from Vincent G. to fix util for task groups on LKML.
Tested on an Intel x86_64 machine (on top of 4.7-rc6). (Benchmark results will be sent out separately and as part of this message in the next RFC version).
I ran 'perf bench sched messaging -g 20 -l 5000' on my 'Intel Core i7-4750HQ CPU @ 2.00GHz' on v4.7 and v4.7+ and didn't see much difference. So we probably need some more clever workloads to evaluate the overhead of this extra locking.
[...]
Perhaps measuring function call execution times during migration and other cases where we have additional locking might at least provide a data point. The original CFS patch that removed double_lock_balance from load balancing did not have data along with the patch.
I will also instrument some of the WALT update functions to see how much overhead is added in comparison to the existing __update_load_avg code. It will also be interesting to see how much PELT math costs as well.
Also, power measurement using RAPL counters is fairly easy on Intel. It does seem to give me consistent results and we're looking for A-B results anyway. WALT utilization does pretty well on things like video playback.
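For anyone wanting to reproduce the A-B power numbers, a userspace sketch of sampling the package energy counter exposed by the intel-rapl powercap driver; sample it before and after a run and divide the delta by the elapsed time. The path is the standard powercap sysfs location; error handling is minimal since this is only an illustration.

#include <stdio.h>

/* Read the cumulative package energy in microjoules (intel-rapl domain 0). */
static long long read_pkg_energy_uj(void)
{
	long long uj = -1;
	FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");

	if (f) {
		if (fscanf(f, "%lld", &uj) != 1)
			uj = -1;
		fclose(f);
	}
	return uj;
}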
Thanks, Vikram
On 03/09/16 00:27, markivx@codeaurora.org wrote:
This patch series implements an alternative window assisted load tracking mechanism in lieu of PELT based cpu utilization tracking. Testing has shown that a window based non-decaying metric such as WALT guiding cpu frequency and task placement decisions can improve performance/power especially when running workloads more commonly found on mobile devices. The aim of this series is to incorporate WALT accounting into the scheduler and feed WALT statistics to schedutil in order to guide cpu frequency selection. The implementation is detailed in the commit text of Patch 1. The eventual goal is to also guide placement decisions based on WALT statistics.
WALT has existed in out-of-tree kernels for ARM/ARM64 commercialized devices for a few years. This is an effort to bring WALT to mainline as well as to test on multiple architectures and with varied workloads.
This RFC version is mainly to preview what the code will look like on mainline. Future RFC revisions will include a theoretical discussion and benchmark results.
Tested on an Intel x86_64 machine (on top of 4.7-rc6). (Benchmark results will be sent out separately and as part of this message in the next RFC version).
Patch 1: Adds WALT tracking to the scheduler
Patches 2-3: Temporary patches to bring in EAS/sched-freq like capacity table and to use Intel PMC counters for more accurate frequency invariant load tracking on X86. Included for completeness but not meant for merging.
include/linux/sched.h | 35 ++++++++++ include/linux/sched/sysctl.h | 2 + include/trace/events/sched.h | 76 +++++++++++++++++++++ init/Kconfig | 9 +++ kernel/sched/Makefile | 1 + kernel/sched/core.c | 29 ++++++++- kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++- kernel/sched/cputime.c | 11 +++- kernel/sched/debug.c | 10 +++ kernel/sched/fair.c | 7 +- kernel/sched/sched.h | 13 ++++ kernel/sched/walt.c | 580 ++++++++++++++++++++++++++++++++++ kernel/sched/walt.h | 75 +++++++++++++++++++++ kernel/sysctl.c | 18 +++++ 14 files changed, 904 insertions(+), 6 deletions(-)
I caught a WALT related hard lockup on a v4.7 kernel with only patch 1 on top. Fairly easy to reproduce by watching a video in firefox browser on Ubuntu 16.04.
$ addr2line -e vmlinux ffffffff810d835e kernel/sched/sched.h:1542
$ addr2line -e vmlinux ffffffff810d29b0 kernel/sched/sched.h:1538
1531 static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
1532 	__releases(this_rq->lock)
1533 	__acquires(busiest->lock)
1534 	__acquires(this_rq->lock)
1535 {
1536 	int ret = 0;
1537
1538 	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
1539 		if (busiest < this_rq) {
1540 			raw_spin_unlock(&this_rq->lock);
1541 			raw_spin_lock(&busiest->lock);
1542 			raw_spin_lock_nested(&this_rq->lock,
1543 					      SINGLE_DEPTH_NESTING);
1544 			ret = 1;
1545 		} else
1546 			raw_spin_lock_nested(&busiest->lock,
1547 				SINGLE_DEPTH_NESTING);
1548 	}
1549 	return ret;
1550 }
[ 118.795603] ============================================= [ 118.795606] [ INFO: possible recursive locking detected ] [ 118.795609] 4.7.0-walt-v4 #3 Not tainted [ 118.795612] --------------------------------------------- [ 118.795615] rtkit-daemon/3133 is trying to acquire lock: [ 118.795619] (&rq->lock){-.-.-.}, at: [<ffffffff810d835e>] walt_fixup_busy_time+0x1ee/0x300 [ 118.795635] [ 118.795635] but task is already holding lock: [ 118.795638] (&rq->lock){-.-.-.}, at: [<ffffffff810d29b0>] push_rt_task.part.39+0xb0/0x2a0 [ 118.795650] [ 118.795650] other info that might help us debug this: [ 118.795653] Possible unsafe locking scenario: [ 118.795653] [ 118.795656] CPU0 [ 118.795659] ---- [ 118.795661] lock(&rq->lock); [ 118.795667] lock(&rq->lock); [ 118.795673] [ 118.795673] *** DEADLOCK *** [ 118.795673] [ 118.795676] May be due to missing lock nesting notation [ 118.795676] [ 118.795680] 1 lock held by rtkit-daemon/3133: [ 118.795682] #0: (&rq->lock){-.-.-.}, at: [<ffffffff810d29b0>] push_rt_task.part.39+0xb0/0x2a0 [ 118.795692] [ 118.795692] stack backtrace: [ 118.795697] CPU: 1 PID: 3133 Comm: rtkit-daemon Not tainted 4.7.0-walt-v4 #3 [ 118.795700] Hardware name: LENOVO 2537Z5F/2537Z5F, BIOS 6IET74WW (1.34 ) 10/25/2010 [ 118.795703] 0000000000000000 ffff8800ad7e77a8 ffffffff8143001c ffffffff829e8fc0 [ 118.795711] ffffffff829e8fc0 ffff8800ad7e7848 ffffffff810e5eab ffff880000000000 [ 118.795722] 000000000003e01f ffffffff8235f800 ffff8800af3ccd40 000000000000032f [ 118.795729] Call Trace: [ 118.795735] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110 [ 118.795736] in_atomic(): 1, irqs_disabled(): 1, pid: 3133, name: rtkit-daemon [ 118.795736] INFO: lockdep is turned off. [ 118.795737] irq event stamp: 330 [ 118.795741] hardirqs last enabled at (329): [<ffffffff818b675c>] _raw_spin_unlock_irq+0x2c/0x40 [ 118.795743] hardirqs last disabled at (330): [<ffffffff818b6eeb>] _raw_spin_lock_irqsave+0x2b/0x90 [ 118.795747] softirqs last enabled at (0): [<ffffffff810827b1>] copy_process.part.30+0x5c1/0x1e60 [ 118.795749] softirqs last disabled at (0): [< (null)>] (null) [ 118.795750] CPU: 1 PID: 3133 Comm: rtkit-daemon Not tainted 4.7.0-walt-v4 #3 [ 118.795751] Hardware name: LENOVO 2537Z5F/2537Z5F, BIOS 6IET74WW (1.34 ) 10/25/2010 [ 118.795753] 0000000000000001 ffff8800ad7e7390 ffffffff8143001c ffff8800af3ccd40 [ 118.795754] ffffffff81ca0267 ffff8800ad7e73b8 ffffffff810b3490 ffffffff81ca0267 [ 118.795756] 000000000000006e 0000000000000000 ffff8800ad7e73e0 ffffffff810b3599 [ 118.795756] Call Trace: [ 118.795763] [<ffffffff8143001c>] dump_stack+0x85/0xc9 [ 118.795766] [<ffffffff810b3490>] ___might_sleep+0x180/0x240 [ 118.795768] [<ffffffff810b3599>] __might_sleep+0x49/0x80 [ 118.795771] [<ffffffff810fc838>] synchronize_irq+0x38/0xa0 [ 118.795772] [<ffffffff810fbdfe>] ? __irq_put_desc_unlock+0x1e/0x40 [ 118.795774] [<ffffffff810fcae9>] ? 
__disable_irq_nosync+0x49/0x70 [ 118.795775] [<ffffffff810fcb3c>] disable_irq+0x1c/0x30 [ 118.795787] [<ffffffffc0172a02>] e1000_netpoll+0xf2/0x120 [e1000e] [ 118.795791] [<ffffffff817a3518>] netpoll_poll_dev+0x78/0x2c0 [ 118.795793] [<ffffffff817a3900>] netpoll_send_skb_on_dev+0x1a0/0x290 [ 118.795795] [<ffffffff817a3ccf>] netpoll_send_udp+0x2df/0x470 [ 118.795798] [<ffffffffc012ab32>] write_msg+0xb2/0xf0 [netconsole] [ 118.795800] [<ffffffff810f9489>] call_console_drivers.constprop.23+0x149/0x1e0 [ 118.795802] [<ffffffff810fa334>] console_unlock+0x4e4/0x5b0 [ 118.795803] [<ffffffff810fa7ae>] vprintk_emit+0x3ae/0x5d0 [ 118.795805] [<ffffffff810fab29>] vprintk_default+0x29/0x40 [ 118.795808] [<ffffffff811b7be2>] printk+0x4d/0x4f [ 118.795812] [<ffffffff810372c2>] show_trace_log_lvl+0x32/0x60 [ 118.795814] [<ffffffff8103681f>] show_stack_log_lvl+0xff/0x180 [ 118.795816] [<ffffffff81037335>] show_stack+0x25/0x50 [ 118.795818] [<ffffffff8143001c>] dump_stack+0x85/0xc9 [ 118.795821] [<ffffffff810e5eab>] __lock_acquire+0x193b/0x1940 [ 118.795823] [<ffffffff810deab4>] ? cpuacct_charge+0xd4/0x1d0 [ 118.795825] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.795826] [<ffffffff810d1dda>] ? update_curr_rt+0x15a/0x300 [ 118.795828] [<ffffffff810e6533>] lock_acquire+0xd3/0x220 [ 118.795830] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300 [ 118.795831] [<ffffffff818b638d>] _raw_spin_lock+0x3d/0x80 [ 118.795833] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300 [ 118.795834] [<ffffffff810d835e>] walt_fixup_busy_time+0x1ee/0x300 [ 118.795836] [<ffffffff810b783c>] set_task_cpu+0xac/0x2e0 [ 118.795837] [<ffffffff810d2a53>] push_rt_task.part.39+0x153/0x2a0 [ 118.795839] [<ffffffff810d2cb7>] push_rt_tasks+0x17/0x30 [ 118.795841] [<ffffffff811b6d3b>] __balance_callback+0x45/0x5c [ 118.795844] [<ffffffff818b0d96>] __schedule+0xaf6/0xbb0 [ 118.795846] [<ffffffff818b0e8c>] schedule+0x3c/0x90 [ 118.795847] [<ffffffff818b6053>] schedule_hrtimeout_range_clock+0xe3/0x140 [ 118.795850] [<ffffffff811125c0>] ? hrtimer_init+0x230/0x230 [ 118.795852] [<ffffffff818b6047>] ? schedule_hrtimeout_range_clock+0xd7/0x140 [ 118.795853] [<ffffffff818b60c3>] schedule_hrtimeout_range+0x13/0x20 [ 118.795858] [<ffffffff81261604>] poll_schedule_timeout+0x54/0x80 [ 118.795859] [<ffffffff81262e67>] do_sys_poll+0x3a7/0x510 [ 118.795861] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.795864] [<ffffffff8129ac30>] ? ep_poll_callback+0x120/0x360 [ 118.795866] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.795867] [<ffffffff810d60c0>] ? __wake_up_sync_key+0x50/0x60 [ 118.795869] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150 [ 118.795871] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150 [ 118.795873] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.795875] [<ffffffff811eeced>] ? __might_fault+0x4d/0xa0 [ 118.795877] [<ffffffff810e497d>] ? __lock_acquire+0x40d/0x1940 [ 118.795879] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.795880] [<ffffffff81261a2a>] ? poll_select_set_timeout+0x5a/0x90 [ 118.795883] [<ffffffff8111a3a4>] ? ktime_get_ts64+0x84/0x180 [ 118.795885] [<ffffffff810e424d>] ? trace_hardirqs_on+0xd/0x10 [ 118.795886] [<ffffffff8111a3d6>] ? ktime_get_ts64+0xb6/0x180 [ 118.795888] [<ffffffff81261a2a>] ? 
poll_select_set_timeout+0x5a/0x90 [ 118.795889] [<ffffffff81263095>] SyS_poll+0x65/0xf0 [ 118.795891] [<ffffffff818b7080>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 118.796149] [<ffffffff8143001c>] dump_stack+0x85/0xc9 [ 118.796153] [<ffffffff810e5eab>] __lock_acquire+0x193b/0x1940 [ 118.796156] [<ffffffff810deab4>] ? cpuacct_charge+0xd4/0x1d0 [ 118.796159] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.796163] [<ffffffff810d1dda>] ? update_curr_rt+0x15a/0x300 [ 118.796166] [<ffffffff810e6533>] lock_acquire+0xd3/0x220 [ 118.796169] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300 [ 118.796173] [<ffffffff818b638d>] _raw_spin_lock+0x3d/0x80 [ 118.796176] [<ffffffff810d835e>] ? walt_fixup_busy_time+0x1ee/0x300 [ 118.796179] [<ffffffff810d835e>] walt_fixup_busy_time+0x1ee/0x300 [ 118.796183] [<ffffffff810b783c>] set_task_cpu+0xac/0x2e0 [ 118.796187] [<ffffffff810d2a53>] push_rt_task.part.39+0x153/0x2a0 [ 118.796190] [<ffffffff810d2cb7>] push_rt_tasks+0x17/0x30 [ 118.796194] [<ffffffff811b6d3b>] __balance_callback+0x45/0x5c [ 118.796198] [<ffffffff818b0d96>] __schedule+0xaf6/0xbb0 [ 118.796201] [<ffffffff818b0e8c>] schedule+0x3c/0x90 [ 118.796204] [<ffffffff818b6053>] schedule_hrtimeout_range_clock+0xe3/0x140 [ 118.796207] [<ffffffff811125c0>] ? hrtimer_init+0x230/0x230 [ 118.796211] [<ffffffff818b6047>] ? schedule_hrtimeout_range_clock+0xd7/0x140 [ 118.796215] [<ffffffff818b60c3>] schedule_hrtimeout_range+0x13/0x20 [ 118.796218] [<ffffffff81261604>] poll_schedule_timeout+0x54/0x80 [ 118.796221] [<ffffffff81262e67>] do_sys_poll+0x3a7/0x510 [ 118.796225] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.796228] [<ffffffff8129ac30>] ? ep_poll_callback+0x120/0x360 [ 118.796232] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.796235] [<ffffffff810d60c0>] ? __wake_up_sync_key+0x50/0x60 [ 118.796239] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150 [ 118.796242] [<ffffffff812617d0>] ? poll_select_copy_remaining+0x150/0x150 [ 118.796246] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.796249] [<ffffffff811eeced>] ? __might_fault+0x4d/0xa0 [ 118.796253] [<ffffffff810e497d>] ? __lock_acquire+0x40d/0x1940 [ 118.796257] [<ffffffff8103d319>] ? sched_clock+0x9/0x10 [ 118.796260] [<ffffffff81261a2a>] ? poll_select_set_timeout+0x5a/0x90 [ 118.796264] [<ffffffff8111a3a4>] ? ktime_get_ts64+0x84/0x180 [ 118.796268] [<ffffffff810e424d>] ? trace_hardirqs_on+0xd/0x10 [ 118.796271] [<ffffffff8111a3d6>] ? ktime_get_ts64+0xb6/0x180 [ 118.796275] [<ffffffff81261a2a>] ? 
poll_select_set_timeout+0x5a/0x90 [ 118.796278] [<ffffffff81263095>] SyS_poll+0x65/0xf0 [ 118.796281] [<ffffffff818b7080>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 128.972478] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0dModules linked in: intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel arc4 aesni_intel iwldvm aes_x86_64 lrw gf128mul mac80211 glue_helper ablk_helper cryptd joydev iwlwifi snd_hda_codec_hdmi serio_raw snd_hda_codec_conexant snd_hda_codec_generic snd_hda_intel intel_ips snd_hda_codec cfg80211 snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi thinkpad_acpi snd_seq lpc_ich snd_seq_device mei_me snd_timer nvram mei snd soundcore mac_hid shpchp netconsole configfs parport_pc ppdev lp parport autofs4 hid_generic nouveau mxm_wmi i2c_algo_bit ttm drm_kms_helper syscopyarea firewire_ohci sysfillrect usbhid sysimgblt ahci fb_sys_fops e1000e psmouse hid libahci sdhci_pci firewire_core crc_itu_t sdhci drm ptp pps_core video wmi [ 128.972479] irq event stamp: 1026850 [ 128.972480] hardirqs last enabled at (1026849): [<ffffffff811243a6>] tick_nohz_idle_enter+0x46/0x80 [ 128.972481] hardirqs last disabled at (1026850): [<ffffffff810d6a7d>] cpu_startup_entry+0xcd/0x450 [ 128.972481] softirqs last enabled at (1026834): [<ffffffff8108b181>] _local_bh_enable+0x21/0x50 [ 128.972482] softirqs last disabled at (1026833): [<ffffffff8108c2b2>] irq_enter+0x72/0xa0
On 2016-09-16 11:09, Dietmar Eggemann wrote:
I caught a WALT-related hard lockup on a v4.7 kernel with only patch 1 on top. It is fairly easy to reproduce by watching a video in the Firefox browser on Ubuntu 16.04.
$ addr2line -e vmlinux ffffffff810d835e
Thanks a lot for testing and reporting this, Dietmar. This happens because we need to take both rq locks during migration, and the current code path does not account for the possibility that one of those locks is already held on some paths. Earlier, bug-free versions of the code had the double lock/unlock around set_task_cpu() as needed, but since that seemed expensive (bigger critical sections) we moved the locking into walt_fixup_busy_time(). We'll figure this one out soon.
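In other words, the problematic pattern is roughly the following (a heavily simplified sketch for illustration only; the exact signature and locking of walt_fixup_busy_time() in the patch may differ):

static void walt_fixup_busy_time(struct task_struct *p, int new_cpu)
{
        struct rq *src_rq  = task_rq(p);
        struct rq *dest_rq = cpu_rq(new_cpu);

        /*
         * Nothing here accounts for the caller (e.g. push_rt_task() via
         * set_task_cpu()) already holding src_rq->lock, so lockdep flags
         * a recursive rq->lock acquisition and the CPU can lock up hard.
         */
        raw_spin_lock(&src_rq->lock);
        raw_spin_lock_nested(&dest_rq->lock, SINGLE_DEPTH_NESTING);

        /* ... move p's curr_window/prev_window contribution across ... */

        raw_spin_unlock(&dest_rq->lock);
        raw_spin_unlock(&src_rq->lock);
}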
Thanks, Vikram
On 16-Sep 19:09, Dietmar Eggemann wrote:
I caught a WALT-related hard lockup on a v4.7 kernel with only patch 1 on top. It is fairly easy to reproduce by watching a video in the Firefox browser on Ubuntu 16.04.
$ addr2line -e vmlinux ffffffff810d835e
kernel/sched/sched.h:1542

$ addr2line -e vmlinux ffffffff810d29b0
kernel/sched/sched.h:1538
1531 static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
1532         __releases(this_rq->lock)
1533         __acquires(busiest->lock)
1534         __acquires(this_rq->lock)
1535 {
1536         int ret = 0;
1537
1538         if (unlikely(!raw_spin_trylock(&busiest->lock))) {
1539                 if (busiest < this_rq) {
1540                         raw_spin_unlock(&this_rq->lock);
1541                         raw_spin_lock(&busiest->lock);
1542                         raw_spin_lock_nested(&this_rq->lock,
1543                                               SINGLE_DEPTH_NESTING);
1544                         ret = 1;
1545                 } else
1546                         raw_spin_lock_nested(&busiest->lock,
1547                                               SINGLE_DEPTH_NESTING);
1548         }
1549         return ret;
1550 }
To me this issue seems related to the one fixed by this patch from Todd:
https://android.googlesource.com/kernel/common/+/ab1b90f03a063f4ef9899835e9d...
We noticed the issue while working on an AOSP v3.18 kernel, but it is potentially still present in mainline kernels since the implementation of the locking functions has not been updated.
Here is how Todd described a possible race condition:
Thanks for the review. I've convinced myself that getting to move_queued_task() with the two cpus being the same is possible (but probably rare) since there are races between normal scheduler migration and the forced migration via the cpu_migration_thread. If the thread migrates naturally from the src to the dest and does it after the last check in __migrate_task, we get into this case. This can happen since we drop the rq lock during double_lock_balance allowing a migration behind our back while we are re-acquiring the rq lock.
And here is a summary of the analysis we did:
1. The double_(un)lock_balance() calls are mainly used by the rt/deadline code, where there are proper checks that the two RQs are not the same, while they are never used by core/fair, where the double_rq_(un)lock() calls are preferred.
2. All the usages of double_(un)lock_balance() in core/walt are introduced by the WALT patches. However, the invariant that the two RQs must not be the same is not always guaranteed in these paths.
3. The implementation of double_(un)lock_balance() in both the Android kernel and mainline is "asymmetric": in the CONFIG_PREEMPT case at least, the locking call is implemented using double_rq_lock(), which properly handles the two RQs being the same, while no such check is present in the unlocking function (see the snippet right after this list).
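For reference, here is roughly what the two sides look like in mainline (kernel/sched/sched.h around v4.7; quoting from memory, so double-check against the actual tree). The CONFIG_PREEMPT lock side goes through double_rq_lock(), which copes with the two RQs being the same, while double_unlock_balance() releases busiest->lock unconditionally:

#ifdef CONFIG_PREEMPT
/* Lock side: defers to double_rq_lock(), which handles rq1 == rq2. */
static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(this_rq->lock)
        __acquires(busiest->lock)
        __acquires(this_rq->lock)
{
        raw_spin_unlock(&this_rq->lock);
        double_rq_lock(this_rq, busiest);

        return 1;
}
#endif

/* Unlock side: no "same RQ" handling at all. */
static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(busiest->lock)
{
        raw_spin_unlock(&busiest->lock);
        lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
}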
Juri has also got confirmation from PeterZ that the double_(un)lock_balance() functions are not to be used when we cannot guarantee that the RQs are different. However, the asymmetry is still there, and thus this code deserves a fix in mainline as well, which is the one Todd added to the AOSP v3.18 kernel.
Perhaps a better solution for WALT would be to use the double_rq_(un)lock() primitives instead of the double_(un)lock_balance() ones, which would also make the code more aligned with the locking APIs already used in the core scheduler.
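For comparison, double_rq_lock() explicitly handles the case where the two RQs are the same and always takes the two locks in a fixed (address) order; in mainline it looks more or less like this:

static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
        __acquires(rq1->lock)
        __acquires(rq2->lock)
{
        BUG_ON(!irqs_disabled());
        if (rq1 == rq2) {
                raw_spin_lock(&rq1->lock);
                __acquire(rq2->lock);   /* Fake it out ;) */
        } else {
                if (rq1 < rq2) {
                        raw_spin_lock(&rq1->lock);
                        raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
                } else {
                        raw_spin_lock(&rq2->lock);
                        raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
                }
        }
}

double_rq_unlock() then only drops rq2->lock when the two RQs are different, so the lock/unlock pair stays symmetric.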
Cheers Patrick