On 03/09/16 00:27, markivx@codeaurora.org wrote:
From: Srivatsa Vaddagiri vatsa@codeaurora.org
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per-task and per-CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source, and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time (window size determined by walt_ravg_window) is used to determine CPU utilization.

What are these 'registered utilization callback listeners'?
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Corresponding to the aggregate statistics, WALT also maintains the following stats per task:
curr_window - represents the cpu utilization of the task in its most recently tracked window
prev_window - represents the cpu utilization of the task in the window prior to the one being tracked by curr_window
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
Execution time in an IRQ handler is accounted in a CPU's curr_runnable_sum statistic, provided that the CPU was also executing the idle task for the duration of the interrupt handler.
Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU rq has tasks blocked on IO, idle-task execution is accounted in per-task and per-CPU counters. Setting walt_io_is_busy will also cause interrupt handlers in the idle task to update counters as if the idle task was executing (instead of just the interrupt handler execution time).
The major tunable provided by WALT is walt_ravg_window, which represents the window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy (described above) is set to 0 by default.
Potential upcoming changes/improvements include: the use of sched_clock instead of ktime_get as a time source, support for an unsynchronized (across CPUs) time source, and integration with mainlined CPU efficiency APIs.
Signed-off-by: Srivatsa Vaddagiri vatsa@codeaurora.org
Signed-off-by: Vikram Mulukutla markivx@codeaurora.org
 include/linux/sched.h            |  35 +++
 include/linux/sched/sysctl.h     |   1 +
 include/trace/events/sched.h     |  76 ++++++
 init/Kconfig                     |   9 +
 kernel/sched/Makefile            |   1 +
 kernel/sched/core.c              |  28 +-
 kernel/sched/cpufreq_schedutil.c |   7 +-
 kernel/sched/cputime.c           |  11 +-
 kernel/sched/debug.c             |  10 +
 kernel/sched/fair.c              |   7 +-
 kernel/sched/sched.h             |  10 +
 kernel/sched/walt.c              | 540 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/walt.h              |  73 ++++++
 kernel/sysctl.c                  |   9 +
 14 files changed, 812 insertions(+), 5 deletions(-)
 create mode 100644 kernel/sched/walt.c
 create mode 100644 kernel/sched/walt.h
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f..56e708f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -314,6 +314,17 @@ extern char ___assert_task_state[1 - 2*!!(
 /* Task command name length */
 #define TASK_COMM_LEN 16
+enum task_event {
+	PUT_PREV_TASK	= 0,
+	PICK_NEXT_TASK	= 1,
+	TASK_WAKE	= 2,
+	TASK_MIGRATE	= 3,
+	TASK_UPDATE	= 4,
+	IRQ_UPDATE	= 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running. Is there a benefit, or are they used in account_cpu_busy_time() essentially as running/not running?
[...]
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..2adf245 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -562,6 +562,82 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
TP_printk("cpu=%d", __entry->cpu) );
+TRACE_EVENT(sched_walt_util,
+
+	TP_PROTO(int cpu, unsigned long cfs_util_avg, unsigned long walt_util),
+
+	TP_ARGS(cpu, cfs_util_avg, walt_util),
+
+	TP_STRUCT__entry(
+		__field( int, cpu )
+		__field( unsigned long, cfs_util_avg )
+		__field( unsigned long, walt_util )
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->cfs_util_avg = cfs_util_avg;
+		__entry->walt_util = walt_util;
+	),
+
+	TP_printk("cpu %d cfs_util_avg %lu walt_util %lu",
+		  __entry->cpu, __entry->cfs_util_avg, __entry->walt_util)
+);
+struct rq;
+TRACE_EVENT(sched_walt_update_task_ravg,
+
+	TP_PROTO(struct task_struct *p, struct rq *rq, enum task_event evt,
+		 u64 wallclock, u64 irqtime),
+
+	TP_ARGS(p, rq, evt, wallclock, irqtime),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN )
+		__field( pid_t, pid )
+		__field( pid_t, cur_pid )
+		__field( unsigned int, cpu )
+		__field( unsigned int, cur_freq )
+		__field( u64, wallclock )
+		__field( u64, mark_start )
+		__field( u64, win_start )
+		__field( u64, irqtime )
+		__field( enum task_event, evt )
+		__field( u64, rq_cs )
+		__field( u64, rq_ps )
+		__field( u32, curr_window )
+		__field( u32, prev_window )
+	),
+
+	TP_fast_assign(
+		__entry->wallclock = wallclock;
+		__entry->win_start = rq->window_start;
+		__entry->evt = evt;
+		__entry->cpu = rq->cpu;
+		__entry->cur_pid = rq->curr->pid;
+		__entry->cur_freq = rq->cur_freq;
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid = p->pid;
+		__entry->mark_start = p->ravg.mark_start;
+		__entry->irqtime = irqtime;
+		__entry->rq_cs = rq->curr_runnable_sum;
+		__entry->rq_ps = rq->prev_runnable_sum;
+		__entry->curr_window = p->ravg.curr_window;
+		__entry->prev_window = p->ravg.prev_window;
+	),
+
+	TP_printk("wc %llu ws %llu event %s cpu %d cur_freq %u cur_pid %d task %d (%s) ms %llu irqtime %llu rq_cs %llu rq_ps %llu cur_window %u prev_window %u",
+		  __entry->wallclock, __entry->win_start,
+		  task_event_names[__entry->evt], __entry->cpu,
+		  __entry->cur_freq, __entry->cur_pid,
+		  __entry->pid, __entry->comm, __entry->mark_start,
+		  __entry->irqtime, __entry->rq_cs, __entry->rq_ps,
+		  __entry->curr_window, __entry->prev_window)
+);
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
IMHO, trace_events can go into another patch.
If you try to compile w/o CONFIG_SCHED_WALT they break the build.
[...]
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_acquire(!p->on_cpu);

+	raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
[...]
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c
new file mode 100644
index 0000000..203e02d
--- /dev/null
+++ b/kernel/sched/walt.c
@@ -0,0 +1,540 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Window Assisted Load Tracking (WALT) implementation credits:
+ * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
+ * Pavan Kumar Kondeti, Olav Haugan
+ *
+ * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
+ *             and Todd Kjos
+ * 2016-08-31: Integration with mainline by Srivatsa Vaddagiri
+ *             and Vikram Mulukutla
+ */
+#include <linux/syscore_ops.h>
+#include <linux/cpufreq.h>
+#include <trace/events/sched.h>
+#include "sched.h"
+#include "walt.h"
+char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK",
+			    "TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE",
+			    "IRQ_UPDATE"};
+__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1;
+static __read_mostly unsigned int walt_freq_account_wait_time;
+static __read_mostly unsigned int walt_io_is_busy;
+
+/* 1 -> use PELT based load stats, 0 -> use window-based load stats */
+static unsigned int __read_mostly walt_disabled;
+static unsigned int max_possible_efficiency = 1024;
+/*
+ * Maximum possible frequency across all cpus. Task demand and cpu
+ * capacity (cpu_power) metrics are scaled in reference to it.
+ */
+static unsigned int max_possible_freq = 1;
+
+/* Window size (in ns) */
+__read_mostly unsigned int walt_ravg_window = 20000000;
+
+/* Min window size (in ns) = 10ms */
+#define MIN_SCHED_RAVG_WINDOW 10000000
+
+/* Max window size (in ns) = 1s */
+#define MAX_SCHED_RAVG_WINDOW 1000000000
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
[...]
+unsigned long __weak arch_get_cpu_efficiency(int cpu)
+{
+	return 1024;
+}
So this should be connected to arch_scale_cpu_capacity(NULL, cpu) instead. You mentioned this already somewhere.
+void walt_init_cpu_efficiency(void)
+{
+	int i, efficiency;
+	unsigned int max = 0;
+
+	for_each_possible_cpu(i) {
+		efficiency = arch_get_cpu_efficiency(i);
+		cpu_rq(i)->efficiency = efficiency;
+		if (efficiency > max)
+			max = efficiency;
+	}
+
+	if (max)
+		max_possible_efficiency = max;
+}
[...]
+void walt_account_irqtime(int cpu, struct task_struct *curr,
+			  u64 delta, u64 wallclock)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	/*
+	 * cputime (wallclock) uses sched_clock so use the same here for
+	 * consistency.
+	 */
+	delta += sched_clock_cpu(cpu) - wallclock;
+
+	walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta);
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+int fast_switching = 1;
An assumption that you're on x86. Should be mentioned somewhere in the patch-header.
+void walt_freq_transition(int cpu, unsigned long new_freq)
+{
+	int i;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	for_each_cpu(i, &cpu_rq(cpu)->freq_domain_cpumask) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!fast_switching ||
+		    (fast_switching && smp_processor_id() != i))
+			raw_spin_lock(&rq->lock);
+		walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
+				      walt_ktime_clock(), 0);
+		rq->cur_freq = new_freq;
+		if (!fast_switching ||
+		    (fast_switching && smp_processor_id() != i))
+			raw_spin_unlock(&rq->lock);
+
+		if (!rq->window_start)
+			walt_set_window_start(rq);
+	}
+
+	local_irq_restore(flags);
+}
+static int cpufreq_notifier_trans(struct notifier_block *nb,
+				  unsigned long val, void *data)
+{
+	struct cpufreq_freqs *freq = (struct cpufreq_freqs *)data;
+	unsigned int cpu = freq->cpu, new_freq = freq->new;
+
+	if (val != CPUFREQ_POSTCHANGE)
+		return 0;
+
+	BUG_ON(!new_freq);
+
+	if (cpu_rq(cpu)->cur_freq == new_freq)
+		return 0;
+
+	walt_freq_transition(cpu, new_freq);
+
+	return 0;
+}
+static struct notifier_block notifier_policy_block = {
+	.notifier_call = cpufreq_notifier_policy
+};
+
+static struct notifier_block notifier_trans_block = {
+	.notifier_call = cpufreq_notifier_trans
+};
+static int register_sched_callback(void)
+{
+	int ret;
+
+	ret = cpufreq_register_notifier(&notifier_policy_block,
+					CPUFREQ_POLICY_NOTIFIER);
+
+	if (!fast_switching)
+		ret = cpufreq_register_notifier(&notifier_trans_block,
+						CPUFREQ_TRANSITION_NOTIFIER);
+
+	return 0;
+}
+/*
+ * cpufreq callbacks can be registered at core_initcall or later time.
+ * Any registration done prior to that is "forgotten" by cpufreq. See
+ * initialization of variable init_cpufreq_transition_notifier_list_called
+ * for further information.
+ */
+core_initcall(register_sched_callback);
Why not use the mainline frequency-invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency-invariance needs? I.e., implement the FIE (Frequency Invariance Engine) in the arch code and link it to its users via '#define arch_scale_freq_capacity foo_scale_freq_capacity'.
[...]