* Dietmar Eggemann dietmar.eggemann@arm.com [2016-09-13 12:50:15]:
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
Anyone that has an interest in WALT data, mainly cpufreq governors. Also, a minor correction here - I don't think we envisioned any registration mechanism to share WALT data. In our production kernel, the cpufreq (interactive) governor pulls the data whenever it needs it by calling an exported function of the scheduler.
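For illustration, the pull model could look roughly like this (sched_get_busy() and the exact locking here are illustrative, not the actual exported symbol):

/*
 * Illustrative pull-style interface: the governor calls into the
 * scheduler whenever it wants fresh data; no registration involved.
 */
u64 sched_get_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;
	u64 busy;

	raw_spin_lock_irqsave(&rq->lock, flags);
	busy = rq->prev_runnable_sum;	/* last completed window */
	raw_spin_unlock_irqrestore(&rq->lock, flags);

	return busy;
}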
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
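For readers new to the scheme, the rollover between the two counters at a window boundary conceptually looks like this (a simplified sketch using the field names above plus a hypothetical per-rq window_start timestamp; not the exact patch code):

/*
 * Simplified window rollover: once walt_ravg_window nanoseconds have
 * elapsed since window_start, the aggregate of the just-completed
 * window becomes prev_runnable_sum and a fresh window begins.  If more
 * than one full window elapsed (fully idle CPU), the loop naturally
 * zeroes prev_runnable_sum as well.
 */
static void rollover_cpu_window(struct rq *rq, u64 now)
{
	while (now - rq->window_start >= walt_ravg_window) {
		rq->prev_runnable_sum = rq->curr_runnable_sum;
		rq->curr_runnable_sum = 0;
		rq->window_start += walt_ravg_window;
	}
}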
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Not sure I follow you here. Which "former" paragraph is being referred to here?
To add some clarity on the "policy" stuff Vikram is referring to here, prev_runnable_sum refers to the actual busy time incurred in the previous window. How that is used to decide the next frequency involves consideration of the desired headroom or idle time. For example, a CPU that was busy for 99% of the previous window while running at some frequency f1 may or may not see a frequency increase for the next window, depending on the "idle time" goals set by the user (which is the policy aspect involved here).
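As a purely illustrative example of that policy step, a governor targeting, say, 80% load - i.e. 20% idle headroom - might compute the next frequency from the previous window's busy percentage like this (pick_next_freq() is hypothetical):

/*
 * Hypothetical headroom policy: request the frequency at which the
 * previous window's busy time would settle at target_load percent.
 * busy_pct is prev_runnable_sum expressed as percent-of-window at the
 * current frequency.  With target_load = 80, a 99%-busy window bumps
 * the frequency; with target_load = 100 it would not.
 */
static unsigned int pick_next_freq(unsigned int cur_freq,
				   unsigned int busy_pct,
				   unsigned int target_load)
{
	return cur_freq * busy_pct / target_load;
}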
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
I think the main point of difference between PELT and WALT is the speed at which the "forecasted" demand of a task can be updated. For example, consider a task that was using 1% of a cpu for a long time and hence was considered a little task, fit to be run on a little cpu. It experiences a sudden burst of work, which makes it consume 100% of a little cpu (say at the little cpu's max_frequency). The speed with which we can classify the task as needing a big cpu is faster with WALT than PELT afaics. With WALT, it can be classified as a 'big' task as soon as one window completes (typically 10-20ms latency), whereas with PELT it will take longer. It could however be argued that the task may not exhibit the same high demand in the next window (and hence the conservative approach of PELT is better), but from our experience with mobile workloads (especially those involving graphics processing, which is largely window/frame-based) the window-based approach for predicting the immediate cpu/frequency needs of a task has worked better. I believe Google has also done a fair amount of validation on the usefulness of WALT (for mobile workloads), such that we have now come to the point of discussing the approach more widely.
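Coming back to the migration fixup described in the quoted changelog above, in sketch form it is little more than moving the task's window contributions between the two runqueues (field names illustrative; assumes both rq locks are held and the two CPUs' windows are already aligned):

/*
 * Sketch of the inter-CPU migration fixup: subtract the task's
 * per-window contributions from the source CPU's aggregates and add
 * them to the destination CPU's.  Window rollover handling omitted.
 */
static void fixup_busy_time(struct task_struct *p,
			    struct rq *src_rq, struct rq *dst_rq)
{
	src_rq->curr_runnable_sum -= p->ravg.curr_window;
	src_rq->prev_runnable_sum -= p->ravg.prev_window;

	dst_rq->curr_runnable_sum += p->ravg.curr_window;
	dst_rq->prev_runnable_sum += p->ravg.prev_window;
}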
+enum task_event {
+	PUT_PREV_TASK   = 0,
+	PICK_NEXT_TASK  = 1,
+	TASK_WAKE       = 2,
+	TASK_MIGRATE    = 3,
+	TASK_UPDATE     = 4,
+	IRQ_UPDATE      = 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running. Is there a benefit, or are they used in account_cpu_busy_time() essentially as running/not running?
I think we wanted some ability to discount wait time from reflecting in a task's cpu demand/usage. In the PICK_NEXT_TASK event, for example, we can ignore a task's wait time (despite it being in running status) and not add that as cpu/task busy time. It appears possible to achieve that goal by depending on just the on_rq and on_cpu fields of a task (and do away with the various event types that we currently have). We will explore that optimization for the next iteration.
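Roughly what we have in mind (account_as_busy() and the walt_account_wait_time knob are hypothetical, shown only to illustrate the simplification):

/* Hypothetical knob: should runnable-but-waiting time count as busy? */
static bool walt_account_wait_time = true;

/*
 * Classify elapsed time from on_rq/on_cpu alone instead of the
 * explicit event types above.
 */
static bool account_as_busy(struct task_struct *p)
{
	if (p->on_cpu)			/* actually executing */
		return true;
	if (p->on_rq)			/* runnable, waiting for a CPU */
		return walt_account_wait_time;
	return false;			/* sleeping/blocked */
}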
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_acquire(!p->on_cpu);

+	raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
rq->lock needs to be held before calling walt_update_task_ravg() and so this locking is unavoidable.
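In other words, the wakeup path ends up doing roughly the following (simplified from the patch):

	/* Take the remote task's rq lock purely so that
	 * walt_update_task_ravg() can update the window counters
	 * consistently for the TASK_WAKE event. */
	raw_spin_lock(&task_rq(p)->lock);
	walt_update_task_ravg(p, task_rq(p), TASK_WAKE,
			      walt_ktime_clock(), 0);
	raw_spin_unlock(&task_rq(p)->lock);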
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
WALT, as implemented currently, depends on maintaining windows that are synchronized across all cpus. This helps adjust cpu busy counters when tasks migrate. It requires a hardware clock synchronized across cpus (not a hard requirement, but a nice-to-have one, easily met on ARM) and also that one cpu be nominated as the "reference" for the window_start of the currently active window. cpu0 is the default sync_cpu and its window_start is initialized during bootup, based on the hardware clock (sched_clock()) value seen at that point. CPU0's window_start value is advanced periodically to reflect the expiration of fixed-size windows. Other cpus' window_start values are initialized later during bootup in reference to CPU0's window_start value. Once a secondary cpu's window_start has been 'synchronized' with CPU0's window_start, no further synchronization is required.
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
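To make the late initialization concrete, it could look roughly like this (walt_init_window_start() is an illustrative name; the catch-up arithmetic is the important part, keeping the new CPU's window boundaries in lockstep with the sync_cpu's):

/*
 * Illustrative sketch of secondary-CPU init: copy the sync_cpu's
 * window_start, then advance it by whole windows up to 'now' so both
 * CPUs' windows expire in lockstep from here on.
 */
static void walt_init_window_start(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct rq *sync_rq = cpu_rq(sync_cpu);
	u64 now = sched_clock();

	raw_spin_lock(&sync_rq->lock);
	rq->window_start = sync_rq->window_start;
	if (now > rq->window_start)
		rq->window_start += div64_u64(now - rq->window_start,
				walt_ravg_window) * walt_ravg_window;
	raw_spin_unlock(&sync_rq->lock);
}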
Why not use the mainline frequency invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency invariance needs, i.e. implement the FIE (Frequency Invariance Engine) in the arch and link it to its users by '#define arch_scale_freq_capacity foo_scale_freq_capacity'?
Yes, that's on our todo-list (to adopt upstream-available mechanisms for scaling cpu stats).
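For reference, once that is done, WALT's busy-time scaling could route through the mainline hook roughly like so (a sketch; the 4.x arch_scale_freq_capacity() signature takes a sched_domain pointer that frequency scaling ignores):

/*
 * Sketch: make a raw execution delta frequency-invariant via the
 * mainline interface instead of WALT's private frequency tracking.
 */
static u64 scale_exec_time(u64 delta, int cpu)
{
	unsigned long scale = arch_scale_freq_capacity(NULL, cpu);

	return (delta * scale) >> SCHED_CAPACITY_SHIFT;
}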
- vatsa