* Dietmar Eggemann dietmar.eggemann@arm.com [2016-09-13 12:50:15]:
This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to the registered utilization callback listeners. A windowed view of time
What are these 'registered utilization callback listeners'?
Anyone that has an interest in WALT data, mainly cpufreq governors. Also, a minor correction here - I don't think we envisioned any registration mechanism to share WALT data. In our production kernel, the cpufreq (interactive) governor pulls the data whenever it needs it by calling an exported function of the scheduler.
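For illustration, the pull model could look roughly like this (sched_get_busy() and the exact locking here are illustrative, not the actual exported symbol):

/*
 * Illustrative pull-style interface: the governor calls into the
 * scheduler whenever it wants fresh data; no registration involved.
 */
u64 sched_get_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;
	u64 busy;

	raw_spin_lock_irqsave(&rq->lock, flags);
	busy = rq->prev_runnable_sum;	/* last completed window */
	raw_spin_unlock_irqrestore(&rq->lock, flags);

	return busy;
}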
(window size determined by walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by WALT, both normalized to the max possible frequency and the max efficiency (IPC) of that CPU:
curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window
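For readers new to the scheme, the rollover between the two counters at a window boundary conceptually looks like this (a simplified sketch using the field names above plus a hypothetical per-rq window_start timestamp; not the exact patch code):

/*
 * Simplified window rollover: once walt_ravg_window nanoseconds have
 * elapsed since window_start, the aggregate of the just-completed
 * window becomes prev_runnable_sum and a fresh window begins.  If more
 * than one full window elapsed (fully idle CPU), the loop naturally
 * zeroes prev_runnable_sum as well.
 */
static void rollover_cpu_window(struct rq *rq, u64 now)
{
	while (now - rq->window_start >= walt_ravg_window) {
		rq->prev_runnable_sum = rq->curr_runnable_sum;
		rq->curr_runnable_sum = 0;
		rq->window_start += walt_ravg_window;
	}
}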
prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->util_avg. No additional policy is imposed on this
s/cfs_rq->util_avg/cfs_rq->avg.util_avg
statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., margin) before deciding the next P-state.
The former paragraph is related to 'return (util >= capacity) ? capacity : util;' in cpu_util()? Just asking because otherwise IMHO this is no different to PELT util.
Not sure I follow you here. Which "former" paragraph is being referred to here?
To add some clarity on the "policy" stuff Vikram is referring to here, prev_runnable_sum refers to the actual busy time incurred in the previous window. How that is used to decide the next frequency involves consideration of the desired headroom or idle time. For example, a CPU that was busy for 99% of the previous window while running at some frequency f1 may or may not see a frequency increase for the next window, depending on the "idle time" goals set by the user (which is the policy aspect involved here).
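As a purely illustrative example of that policy step, a governor targeting, say, 80% load - i.e. 20% idle headroom - might compute the next frequency from the previous window's busy percentage like this (pick_next_freq() is hypothetical):

/*
 * Hypothetical headroom policy: request the frequency at which the
 * previous window's busy time would settle at target_load percent.
 * busy_pct is prev_runnable_sum expressed as percent-of-window at the
 * current frequency.  With target_load = 80, a 99%-busy window bumps
 * the frequency; with target_load = 100 it would not.
 */
static unsigned int pick_next_freq(unsigned int cur_freq,
				   unsigned int busy_pct,
				   unsigned int target_load)
{
	return cur_freq * busy_pct / target_load;
}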
WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch), task wakeup and during task migration. Migration simply involves removing a task's curr_window and prev_window from the source CPU's curr_runnable_sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters.
PELT util updates are event-driven as well. The difference is that WALT operates in core.c whereas PELT util only considers the CFS class.
I think the main point of difference between PELT and WALT is the speed at which the "forecasted" demand of a task can be updated. For example, consider a task that was using 1% of a cpu for a long time and hence was considered a little task, fit to be run on a little cpu. It experiences a sudden burst of work, which makes it consume 100% of a little cpu (say at the little cpu's max_frequency). The speed with which we can classify the task as needing a big cpu is faster with WALT than PELT afaics. With WALT, it can be classified as a 'big' task as soon as one window completes (typically 10-20ms latency), whereas with PELT it will take longer. It could however be argued that the task may not exhibit the same high demand in the next window (and hence the conservative approach of PELT is better), but from our experience with mobile workloads (especially those involving graphics processing, which is largely window/frame-based) the window-based approach for predicting the immediate cpu/frequency needs of a task has worked better. I believe Google has also done a fair amount of validation on the usefulness of WALT (for mobile workloads), such that we have now come to the point of discussing the approach more widely.
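Coming back to the migration fixup described in the quoted changelog above, in sketch form it is little more than moving the task's window contributions between the two runqueues (field names illustrative; assumes both rq locks are held and the two CPUs' windows are already aligned):

/*
 * Sketch of the inter-CPU migration fixup: subtract the task's
 * per-window contributions from the source CPU's aggregates and add
 * them to the destination CPU's.  Window rollover handling omitted.
 */
static void fixup_busy_time(struct task_struct *p,
			    struct rq *src_rq, struct rq *dst_rq)
{
	src_rq->curr_runnable_sum -= p->ravg.curr_window;
	src_rq->prev_runnable_sum -= p->ravg.prev_window;

	dst_rq->curr_runnable_sum += p->ravg.curr_window;
	dst_rq->prev_runnable_sum += p->ravg.prev_window;
}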
+enum task_event {
+	PUT_PREV_TASK   = 0,
+	PICK_NEXT_TASK  = 1,
+	TASK_WAKE       = 2,
+	TASK_MIGRATE    = 3,
+	TASK_UPDATE     = 4,
+	IRQ_UPDATE      = 5,
+};
I always ask myself why WALT has fine-grained event types where PELT can do with running/not running. Is there a benefit, or are they used in account_cpu_busy_time() essentially as running/not running?
I think we wanted some ability to discount wait time from reflecting in a task's cpu demand/usage. In the PICK_NEXT_TASK event, for example, we can ignore a task's wait time (despite it being in running status) and not add that as cpu/task busy time. It appears possible to achieve that goal by depending on just the on_rq and on_cpu fields of a task (and do away with the various event types that we currently have). We will explore that optimization for the next iteration.
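Roughly what we have in mind (account_as_busy() and the walt_account_wait_time knob are hypothetical, shown only to illustrate the simplification):

/* Hypothetical knob: should runnable-but-waiting time count as busy? */
static bool walt_account_wait_time = true;

/*
 * Classify elapsed time from on_rq/on_cpu alone instead of the
 * explicit event types above.
 */
static bool account_as_busy(struct task_struct *p)
{
	if (p->on_cpu)			/* actually executing */
		return true;
	if (p->on_rq)			/* runnable, waiting for a CPU */
		return walt_account_wait_time;
	return false;			/* sleeping/blocked */
}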
@@ -2049,6 +2053,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_acquire(!p->on_cpu);

+	raw_spin_lock(&task_rq(p)->lock);
This is extra locking, right?
rq->lock needs to be held before calling walt_update_task_ravg() and so this locking is unavoidable.
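In other words, the wakeup path ends up doing roughly the following (simplified from the patch):

	/* Take the remote task's rq lock purely so that
	 * walt_update_task_ravg() can update the window counters
	 * consistently for the TASK_WAKE event. */
	raw_spin_lock(&task_rq(p)->lock);
	walt_update_task_ravg(p, task_rq(p), TASK_WAKE,
			      walt_ktime_clock(), 0);
	raw_spin_unlock(&task_rq(p)->lock);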
+static unsigned int sync_cpu;
What is the business of this sync_cpu?
WALT, as implemented currently, depends on maintaining windows that are synchronized across all cpus. This helps adjust cpu busy counters when tasks migrate. It requires a hardware clock synchronized across cpus (not a hard requirement, but a nice-to-have one, easily met on ARM) and also that one cpu be nominated as the "reference" for the window_start of the currently active window. cpu0 is the default sync_cpu and its window_start is initialized during bootup, based on the hardware clock (sched_clock()) value seen at that point. CPU0's window_start value is advanced periodically to reflect the expiration of fixed-size windows. Other cpus' window_start values are initialized later during bootup in reference to CPU0's window_start value. Once a secondary cpu's window_start has been 'synchronized' with CPU0's window_start, no further synchronization is required.
We realize that not all architectures have a hardware clock that is synchronized across CPUs, and I think it should still be possible to have synchronized windows as long as the frequency of the hardware clock is the same on all cpus. That would be the next major change WALT needs to address.
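To make the late initialization concrete, it could look roughly like this (walt_init_window_start() is an illustrative name; the catch-up arithmetic is the important part, keeping the new CPU's window boundaries in lockstep with the sync_cpu's):

/*
 * Illustrative sketch of secondary-CPU init: copy the sync_cpu's
 * window_start, then advance it by whole windows up to 'now' so both
 * CPUs' windows expire in lockstep from here on.
 */
static void walt_init_window_start(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct rq *sync_rq = cpu_rq(sync_cpu);
	u64 now = sched_clock();

	raw_spin_lock(&sync_rq->lock);
	rq->window_start = sync_rq->window_start;
	if (now > rq->window_start)
		rq->window_start += div64_u64(now - rq->window_start,
				walt_ravg_window) * walt_ravg_window;
	raw_spin_unlock(&sync_rq->lock);
}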
Why not use the mainline frequency invariance interface 'arch_scale_freq_capacity()' for all of WALT's frequency invariance needs, i.e. implement the FIE (Frequency Invariance Engine) in the arch and link it to its users by '#define arch_scale_freq_capacity foo_scale_freq_capacity'?
Yes, that's on our todo-list (to adopt upstream-available mechanisms for scaling cpu stats).
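For reference, once that is done, WALT's busy-time scaling could route through the mainline hook roughly like so (a sketch; the 4.x arch_scale_freq_capacity() signature takes a sched_domain pointer that frequency scaling ignores):

/*
 * Sketch: make a raw execution delta frequency-invariant via the
 * mainline interface instead of WALT's private frequency tracking.
 */
static u64 scale_exec_time(u64 delta, int cpu)
{
	unsigned long scale = arch_scale_freq_capacity(NULL, cpu);

	return (delta * scale) >> SCHED_CAPACITY_SHIFT;
}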
- vatsa