The current approach to selecting an idle state is based on statistics computed over the idle periods.
Needless to say, this approach, via the menu governor, has satisfied everyone as a way to find the best trade-off between performance and energy saving.
However, the kernel is evolving to act proactively on energy constraints through the scheduler, whereas the different power management subsystems do not collaborate with the scheduler as the conductor of the decisions: they all act independently.
In order to integrate the cpuidle framework into the scheduler, we have to radically change the approach by clearly identifying what causes a wakeup and how it behaves. The cpuidle governors are based on idle period statistics, hence they have no knowledge of what woke up the cpu. Among these wakeup sources, the IPIs are of course accounted for, which results in doing statistics on the scheduler's own behavior too. It makes no sense to let the scheduler take a decision based on a prediction of its own decisions.
This series inverts the logic.
First, there is a small statistics library to do basic and fast statistics computation. It is put in the library directory to make it available to everyone. It is mathematically proven that there is no overflow in the code (check the log and comments).
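To illustrate the kind of computation such a library performs, here is a minimal user-space sketch of a sliding-window mean and variance using only integer arithmetic. It is not the code from lib/stats.c: the names (struct stats, stats_add, ...), the window size and the sample cap are all assumptions chosen so that the 64-bit accumulators provably cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

#define STATS_WINDOW	8		/* hypothetical window size */
#define STATS_MAX_VALUE	((1U << 28) - 1)	/* hypothetical cap on a sample */

/*
 * With samples below 2^28 and 8 samples per window:
 *   sum        < 2^31
 *   sq_sum     < 2^59
 *   n * sq_sum < 2^62 and sum * sum < 2^62
 * so every intermediate result fits in a uint64_t.
 */
struct stats {
	uint32_t values[STATS_WINDOW];	/* circular buffer of samples */
	uint64_t sum;			/* sum of the samples in the window */
	uint64_t sq_sum;		/* sum of the squared samples */
	unsigned int n;			/* number of valid samples */
	unsigned int w_ptr;		/* next slot to overwrite */
};

static void stats_add(struct stats *s, uint32_t value)
{
	uint32_t *slot = &s->values[s->w_ptr];

	if (value > STATS_MAX_VALUE)
		value = STATS_MAX_VALUE;

	if (s->n == STATS_WINDOW) {
		/* The window is full: retire the oldest sample first. */
		s->sum -= *slot;
		s->sq_sum -= (uint64_t)*slot * *slot;
	} else {
		s->n++;
	}

	*slot = value;
	s->sum += value;
	s->sq_sum += (uint64_t)value * value;
	s->w_ptr = (s->w_ptr + 1) % STATS_WINDOW;
}

static uint64_t stats_mean(const struct stats *s)
{
	assert(s->n > 0);
	return s->sum / s->n;
}

/* Population variance: (n * sum(x^2) - sum(x)^2) / n^2 */
static uint64_t stats_variance(const struct stats *s)
{
	uint64_t n = s->n;

	assert(n > 0);
	return (n * s->sq_sum - s->sum * s->sum) / (n * n);
}
```

Keeping the sum and the sum of squares up to date on each insertion makes both the mean and the variance O(1) to read, which matters if the values are consumed in the idle path.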
The second patch provides a callback that can be registered in the irq subsystem and is called, with a timestamp, each time an interrupt is handled. Interrupts related to timers are discarded.
The third patch uses the callback provided by the patch above to compute an average interval for each interrupt on each cpu. When the interrupt intervals fall within the mean value +/- the standard deviation, the source of wakeup is considered stable and enters the 'predictable' category. The next predicted wakeup for a specific cpu is then the minimum of the remaining times before each interrupt's next predicted occurrence and the next timer expiry.
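The prediction described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the code from kernel/sched/idle-sched.c: struct irq_timing and both function names are hypothetical, the stability test is applied to the last observed interval only, and predictions that are already overdue are simply ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cpu, per-interrupt timing record (times in us). */
struct irq_timing {
	int64_t last_ts;	/* timestamp of the last occurrence */
	int64_t last_interval;	/* interval preceding the last occurrence */
	int64_t mean;		/* mean of the observed intervals */
	int64_t stddev;		/* standard deviation of the intervals */
};

/* Stable if the last interval lies within mean +/- stddev (assumption). */
static bool irq_is_predictable(const struct irq_timing *t)
{
	int64_t diff = t->last_interval - t->mean;

	if (diff < 0)
		diff = -diff;

	return diff <= t->stddev;
}

/*
 * Remaining time before the next expected wakeup on a cpu: the minimum
 * of the timer expiry and of each predictable interrupt's next expected
 * occurrence (last_ts + mean).
 */
static int64_t next_wakeup(const struct irq_timing *irqs, int nr,
			   int64_t now, int64_t timer_remaining)
{
	int64_t next = timer_remaining;
	int i;

	assert(irqs || nr == 0);

	for (i = 0; i < nr; i++) {
		int64_t remaining;

		if (!irq_is_predictable(&irqs[i]))
			continue;

		remaining = irqs[i].last_ts + irqs[i].mean - now;

		/* Overdue prediction: ignore it (a choice of this sketch). */
		if (remaining <= 0)
			continue;

		if (remaining < next)
			next = remaining;
	}

	return next;
}
```

For example, an interrupt last seen at t=1000 with a mean interval of 100 is expected again at t=1100, so at t=1050 it contributes a remaining time of 50.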
These are the results with a workload emulator (mp3, video, browser, ...) on a Dual Xeon 6 cores. Each test has been run 10 times.
--------------------------
Successful predictions (%)
--------------------------

scripts/rt-app-browser.sh.menu.dat:
    N    min     max     sum      mean     stddev
    10   56.51   68.61   631.27   63.127   3.6882

scripts/rt-app-browser.sh.irq.dat:
    N    min     max     sum      mean     stddev
    10   72.88   79.94   774.43   77.443   2.10055

--------------------------
Successful predictions (%)
--------------------------

scripts/rt-app-mp3.sh.menu.dat:
    N    min     max     sum      mean     stddev
    10   65.4    69.53   675.51   67.551   1.42503

scripts/rt-app-mp3.sh.irq.dat:
    N    min     max     sum      mean     stddev
    10   82.03   92.13   854.69   85.469   2.63553

--------------------------
Successful predictions (%)
--------------------------

scripts/rt-app-video.sh.menu.dat:
    N    min     max     sum      mean     stddev
    10   57.69   77.72   625.58   62.558   5.54488

scripts/rt-app-video.sh.irq.dat:
    N    min     max     sum      mean     stddev
    10   73.19   75.2    742.33   74.233   0.752316

--------------------------
Successful predictions (%)
--------------------------

scripts/video.sh.menu.dat:
    N    min     max     sum      mean     stddev
    10   40.7    59.08   463.02   46.302   5.25094

scripts/video.sh.irq.dat:
    N    min     max     sum      mean     stddev
    10   29.64   84.59   425.58   42.558   16.007
The next prediction algorithm is very simple at the moment but it opens the door for the following improvements:

- Detect patterns (eg. 1, 1, 3, 1, 1, 3, ...)
- Each device behaves differently, thus the prediction algorithm can be per interrupt. Eg. disk IOs have a burst of fast interrupts followed by a couple of slow interrupts.
If a simplistic algorithm gives better results than the menu governor, there is a high probability an optimized one will do much better.
* Regarding how this integrates into the scheduler
At the moment the integration is only a first step, hence it is very small: when the scheduler tries to find a cpu, it avoids using an idle cpu whose idle period has not yet reached the energy break-even point.
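The break-even check can be pictured with the following user-space sketch. It is not the change made to kernel/sched/fair.c: struct cpu_idle_info and the two helpers are hypothetical names, and the break-even point is assumed to be the target residency of the idle state the cpu is in, compared against the time elapsed since the stored idle start timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cpu idle bookkeeping (times in us). */
struct cpu_idle_info {
	bool idle;			/* is the cpu currently idle? */
	int64_t idle_start;		/* timestamp when it entered idle */
	int64_t target_residency;	/* break-even time of its idle state */
};

/* The cpu has slept long enough to pay back the cost of entering idle. */
static bool cpu_reached_break_even(const struct cpu_idle_info *c, int64_t now)
{
	return now - c->idle_start >= c->target_residency;
}

/*
 * Return the first idle cpu that already reached its break-even point,
 * or -1 if waking any idle cpu now would waste energy.
 */
static int find_idle_cpu(const struct cpu_idle_info *cpus, int nr, int64_t now)
{
	int i;

	assert(cpus || nr == 0);

	for (i = 0; i < nr; i++) {
		if (cpus[i].idle && cpu_reached_break_even(&cpus[i], now))
			return i;
	}

	return -1;
}
```

A real selection would of course weigh this against the other criteria the scheduler already uses; the sketch only isolates the residency test.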
The API to enter idle is simplified on purpose, so that the scheduler can take a decision between the moment it asks when the next wakeup is expected on the cpu and the moment it enters idle.
- sched_idle_next_wakeup() => returns an s64 telling the remaining time before a wakeup occurs
- sched_idle(duration, latency) => goes idle with the specified duration and the latency constraint
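A caller would presumably chain the two functions as sketched below. Only the two function names and the s64 return value come from the description above; the parameter types, the int return value of sched_idle() and the stub bodies are assumptions, present only to make the example self-contained in user space. idle_once() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t s64;

/* State used by the stand-in stubs below, not part of the series. */
static s64 fake_next_wakeup_us = 1500;
static s64 last_duration = -1;
static unsigned int last_latency;

/* Stub: the real function returns the remaining time before a wakeup. */
static s64 sched_idle_next_wakeup(void)
{
	return fake_next_wakeup_us;
}

/* Stub: the real function goes idle; here we only record the request. */
static int sched_idle(s64 duration, unsigned int latency)
{
	assert(duration > 0);
	last_duration = duration;
	last_latency = latency;
	return 0;
}

/*
 * Ask for the next expected wakeup, leave room for a decision, then
 * enter idle for that duration under the given latency constraint.
 */
static int idle_once(unsigned int latency_req)
{
	s64 next = sched_idle_next_wakeup();

	/* A wakeup is imminent or overdue: do not bother going idle. */
	if (next <= 0)
		return -1;

	/* The scheduler could act on 'next' here before committing. */

	return sched_idle(next, latency_req);
}
```

Splitting the query from the idle entry is what gives the scheduler a window in which to use the predicted duration.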
Daniel Lezcano (9):
  lib: Add a simple statistics library
  irq: Add a framework to measure interrupt timings
  sched: idle: IRQ based next prediction for idle period
  sched-idle: Plug sched idle with the idle task
  cpuidle: Add statistics and debug information with debugfs
  cpuidle: Store the idle start time stamp
  sched: fair: Fix wrong idle timestamp usage
  sched/fair: Prevent to break the target residency
  sched-idle: Add a debugfs entry to switch from cpuidle to sched-idle

Nicolas Pitre (1):
  idle-sched: Add a trace event when an interrupt occurs
 drivers/cpuidle/Kconfig    |  12 ++
 drivers/cpuidle/Makefile   |   2 +
 drivers/cpuidle/cpuidle.c  |  16 +-
 drivers/cpuidle/debugfs.c  | 232 +++++++++++++++++++++++
 drivers/cpuidle/debugfs.h  |  19 ++
 include/linux/cpuidle.h    |  16 ++
 include/linux/interrupt.h  |  45 +++++
 include/linux/irqdesc.h    |   3 +
 include/linux/stats.h      |  29 +++
 include/trace/events/irq.h |  44 +++++
 kernel/irq/Kconfig         |   3 +
 kernel/irq/handle.c        |  12 ++
 kernel/irq/manage.c        |  65 ++++++-
 kernel/sched/Makefile      |   1 +
 kernel/sched/fair.c        |  44 +++--
 kernel/sched/idle-sched.c  | 449 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/idle.c        |  11 +-
 kernel/sched/sched.h       |  20 ++
 lib/Makefile               |   3 +-
 lib/stats.c                | 235 ++++++++++++++++++++++++
 20 files changed, 1239 insertions(+), 22 deletions(-)
 create mode 100644 drivers/cpuidle/debugfs.c
 create mode 100644 drivers/cpuidle/debugfs.h
 create mode 100644 include/linux/stats.h
 create mode 100644 kernel/sched/idle-sched.c
 create mode 100644 lib/stats.c

--
1.9.1