The new power aware scheduling framework is being designed with a goal that all the cpu power management is in one place. Today the power management policies are fragmented between the cpuidle and cpufreq subsystems, which makes power management inconsistent. To top this, we were integrating task packing algorithms into the scheduler which could potentially worsen the scenario.
The new power aware scheduler design will have all policies, all metrics, all averaging concerning cpuidle and cpufrequency in one place, that being the scheduler. This patchset lays the foundation for this approach to help remove the existing fragmented approach towards cpu power savings.
NOTE: This patchset targets only cpuidle. cpu-frequency can be integrated into this design on the same lines.
The design is broken down into incremental steps which will enable easy validation of the power aware scheduler. This by no means is complete and will require more work to get to a stage where it can beat the current approach. Like I said this is just the foundation to help us get started. The subsequent patches can be small incremental measured steps.
Ingo had pointed out this approach in http://lwn.net/Articles/552889/ and I have tried my best at understanding and implementing the initial steps that he suggested.
1.Start from the dumbest possible state: all CPUs are powered up fully, there's no idle state selection essentially.
2.Then go for the biggest effect first and add the ability to idle in a lower power state (with new functions and a low level driver that implements this for the platform with no policy embedded into it.
3.Implement the task packing algorithm.
This patchset implements the above three steps and makes the fundamental design of power aware scheduler clear. It shows how:
1.The design should be non intrusive with the existing code. It should be enabled/disabled by a config switch. This way we can continue to work towards making it better without having to worry about regressing the kernel and yet have it in the kernel at the same time; a confidence booster that it is making headway. CONFIG_SCHED_POWER is the switch that makes the new code appear when turned on and disappear and default to the original code when turned off.
2.The design should help us test it better. Like Ingo pointed out:
"Important: it's not a problem that the initial code won't outperform the current kernel's performance. It should outperform the _initial_ 'dumb' code in the first step. Then the next step should outperform the previous step, etc. The quality of this iterative approach will eventually surpass the combined effect of currently available but non-integrated facilities."
This is precisely what this design does. PATCH[1/19] disables cpuidle and cpufrequency sub systems altogether if CONFIG_SCHED_POWER is enabled. This is the dumb code. Our subsequent patches should outperform this.
3. Introduce a low level driver which interfaces scheduler with C-state switching. Again Ingo had pointed out this saying: "It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence."
PATCH[2/19] ensures that CPUIDLE governors no longer control idle state selection. The idle state selection and policies are moved into kernel/sched/power.c. True, its the same code from the menu governor, however it has been moved into scheduler specific code and no longer functions like a driver. Its meant to be part of the core kernel. The "low level driver" lives under drivers/cpuidle/cpuidle.c like before. It registers platform specific cpuidle drivers and does other low level stuff that the scheduler needn't bother about. It has no policies embedded into it whatsoever. Importantly it is an entry point to switching C states and nothing beyond that.
PATCH[3/19] enumerates idle states and parameters in the scheduler topology. This is so that the scheduler knows the cost of entry/exit into idle states that can be made use of going ahead. As an example, this patchset shows how the platform specific cpuidle driver should help fill up the idle state details into the topology. This fundamental information is missing today in the scheduler.
These two patches are not expected to change the performance/power savings in any way. They are just the first steps towards the integrated approach of the power aware scheduler.
The patches PATCH[4/19] to PATCH[18/19] do task packing. This series is the one that Alex Shi had posted long ago https://lkml.org/lkml/2013/3/30/78. However this patch series will come into effect only if CONFIG_SCHED_POWER is enabled. It is this series which is expected to bring about changes in performance and power savings; not necessarily better than the existing code, but certainly should be better than the dumb code.
Our subsequent efforts should surpass the performance/powersavings of the existing code. This patch series is compile tested only.
V1 of this power efficient scheduling design was posted by Morten after Ingo posted his suggestions on http://lwn.net/Articles/552889/. [RFC][PATCH 0/9] sched: Power scheduler design proposal: https://lkml.org/lkml/2013/7/15/101 But it decoupled the scheduler into the regular and power scheduler with the latter controlling the cpus that could be used by the regular scheduler. We do not need this kind of decoupling. With the foundation that this patch set lays, it must be relatively easy to make the existing scheduler power aware.
---
Alex Shi (16): sched: add sched balance policies in kernel sched: add sysfs interface for sched_balance_policy selection sched: log the cpu utilization at rq sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing sched: move sg/sd_lb_stats struct ahead sched: get rq potential maximum utilization sched: detect wakeup burst with rq->avg_idle sched: add power aware scheduling in fork/exec/wake sched: using avg_idle to detect bursty wakeup sched: packing transitory tasks in wakeup power balancing sched: add power/performance balance allow flag sched: pull all tasks from source grp and no balance for prefer_sibling sched: add new members of sd_lb_stats sched: power aware load balance sched: lazy power balance sched: don't do power balance on share cpu power domain
Preeti U Murthy (3): sched/power: Remove cpu idle state selection and cpu frequency tuning sched/power: Move idle state selection into the scheduler sched/idle: Enumerate idle states in scheduler topology
Documentation/ABI/testing/sysfs-devices-system-cpu | 23 + arch/powerpc/Kconfig | 1 arch/powerpc/platforms/powernv/Kconfig | 12 drivers/cpufreq/Kconfig | 2 drivers/cpuidle/Kconfig | 10 drivers/cpuidle/cpuidle-powernv.c | 10 drivers/cpuidle/cpuidle.c | 65 ++ include/linux/sched.h | 16 - include/linux/sched/sysctl.h | 3 kernel/Kconfig.sched | 11 kernel/sched/Makefile | 1 kernel/sched/debug.c | 3 kernel/sched/fair.c | 632 +++++++++++++++++++- kernel/sched/power.c | 480 +++++++++++++++ kernel/sched/sched.h | 16 + kernel/sysctl.c | 9 16 files changed, 1234 insertions(+), 60 deletions(-) create mode 100644 kernel/Kconfig.sched create mode 100644 kernel/sched/power.c
--
As a first step towards improving the power awareness of the scheduler, this patch enables a "dumb" state where all power management is turned off. Whatever additionally we put into the kernel for cpu power management must do better than this in terms of performance as well as powersavings. This will enable us to benchmark and optimize the power aware scheduler from scratch.If we are to benchmark it against the performance of the existing design, we will get sufficiently distracted by the performance numbers and get steered away from a sane design.
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
arch/powerpc/Kconfig | 1 + arch/powerpc/platforms/powernv/Kconfig | 12 ++++++------ drivers/cpufreq/Kconfig | 2 ++ drivers/cpuidle/Kconfig | 2 ++ kernel/Kconfig.sched | 11 +++++++++++ 5 files changed, 22 insertions(+), 6 deletions(-) create mode 100644 kernel/Kconfig.sched
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 80b94b0..b7fe36a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -301,6 +301,7 @@ config HIGHMEM
source kernel/Kconfig.hz source kernel/Kconfig.preempt +source kernel/Kconfig.sched source "fs/Kconfig.binfmt"
config HUGETLB_PAGE_SIZE_VARIABLE diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 45a8ed0..b0ef8b1 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,12 +11,12 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM - select CPU_FREQ - select CPU_FREQ_GOV_PERFORMANCE - select CPU_FREQ_GOV_POWERSAVE - select CPU_FREQ_GOV_USERSPACE - select CPU_FREQ_GOV_ONDEMAND - select CPU_FREQ_GOV_CONSERVATIVE + select CPU_FREQ if !SCHED_POWER + select CPU_FREQ_GOV_PERFORMANCE if CPU_FREQ + select CPU_FREQ_GOV_POWERSAVE if CPU_FREQ + select CPU_FREQ_GOV_USERSPACE if CPU_FREQ + select CPU_FREQ_GOV_ONDEMAND if CPU_FREQ + select CPU_FREQ_GOV_CONSERVATIVE if CPU_FREQ select PPC_DOORBELL default y
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index ffe350f..8976fd6 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -2,6 +2,7 @@ menu "CPU Frequency scaling"
config CPU_FREQ bool "CPU Frequency scaling" + depends on !SCHED_POWER help CPU Frequency scaling allows you to change the clock speed of CPUs on the fly. This is a nice method to save power, because @@ -12,6 +13,7 @@ config CPU_FREQ (see below) after boot, or use a userspace tool.
For details, take a look at file:Documentation/cpu-freq. + This feature will turn off if power aware scheduling is enabled.
If in doubt, say N.
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 32748c3..2c4ac79 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -3,6 +3,7 @@ menu "CPU Idle" config CPU_IDLE bool "CPU idle PM support" default y if ACPI || PPC_PSERIES + depends on !SCHED_POWER select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE) select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) help @@ -11,6 +12,7 @@ config CPU_IDLE governors that can be swapped during runtime.
If you're using an ACPI-enabled platform, you should say Y here. + This feature will turn off if power aware scheduling is enabled.
if CPU_IDLE
diff --git a/kernel/Kconfig.sched b/kernel/Kconfig.sched new file mode 100644 index 0000000..374454c --- /dev/null +++ b/kernel/Kconfig.sched @@ -0,0 +1,11 @@ +menu "Power Aware Scheduling" + +config SCHED_POWER + bool "Power Aware Scheduler" + default n + help + Select this to enable the new power aware scheduler. +endmenu + + +
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
As a first step towards improving the power awareness of the scheduler, this patch enables a "dumb" state where all power management is turned off. Whatever additionally we put into the kernel for cpu power management must do better than this in terms of performance as well as powersavings. This will enable us to benchmark and optimize the power aware scheduler from scratch.If we are to benchmark it against the performance of the existing design, we will get sufficiently distracted by the performance numbers and get steered away from a sane design.
I understand your goal here, but people *will* compare performance between the old and the new design anyway. So I think it would be a better approach to simply let the existing code be and create a new scheduler-based governor that can be swapped with the existing ones at run time. Eventually we'll want average users to test and compare this, and asking them to recompile a second kernel and reboot between them might get unwieldy to many people.
And by allowing both to coexist at run time, we're making sure both the old and the new code are built helping not breaking the old code. And that will also cut down on the number of #ifdefs in many places.
In other words, CONFIG_SCHED_POWER is needed to select the scheduler based governor but it shouldn't force the existing code disabled.
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com
arch/powerpc/Kconfig | 1 + arch/powerpc/platforms/powernv/Kconfig | 12 ++++++------ drivers/cpufreq/Kconfig | 2 ++ drivers/cpuidle/Kconfig | 2 ++ kernel/Kconfig.sched | 11 +++++++++++ 5 files changed, 22 insertions(+), 6 deletions(-) create mode 100644 kernel/Kconfig.sched
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 80b94b0..b7fe36a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -301,6 +301,7 @@ config HIGHMEM source kernel/Kconfig.hz source kernel/Kconfig.preempt +source kernel/Kconfig.sched source "fs/Kconfig.binfmt" config HUGETLB_PAGE_SIZE_VARIABLE diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 45a8ed0..b0ef8b1 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,12 +11,12 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM
- select CPU_FREQ
- select CPU_FREQ_GOV_PERFORMANCE
- select CPU_FREQ_GOV_POWERSAVE
- select CPU_FREQ_GOV_USERSPACE
- select CPU_FREQ_GOV_ONDEMAND
- select CPU_FREQ_GOV_CONSERVATIVE
- select CPU_FREQ if !SCHED_POWER
- select CPU_FREQ_GOV_PERFORMANCE if CPU_FREQ
- select CPU_FREQ_GOV_POWERSAVE if CPU_FREQ
- select CPU_FREQ_GOV_USERSPACE if CPU_FREQ
- select CPU_FREQ_GOV_ONDEMAND if CPU_FREQ
- select CPU_FREQ_GOV_CONSERVATIVE if CPU_FREQ select PPC_DOORBELL default y
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index ffe350f..8976fd6 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -2,6 +2,7 @@ menu "CPU Frequency scaling" config CPU_FREQ bool "CPU Frequency scaling"
- depends on !SCHED_POWER help CPU Frequency scaling allows you to change the clock speed of CPUs on the fly. This is a nice method to save power, because
@@ -12,6 +13,7 @@ config CPU_FREQ (see below) after boot, or use a userspace tool. For details, take a look at file:Documentation/cpu-freq.
This feature will turn off if power aware scheduling is enabled.
If in doubt, say N. diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 32748c3..2c4ac79 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -3,6 +3,7 @@ menu "CPU Idle" config CPU_IDLE bool "CPU idle PM support" default y if ACPI || PPC_PSERIES
- depends on !SCHED_POWER select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE) select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) help
@@ -11,6 +12,7 @@ config CPU_IDLE governors that can be swapped during runtime. If you're using an ACPI-enabled platform, you should say Y here.
This feature will turn off if power aware scheduling is enabled.
if CPU_IDLE diff --git a/kernel/Kconfig.sched b/kernel/Kconfig.sched new file mode 100644 index 0000000..374454c --- /dev/null +++ b/kernel/Kconfig.sched @@ -0,0 +1,11 @@ +menu "Power Aware Scheduling"
+config SCHED_POWER
bool "Power Aware Scheduler"
default n
help
Select this to enable the new power aware scheduler.
+endmenu
On 08/18/2014 09:09 PM, Nicolas Pitre wrote:
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
As a first step towards improving the power awareness of the scheduler, this patch enables a "dumb" state where all power management is turned off. Whatever additionally we put into the kernel for cpu power management must do better than this in terms of performance as well as powersavings. This will enable us to benchmark and optimize the power aware scheduler from scratch.If we are to benchmark it against the performance of the existing design, we will get sufficiently distracted by the performance numbers and get steered away from a sane design.
I understand your goal here, but people *will* compare performance between the old and the new design anyway. So I think it would be a better approach to simply let the existing code be and create a new scheduler-based governor that can be swapped with the existing ones at run time. Eventually we'll want average users to test and compare this, and asking them to recompile a second kernel and reboot between them might get unwieldy to many people.
And by allowing both to coexist at run time, we're making sure both the old and the new code are built helping not breaking the old code. And that will also cut down on the number of #ifdefs in many places.
In other words, CONFIG_SCHED_POWER is needed to select the scheduler based governor but it shouldn't force the existing code disabled.
I don't think I understand you here. So are you proposing a runtime switch like a sysfs interface instead of a config switch? Wouldn't that be unwise given that its a complete turnaround of the behavior kernel after the switch? I agree that the first patch is a dummy patch, its meant to ensure that we have *atleast* the power efficiency that this patch brings in. Of course after that point this patch is a no-op. In fact the subsequent patches will mitigate the effect of this.
Regards Preeti U Murthy
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com
On Mon, 18 Aug 2014, Preeti U Murthy wrote:
On 08/18/2014 09:09 PM, Nicolas Pitre wrote:
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
As a first step towards improving the power awareness of the scheduler, this patch enables a "dumb" state where all power management is turned off. Whatever additionally we put into the kernel for cpu power management must do better than this in terms of performance as well as powersavings. This will enable us to benchmark and optimize the power aware scheduler from scratch.If we are to benchmark it against the performance of the existing design, we will get sufficiently distracted by the performance numbers and get steered away from a sane design.
I understand your goal here, but people *will* compare performance between the old and the new design anyway. So I think it would be a better approach to simply let the existing code be and create a new scheduler-based governor that can be swapped with the existing ones at run time. Eventually we'll want average users to test and compare this, and asking them to recompile a second kernel and reboot between them might get unwieldy to many people.
And by allowing both to coexist at run time, we're making sure both the old and the new code are built helping not breaking the old code. And that will also cut down on the number of #ifdefs in many places.
In other words, CONFIG_SCHED_POWER is needed to select the scheduler based governor but it shouldn't force the existing code disabled.
I don't think I understand you here. So are you proposing a runtime switch like a sysfs interface instead of a config switch?
Absolutely.
And looking at drivers/cpuidle/sysfs.c:store_current_governor() it seems that the facility is there already.
Wouldn't that be unwise given that its a complete turnaround of the behavior kernel after the switch?
Oh sure. This is like changing cpufreq governors at run time. But people should know what they're playing with and that system behavior changes are expected.
I agree that the first patch is a dummy patch, its meant to ensure that we have *atleast* the power efficiency that this patch brings in. Of course after that point this patch is a no-op. In fact the subsequent patches will mitigate the effect of this.
Still, allowing runtime switches between the legacy governors and the in-scheduler governor will greatly facilitate benchmarking.
And since our goal is to surpass the legacy governors then we should set it as our reference mark from the start.
Nicolas
The goal of the power aware scheduling design is to integrate all policy, metrics and averaging into the scheduler. Today the cpu power management is fragmented and hence inconsistent.
As a first step towards this integration, rid the cpuidle state management of the governors. Retain only the cpuidle driver in the cpu idle susbsystem which acts as an interface between the scheduler and low level platform specific cpuidle drivers. For all decision making around selection of idle states,the cpuidle driver falls back to the scheduler.
The current algorithm for idle state selection is the same as the logic used by the menu governor. However going ahead the heuristics will be tuned and improved upon with metrics better known to the scheduler.
Note: cpufrequency is still left disabled when CONFIG_SCHED_POWER is selected.
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
drivers/cpuidle/Kconfig | 12 + drivers/cpuidle/cpuidle-powernv.c | 2 drivers/cpuidle/cpuidle.c | 65 ++++- include/linux/sched.h | 9 + kernel/sched/Makefile | 1 kernel/sched/power.c | 480 +++++++++++++++++++++++++++++++++++++ 6 files changed, 554 insertions(+), 15 deletions(-) create mode 100644 kernel/sched/power.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 2c4ac79..4fa4cb1 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -3,16 +3,14 @@ menu "CPU Idle" config CPU_IDLE bool "CPU idle PM support" default y if ACPI || PPC_PSERIES - depends on !SCHED_POWER - select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE) - select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) + select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER) + select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER) help CPU idle is a generic framework for supporting software-controlled idle processor power management. It includes modular cross-platform governors that can be swapped during runtime.
If you're using an ACPI-enabled platform, you should say Y here. - This feature will turn off if power aware scheduling is enabled.
if CPU_IDLE
@@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS config CPU_IDLE_GOV_LADDER bool "Ladder governor (for periodic timer tick)" default y + depends on !SCHED_POWER + help + This feature will turn off if power aware scheduling is enabled.
config CPU_IDLE_GOV_MENU bool "Menu governor (for tickless system)" default y + depends on !SCHED_POWER + help + This feature will turn off if power aware scheduling is enabled.
menu "ARM CPU Idle Drivers" depends on ARM diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index fa79392..95ef533 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, unsigned long new_lpcr;
if (powersave_nap < 2) - return; + return 0; if (unlikely(system_state < SYSTEM_RUNNING)) return index;
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index ee9df5e..38fb213 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, return entered_state; }
+#ifdef CONFIG_SCHED_POWER +static int __cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + return cpuidle_sched_select(drv, dev); +} +#else +static int __cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + return cpuidle_curr_governor->select(drv, dev); +} +#endif /** * cpuidle_select - ask the cpuidle framework to choose an idle state * @@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) if (unlikely(use_deepest_state)) return cpuidle_find_deepest_state(drv, dev);
- return cpuidle_curr_governor->select(drv, dev); + return __cpuidle_select(drv, dev); }
/** @@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, return cpuidle_enter_state(dev, drv, index); }
+#ifdef CONFIG_SCHED_POWER +static void __cpuidle_reflect(struct cpuidle_device *dev, int index) +{ + cpuidle_sched_reflect(dev, index); +} +#else +static void __cpuidle_reflect(struct cpuidle_device *dev, int index) +{ + if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state)) + cpuidle_curr_governor->reflect(dev, index); +} +#endif /** * cpuidle_reflect - tell the underlying governor what was the state * we were in @@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, */ void cpuidle_reflect(struct cpuidle_device *dev, int index) { - if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state)) - cpuidle_curr_governor->reflect(dev, index); + __cpuidle_reflect(dev, index); }
/** @@ -265,6 +289,28 @@ void cpuidle_resume(void) mutex_unlock(&cpuidle_lock); }
+#ifdef CONFIG_SCHED_POWER +static int cpuidle_check_governor(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int enable) +{ + if (enable) + return cpuidle_sched_enable_device(drv, dev); + else + return 0; +} +#else +static int cpuidle_check_governor(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int enable) +{ + if (!cpuidle_curr_governor) + return -EIO; + + if (enable && cpuidle_curr_governor->enable) + return cpuidle_curr_governor->enable(drv, dev); + else if (cpuidle_curr_governor->disable) + cpuidle_curr_governor->disable(drv, dev); +} +#endif /** * cpuidle_enable_device - enables idle PM for a CPU * @dev: the CPU @@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
drv = cpuidle_get_cpu_driver(dev);
- if (!drv || !cpuidle_curr_governor) + if (!drv) return -EIO;
if (!dev->registered) @@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev) if (ret) return ret;
- if (cpuidle_curr_governor->enable && - (ret = cpuidle_curr_governor->enable(drv, dev))) + ret = cpuidle_check_governor(drv, dev, 1); + if (ret) goto fail_sysfs;
smp_wmb(); @@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev) if (!dev || !dev->enabled) return;
- if (!drv || !cpuidle_curr_governor) + if (!drv) return; - + dev->enabled = 0;
- if (cpuidle_curr_governor->disable) - cpuidle_curr_governor->disable(drv, dev); + cpuidle_check_governor(drv, dev, 0);
cpuidle_remove_device_sysfs(dev); enabled_devices--; diff --git a/include/linux/sched.h b/include/linux/sched.h index 7c19d55..5dd99b5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -26,6 +26,7 @@ struct sched_param { #include <linux/nodemask.h> #include <linux/mm_types.h> #include <linux/preempt_mask.h> +#include <linux/cpuidle.h>
#include <asm/page.h> #include <asm/ptrace.h> @@ -846,6 +847,14 @@ enum cpu_idle_type { CPU_MAX_IDLE_TYPES };
+#ifdef CONFIG_SCHED_POWER +extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index); +extern int cpuidle_sched_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev); +extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv, + struct cpuidle_device *dev); +#endif + /* * Increase resolution of cpu_capacity calculations */ diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index ab32b7b..5b8e469 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_SCHED_POWER) += power.o diff --git a/kernel/sched/power.c b/kernel/sched/power.c new file mode 100644 index 0000000..63c9276 --- /dev/null +++ b/kernel/sched/power.c @@ -0,0 +1,480 @@ +/* + * power.c - the power aware scheduler + * + * Author: + * Preeti U. Murthy preeti@linux.vnet.ibm.com + * + * This code is a replica of drivers/cpuidle/governors/menu.c + * To make the transition to power aware scheduler away from + * the cpuidle governor model easy, we do exactly what the + * governors do for now. Going ahead the heuristics will be + * tuned and improved upon. + * + * This code is licenced under the GPL version 2 as described + * in the COPYING file that acompanies the Linux Kernel. + */ + +#include <linux/kernel.h> +#include <linux/cpuidle.h> +#include <linux/pm_qos.h> +#include <linux/time.h> +#include <linux/ktime.h> +#include <linux/hrtimer.h> +#include <linux/tick.h> +#include <linux/sched.h> +#include <linux/math64.h> +#include <linux/module.h> + +/* + * Please note when changing the tuning values: + * If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of + * a scaling operation multiplication may overflow on 32 bit platforms. + * In that case, #define RESOLUTION as ULL to get 64 bit result: + * #define RESOLUTION 1024ULL + * + * The default values do not overflow. + */ +#define BUCKETS 12 +#define INTERVALS 8 +#define RESOLUTION 1024 +#define DECAY 8 +#define MAX_INTERESTING 50000 + + +/* + * Concepts and ideas behind the power aware scheduler + * + * For the power aware scheduler, there are 3 decision factors for picking a C + * state: + * 1) Energy break even point + * 2) Performance impact + * 3) Latency tolerance (from pmqos infrastructure) + * These these three factors are treated independently. + * + * Energy break even point + * ----------------------- + * C state entry and exit have an energy cost, and a certain amount of time in + * the C state is required to actually break even on this cost. CPUIDLE + * provides us this duration in the "target_residency" field. So all that we + * need is a good prediction of how long we'll be idle. Like the traditional + * governors, we start with the actual known "next timer event" time. + * + * Since there are other source of wakeups (interrupts for example) than + * the next timer event, this estimation is rather optimistic. To get a + * more realistic estimate, a correction factor is applied to the estimate, + * that is based on historic behavior. For example, if in the past the actual + * duration always was 50% of the next timer tick, the correction factor will + * be 0.5. + * + * power aware scheduler uses a running average for this correction factor, + * however it uses a set of factors, not just a single factor. This stems from + * the realization that the ratio is dependent on the order of magnitude of the + * expected duration; if we expect 500 milliseconds of idle time the likelihood of + * getting an interrupt very early is much higher than if we expect 50 micro + * seconds of idle time. A second independent factor that has big impact on + * the actual factor is if there is (disk) IO outstanding or not. + * (as a special twist, we consider every sleep longer than 50 milliseconds + * as perfect; there are no power gains for sleeping longer than this) + * + * For these two reasons we keep an array of 12 independent factors, that gets + * indexed based on the magnitude of the expected duration as well as the + * "is IO outstanding" property. + * + * Repeatable-interval-detector + * ---------------------------- + * There are some cases where "next timer" is a completely unusable predictor: + * Those cases where the interval is fixed, for example due to hardware + * interrupt mitigation, but also due to fixed transfer rate devices such as + * mice. + * For this, we use a different predictor: We track the duration of the last 8 + * intervals and if the stand deviation of these 8 intervals is below a + * threshold value, we use the average of these intervals as prediction. + * + * Limiting Performance Impact + * --------------------------- + * C states, especially those with large exit latencies, can have a real + * noticeable impact on workloads, which is not acceptable for most sysadmins, + * and in addition, less performance has a power price of its own. + * + * As a general rule of thumb, power aware sched assumes that the following + * heuristic holds: + * The busier the system, the less impact of C states is acceptable + * + * This rule-of-thumb is implemented using a performance-multiplier: + * If the exit latency times the performance multiplier is longer than + * the predicted duration, the C state is not considered a candidate + * for selection due to a too high performance impact. So the higher + * this multiplier is, the longer we need to be idle to pick a deep C + * state, and thus the less likely a busy CPU will hit such a deep + * C state. + * + * Two factors are used in determing this multiplier: + * a value of 10 is added for each point of "per cpu load average" we have. + * a value of 5 points is added for each process that is waiting for + * IO on this CPU. + * (these values are experimentally determined) + * + * The load average factor gives a longer term (few seconds) input to the + * decision, while the iowait value gives a cpu local instantanious input. + * The iowait factor may look low, but realize that this is also already + * represented in the system load average. + * + */ + +struct sched_cpuidle_info { + int last_state_idx; + int needs_update; + + unsigned int next_timer_us; + unsigned int predicted_us; + unsigned int bucket; + unsigned int correction_factor[BUCKETS]; + unsigned int intervals[INTERVALS]; + int interval_ptr; +}; + + +#define LOAD_INT(x) ((x) >> FSHIFT) +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) + +static int get_loadavg(void) +{ + unsigned long this = this_cpu_load(); + + + return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10; +} + +static inline int which_bucket(unsigned int duration) +{ + int bucket = 0; + + /* + * We keep two groups of stats; one with no + * IO pending, one without. + * This allows us to calculate + * E(duration)|iowait + */ + if (nr_iowait_cpu(smp_processor_id())) + bucket = BUCKETS/2; + + if (duration < 10) + return bucket; + if (duration < 100) + return bucket + 1; + if (duration < 1000) + return bucket + 2; + if (duration < 10000) + return bucket + 3; + if (duration < 100000) + return bucket + 4; + return bucket + 5; +} + +/* + * Return a multiplier for the exit latency that is intended + * to take performance requirements into account. + * The more performance critical we estimate the system + * to be, the higher this multiplier, and thus the higher + * the barrier to go to an expensive C state. + */ +static inline int performance_multiplier(void) +{ + int mult = 1; + + /* for higher loadavg, we are more reluctant */ + + mult += 2 * get_loadavg(); + + /* for IO wait tasks (per cpu!) we add 5x each */ + mult += 10 * nr_iowait_cpu(smp_processor_id()); + + return mult; +} + +static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info ); + +static void cpuidle_sched_update(struct cpuidle_driver *drv, + struct cpuidle_device *dev); + +/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */ +static u64 div_round64(u64 dividend, u32 divisor) +{ + return div_u64(dividend + (divisor / 2), divisor); +} + +/* + * Try detecting repeating patterns by keeping track of the last 8 + * intervals, and checking if the standard deviation of that set + * of points is below a threshold. If it is... then use the + * average of these 8 points as the estimated value. + */ +static void get_typical_interval(struct sched_cpuidle_info *data) +{ + int i, divisor; + unsigned int max, thresh; + uint64_t avg, stddev; + + thresh = UINT_MAX; /* Discard outliers above this value */ + +again: + + /* First calculate the average of past intervals */ + max = 0; + avg = 0; + divisor = 0; + for (i = 0; i < INTERVALS; i++) { + unsigned int value = data->intervals[i]; + if (value <= thresh) { + avg += value; + divisor++; + if (value > max) + max = value; + } + } + do_div(avg, divisor); + + /* Then try to determine standard deviation */ + stddev = 0; + for (i = 0; i < INTERVALS; i++) { + unsigned int value = data->intervals[i]; + if (value <= thresh) { + int64_t diff = value - avg; + stddev += diff * diff; + } + } + do_div(stddev, divisor); + /* + * The typical interval is obtained when standard deviation is small + * or standard deviation is small compared to the average interval. + * + * int_sqrt() formal parameter type is unsigned long. When the + * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor) + * the resulting squared standard deviation exceeds the input domain + * of int_sqrt on platforms where unsigned long is 32 bits in size. + * In such case reject the candidate average. + * + * Use this result only if there is no timer to wake us up sooner. + */ + if (likely(stddev <= ULONG_MAX)) { + stddev = int_sqrt(stddev); + if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3)) + || stddev <= 20) { + if (data->next_timer_us > avg) + data->predicted_us = avg; + return; + } + } + + /* + * If we have outliers to the upside in our distribution, discard + * those by setting the threshold to exclude these outliers, then + * calculate the average and standard deviation again. Once we get + * down to the bottom 3/4 of our samples, stop excluding samples. + * + * This can deal with workloads that have long pauses interspersed + * with sporadic activity with a bunch of short pauses. + */ + if ((divisor * 4) <= INTERVALS * 3) + return; + + thresh = max - 1; + goto again; +} + +/** + * cpuidle_sched_select - selects the next idle state to enter + * @drv: cpuidle driver containing state data + * @dev: the CPU + */ +int cpuidle_sched_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info); + int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY); + int i; + unsigned int interactivity_req; + struct timespec t; + + if (data->needs_update) { + cpuidle_sched_update(drv, dev); + data->needs_update = 0; + } + + data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1; + + /* Special case when user has set very strict latency requirement */ + if (unlikely(latency_req == 0)) + return 0; + + /* determine the expected residency time, round up */ + t = ktime_to_timespec(tick_nohz_get_sleep_length()); + data->next_timer_us = + t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC; + + + data->bucket = which_bucket(data->next_timer_us); + + /* + * Force the result of multiplication to be 64 bits even if both + * operands are 32 bits. + * Make sure to round up for half microseconds. + */ + data->predicted_us = div_round64((uint64_t)data->next_timer_us * + data->correction_factor[data->bucket], + RESOLUTION * DECAY); + + get_typical_interval(data); + + /* + * Performance multiplier defines a minimum predicted idle + * duration / latency ratio. Adjust the latency limit if + * necessary. + */ + interactivity_req = data->predicted_us / performance_multiplier(); + if (latency_req > interactivity_req) + latency_req = interactivity_req; + + /* + * We want to default to C1 (hlt), not to busy polling + * unless the timer is happening really really soon. + */ + if (data->next_timer_us > 5 && + !drv->states[CPUIDLE_DRIVER_STATE_START].disabled && + dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0) + data->last_state_idx = CPUIDLE_DRIVER_STATE_START; + + /* + * Find the idle state with the lowest power while satisfying + * our constraints. + */ + for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { + struct cpuidle_state *s = &drv->states[i]; + struct cpuidle_state_usage *su = &dev->states_usage[i]; + + if (s->disabled || su->disable) + continue; + if (s->target_residency > data->predicted_us) + continue; + if (s->exit_latency > latency_req) + continue; + + data->last_state_idx = i; + } + + return data->last_state_idx; +} + +/** + * cpuidle_sched_reflect - records that data structures need update + * @dev: the CPU + * @index: the index of actual entered state + * + * NOTE: it's important to be fast here because this operation will add to + * the overall exit latency. + */ +void cpuidle_sched_reflect(struct cpuidle_device *dev, int index) +{ + struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info); + data->last_state_idx = index; + if (index >= 0) + data->needs_update = 1; +} + +/** + * cpuidle_sched_update - attempts to guess what happened after entry + * @drv: cpuidle driver containing state data + * @dev: the CPU + */ +static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) +{ + struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info); + int last_idx = data->last_state_idx; + struct cpuidle_state *target = &drv->states[last_idx]; + unsigned int measured_us; + unsigned int new_factor; + + /* + * Try to figure out how much time passed between entry to low + * power state and occurrence of the wakeup event. + * + * If the entered idle state didn't support residency measurements, + * we are basically lost in the dark how much time passed. + * As a compromise, assume we slept for the whole expected time. + * + * Any measured amount of time will include the exit latency. + * Since we are interested in when the wakeup begun, not when it + * was completed, we must subtract the exit latency. However, if + * the measured amount of time is less than the exit latency, + * assume the state was never reached and the exit latency is 0. + */ + if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) { + /* Use timer value as is */ + measured_us = data->next_timer_us; + + } else { + /* Use measured value */ + measured_us = cpuidle_get_last_residency(dev); + + /* Deduct exit latency */ + if (measured_us > target->exit_latency) + measured_us -= target->exit_latency; + + /* Make sure our coefficients do not exceed unity */ + if (measured_us > data->next_timer_us) + measured_us = data->next_timer_us; + } + + /* Update our correction ratio */ + new_factor = data->correction_factor[data->bucket]; + new_factor -= new_factor / DECAY; + + if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING) + new_factor += RESOLUTION * measured_us / data->next_timer_us; + else + /* + * we were idle so long that we count it as a perfect + * prediction + */ + new_factor += RESOLUTION; + + /* + * We don't want 0 as factor; we always want at least + * a tiny bit of estimated time. Fortunately, due to rounding, + * new_factor will stay nonzero regardless of measured_us values + * and the compiler can eliminate this test as long as DECAY > 1. + */ + if (DECAY == 1 && unlikely(new_factor == 0)) + new_factor = 1; + + data->correction_factor[data->bucket] = new_factor; + + /* update the repeating-pattern data */ + data->intervals[data->interval_ptr++] = measured_us; + if (data->interval_ptr >= INTERVALS) + data->interval_ptr = 0; +} + +/** + * cpuidle_sched_enable_device - scans a CPU's states and does setup + * @drv: cpuidle driver + * @dev: the CPU + */ +int cpuidle_sched_enable_device(struct cpuidle_driver *drv, + struct cpuidle_device *dev) +{ + struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu); + int i; + + memset(data, 0, sizeof(struct sched_cpuidle_info)); + + /* + * if the correction factor is 0 (eg first time init or cpu hotplug + * etc), we actually want to start out with a unity factor. + */ + for(i = 0; i < BUCKETS; i++) + data->correction_factor[i] = RESOLUTION * DECAY; + + return 0; +} +
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
The goal of the power aware scheduling design is to integrate all policy, metrics and averaging into the scheduler. Today the cpu power management is fragmented and hence inconsistent.
As a first step towards this integration, rid the cpuidle state management of the governors. Retain only the cpuidle driver in the cpu idle susbsystem which acts as an interface between the scheduler and low level platform specific cpuidle drivers. For all decision making around selection of idle states,the cpuidle driver falls back to the scheduler.
The current algorithm for idle state selection is the same as the logic used by the menu governor. However going ahead the heuristics will be tuned and improved upon with metrics better known to the scheduler.
I'd strongly suggest a different approach here. Instead of copying the menu governor code and tweaking it afterwards, it would be cleaner to literally start from scratch with a new governor. Said new governor would grow inside the scheduler with more design freedom instead of being strapped on the side.
By copying existing code, the chance for cruft to remain for a long time is close to 100%. We already have one copy of it, let's keep it working and start afresh instead.
By starting clean it is way easier to explain and justify additions to a new design than convincing ourselves about the removal of no longer needed pieces from a legacy design.
Note: cpufrequency is still left disabled when CONFIG_SCHED_POWER is selected.
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com
drivers/cpuidle/Kconfig | 12 + drivers/cpuidle/cpuidle-powernv.c | 2 drivers/cpuidle/cpuidle.c | 65 ++++- include/linux/sched.h | 9 + kernel/sched/Makefile | 1 kernel/sched/power.c | 480 +++++++++++++++++++++++++++++++++++++ 6 files changed, 554 insertions(+), 15 deletions(-) create mode 100644 kernel/sched/power.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 2c4ac79..4fa4cb1 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -3,16 +3,14 @@ menu "CPU Idle" config CPU_IDLE bool "CPU idle PM support" default y if ACPI || PPC_PSERIES
- depends on !SCHED_POWER
- select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
- select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
- select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER)
- select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER) help CPU idle is a generic framework for supporting software-controlled idle processor power management. It includes modular cross-platform governors that can be swapped during runtime.
If you're using an ACPI-enabled platform, you should say Y here.
This feature will turn off if power aware scheduling is enabled.
if CPU_IDLE @@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS config CPU_IDLE_GOV_LADDER bool "Ladder governor (for periodic timer tick)" default y
- depends on !SCHED_POWER
- help
This feature will turn off if power aware scheduling is enabled.
config CPU_IDLE_GOV_MENU bool "Menu governor (for tickless system)" default y
- depends on !SCHED_POWER
- help
This feature will turn off if power aware scheduling is enabled.
menu "ARM CPU Idle Drivers" depends on ARM diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index fa79392..95ef533 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, unsigned long new_lpcr; if (powersave_nap < 2)
return;
if (unlikely(system_state < SYSTEM_RUNNING)) return index;return 0;
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index ee9df5e..38fb213 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, return entered_state; } +#ifdef CONFIG_SCHED_POWER +static int __cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- return cpuidle_sched_select(drv, dev);
+} +#else +static int __cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- return cpuidle_curr_governor->select(drv, dev);
+} +#endif /**
- cpuidle_select - ask the cpuidle framework to choose an idle state
@@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) if (unlikely(use_deepest_state)) return cpuidle_find_deepest_state(drv, dev);
- return cpuidle_curr_governor->select(drv, dev);
- return __cpuidle_select(drv, dev);
} /** @@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, return cpuidle_enter_state(dev, drv, index); } +#ifdef CONFIG_SCHED_POWER +static void __cpuidle_reflect(struct cpuidle_device *dev, int index) +{
- cpuidle_sched_reflect(dev, index);
+} +#else +static void __cpuidle_reflect(struct cpuidle_device *dev, int index) +{
- if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
cpuidle_curr_governor->reflect(dev, index);
+} +#endif /**
- cpuidle_reflect - tell the underlying governor what was the state
- we were in
@@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, */ void cpuidle_reflect(struct cpuidle_device *dev, int index) {
- if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
cpuidle_curr_governor->reflect(dev, index);
- __cpuidle_reflect(dev, index);
} /** @@ -265,6 +289,28 @@ void cpuidle_resume(void) mutex_unlock(&cpuidle_lock); } +#ifdef CONFIG_SCHED_POWER +static int cpuidle_check_governor(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int enable)
+{
- if (enable)
return cpuidle_sched_enable_device(drv, dev);
- else
return 0;
+} +#else +static int cpuidle_check_governor(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int enable)
+{
- if (!cpuidle_curr_governor)
return -EIO;
- if (enable && cpuidle_curr_governor->enable)
return cpuidle_curr_governor->enable(drv, dev);
- else if (cpuidle_curr_governor->disable)
cpuidle_curr_governor->disable(drv, dev);
+} +#endif /**
- cpuidle_enable_device - enables idle PM for a CPU
- @dev: the CPU
@@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev) drv = cpuidle_get_cpu_driver(dev);
- if (!drv || !cpuidle_curr_governor)
- if (!drv) return -EIO;
if (!dev->registered) @@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev) if (ret) return ret;
- if (cpuidle_curr_governor->enable &&
(ret = cpuidle_curr_governor->enable(drv, dev)))
- ret = cpuidle_check_governor(drv, dev, 1);
- if (ret) goto fail_sysfs;
smp_wmb(); @@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev) if (!dev || !dev->enabled) return;
- if (!drv || !cpuidle_curr_governor)
- if (!drv) return;
- dev->enabled = 0;
- if (cpuidle_curr_governor->disable)
cpuidle_curr_governor->disable(drv, dev);
- cpuidle_check_governor(drv, dev, 0);
cpuidle_remove_device_sysfs(dev); enabled_devices--; diff --git a/include/linux/sched.h b/include/linux/sched.h index 7c19d55..5dd99b5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -26,6 +26,7 @@ struct sched_param { #include <linux/nodemask.h> #include <linux/mm_types.h> #include <linux/preempt_mask.h> +#include <linux/cpuidle.h> #include <asm/page.h> #include <asm/ptrace.h> @@ -846,6 +847,14 @@ enum cpu_idle_type { CPU_MAX_IDLE_TYPES }; +#ifdef CONFIG_SCHED_POWER +extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index); +extern int cpuidle_sched_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+#endif
/*
- Increase resolution of cpu_capacity calculations
*/ diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index ab32b7b..5b8e469 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_SCHED_POWER) += power.o diff --git a/kernel/sched/power.c b/kernel/sched/power.c new file mode 100644 index 0000000..63c9276 --- /dev/null +++ b/kernel/sched/power.c @@ -0,0 +1,480 @@ +/*
- power.c - the power aware scheduler
- Author:
Preeti U. Murthy <preeti@linux.vnet.ibm.com>
- This code is a replica of drivers/cpuidle/governors/menu.c
- To make the transition to power aware scheduler away from
- the cpuidle governor model easy, we do exactly what the
- governors do for now. Going ahead the heuristics will be
- tuned and improved upon.
- This code is licenced under the GPL version 2 as described
- in the COPYING file that acompanies the Linux Kernel.
- */
+#include <linux/kernel.h> +#include <linux/cpuidle.h> +#include <linux/pm_qos.h> +#include <linux/time.h> +#include <linux/ktime.h> +#include <linux/hrtimer.h> +#include <linux/tick.h> +#include <linux/sched.h> +#include <linux/math64.h> +#include <linux/module.h>
+/*
- Please note when changing the tuning values:
- If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of
- a scaling operation multiplication may overflow on 32 bit platforms.
- In that case, #define RESOLUTION as ULL to get 64 bit result:
- #define RESOLUTION 1024ULL
- The default values do not overflow.
- */
+#define BUCKETS 12 +#define INTERVALS 8 +#define RESOLUTION 1024 +#define DECAY 8 +#define MAX_INTERESTING 50000
+/*
- Concepts and ideas behind the power aware scheduler
- For the power aware scheduler, there are 3 decision factors for picking a C
- state:
- Energy break even point
- Performance impact
- Latency tolerance (from pmqos infrastructure)
- These these three factors are treated independently.
- Energy break even point
- C state entry and exit have an energy cost, and a certain amount of time in
- the C state is required to actually break even on this cost. CPUIDLE
- provides us this duration in the "target_residency" field. So all that we
- need is a good prediction of how long we'll be idle. Like the traditional
- governors, we start with the actual known "next timer event" time.
- Since there are other source of wakeups (interrupts for example) than
- the next timer event, this estimation is rather optimistic. To get a
- more realistic estimate, a correction factor is applied to the estimate,
- that is based on historic behavior. For example, if in the past the actual
- duration always was 50% of the next timer tick, the correction factor will
- be 0.5.
- power aware scheduler uses a running average for this correction factor,
- however it uses a set of factors, not just a single factor. This stems from
- the realization that the ratio is dependent on the order of magnitude of the
- expected duration; if we expect 500 milliseconds of idle time the likelihood of
- getting an interrupt very early is much higher than if we expect 50 micro
- seconds of idle time. A second independent factor that has big impact on
- the actual factor is if there is (disk) IO outstanding or not.
- (as a special twist, we consider every sleep longer than 50 milliseconds
- as perfect; there are no power gains for sleeping longer than this)
- For these two reasons we keep an array of 12 independent factors, that gets
- indexed based on the magnitude of the expected duration as well as the
- "is IO outstanding" property.
- Repeatable-interval-detector
- There are some cases where "next timer" is a completely unusable predictor:
- Those cases where the interval is fixed, for example due to hardware
- interrupt mitigation, but also due to fixed transfer rate devices such as
- mice.
- For this, we use a different predictor: We track the duration of the last 8
- intervals and if the stand deviation of these 8 intervals is below a
- threshold value, we use the average of these intervals as prediction.
- Limiting Performance Impact
- C states, especially those with large exit latencies, can have a real
- noticeable impact on workloads, which is not acceptable for most sysadmins,
- and in addition, less performance has a power price of its own.
- As a general rule of thumb, power aware sched assumes that the following
- heuristic holds:
The busier the system, the less impact of C states is acceptable
- This rule-of-thumb is implemented using a performance-multiplier:
- If the exit latency times the performance multiplier is longer than
- the predicted duration, the C state is not considered a candidate
- for selection due to a too high performance impact. So the higher
- this multiplier is, the longer we need to be idle to pick a deep C
- state, and thus the less likely a busy CPU will hit such a deep
- C state.
- Two factors are used in determing this multiplier:
- a value of 10 is added for each point of "per cpu load average" we have.
- a value of 5 points is added for each process that is waiting for
- IO on this CPU.
- (these values are experimentally determined)
- The load average factor gives a longer term (few seconds) input to the
- decision, while the iowait value gives a cpu local instantanious input.
- The iowait factor may look low, but realize that this is also already
- represented in the system load average.
- */
+struct sched_cpuidle_info {
- int last_state_idx;
- int needs_update;
- unsigned int next_timer_us;
- unsigned int predicted_us;
- unsigned int bucket;
- unsigned int correction_factor[BUCKETS];
- unsigned int intervals[INTERVALS];
- int interval_ptr;
+};
+#define LOAD_INT(x) ((x) >> FSHIFT) +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+static int get_loadavg(void) +{
- unsigned long this = this_cpu_load();
- return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
+}
+static inline int which_bucket(unsigned int duration) +{
- int bucket = 0;
- /*
* We keep two groups of stats; one with no
* IO pending, one without.
* This allows us to calculate
* E(duration)|iowait
*/
- if (nr_iowait_cpu(smp_processor_id()))
bucket = BUCKETS/2;
- if (duration < 10)
return bucket;
- if (duration < 100)
return bucket + 1;
- if (duration < 1000)
return bucket + 2;
- if (duration < 10000)
return bucket + 3;
- if (duration < 100000)
return bucket + 4;
- return bucket + 5;
+}
+/*
- Return a multiplier for the exit latency that is intended
- to take performance requirements into account.
- The more performance critical we estimate the system
- to be, the higher this multiplier, and thus the higher
- the barrier to go to an expensive C state.
- */
+static inline int performance_multiplier(void) +{
- int mult = 1;
- /* for higher loadavg, we are more reluctant */
- mult += 2 * get_loadavg();
- /* for IO wait tasks (per cpu!) we add 5x each */
- mult += 10 * nr_iowait_cpu(smp_processor_id());
- return mult;
+}
+static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info );
+static void cpuidle_sched_update(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */ +static u64 div_round64(u64 dividend, u32 divisor) +{
- return div_u64(dividend + (divisor / 2), divisor);
+}
+/*
- Try detecting repeating patterns by keeping track of the last 8
- intervals, and checking if the standard deviation of that set
- of points is below a threshold. If it is... then use the
- average of these 8 points as the estimated value.
- */
+static void get_typical_interval(struct sched_cpuidle_info *data) +{
- int i, divisor;
- unsigned int max, thresh;
- uint64_t avg, stddev;
- thresh = UINT_MAX; /* Discard outliers above this value */
+again:
- /* First calculate the average of past intervals */
- max = 0;
- avg = 0;
- divisor = 0;
- for (i = 0; i < INTERVALS; i++) {
unsigned int value = data->intervals[i];
if (value <= thresh) {
avg += value;
divisor++;
if (value > max)
max = value;
}
- }
- do_div(avg, divisor);
- /* Then try to determine standard deviation */
- stddev = 0;
- for (i = 0; i < INTERVALS; i++) {
unsigned int value = data->intervals[i];
if (value <= thresh) {
int64_t diff = value - avg;
stddev += diff * diff;
}
- }
- do_div(stddev, divisor);
- /*
* The typical interval is obtained when standard deviation is small
* or standard deviation is small compared to the average interval.
*
* int_sqrt() formal parameter type is unsigned long. When the
* greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
* the resulting squared standard deviation exceeds the input domain
* of int_sqrt on platforms where unsigned long is 32 bits in size.
* In such case reject the candidate average.
*
* Use this result only if there is no timer to wake us up sooner.
*/
- if (likely(stddev <= ULONG_MAX)) {
stddev = int_sqrt(stddev);
if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
|| stddev <= 20) {
if (data->next_timer_us > avg)
data->predicted_us = avg;
return;
}
- }
- /*
* If we have outliers to the upside in our distribution, discard
* those by setting the threshold to exclude these outliers, then
* calculate the average and standard deviation again. Once we get
* down to the bottom 3/4 of our samples, stop excluding samples.
*
* This can deal with workloads that have long pauses interspersed
* with sporadic activity with a bunch of short pauses.
*/
- if ((divisor * 4) <= INTERVALS * 3)
return;
- thresh = max - 1;
- goto again;
+}
+/**
- cpuidle_sched_select - selects the next idle state to enter
- @drv: cpuidle driver containing state data
- @dev: the CPU
- */
+int cpuidle_sched_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
- int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
- int i;
- unsigned int interactivity_req;
- struct timespec t;
- if (data->needs_update) {
cpuidle_sched_update(drv, dev);
data->needs_update = 0;
- }
- data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
- /* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
return 0;
- /* determine the expected residency time, round up */
- t = ktime_to_timespec(tick_nohz_get_sleep_length());
- data->next_timer_us =
t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
- data->bucket = which_bucket(data->next_timer_us);
- /*
* Force the result of multiplication to be 64 bits even if both
* operands are 32 bits.
* Make sure to round up for half microseconds.
*/
- data->predicted_us = div_round64((uint64_t)data->next_timer_us *
data->correction_factor[data->bucket],
RESOLUTION * DECAY);
- get_typical_interval(data);
- /*
* Performance multiplier defines a minimum predicted idle
* duration / latency ratio. Adjust the latency limit if
* necessary.
*/
- interactivity_req = data->predicted_us / performance_multiplier();
- if (latency_req > interactivity_req)
latency_req = interactivity_req;
- /*
* We want to default to C1 (hlt), not to busy polling
* unless the timer is happening really really soon.
*/
- if (data->next_timer_us > 5 &&
!drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
- /*
* Find the idle state with the lowest power while satisfying
* our constraints.
*/
- for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
struct cpuidle_state *s = &drv->states[i];
struct cpuidle_state_usage *su = &dev->states_usage[i];
if (s->disabled || su->disable)
continue;
if (s->target_residency > data->predicted_us)
continue;
if (s->exit_latency > latency_req)
continue;
data->last_state_idx = i;
- }
- return data->last_state_idx;
+}
+/**
- cpuidle_sched_reflect - records that data structures need update
- @dev: the CPU
- @index: the index of actual entered state
- NOTE: it's important to be fast here because this operation will add to
the overall exit latency.
- */
+void cpuidle_sched_reflect(struct cpuidle_device *dev, int index) +{
- struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
- data->last_state_idx = index;
- if (index >= 0)
data->needs_update = 1;
+}
+/**
- cpuidle_sched_update - attempts to guess what happened after entry
- @drv: cpuidle driver containing state data
- @dev: the CPU
- */
+static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) +{
- struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
- int last_idx = data->last_state_idx;
- struct cpuidle_state *target = &drv->states[last_idx];
- unsigned int measured_us;
- unsigned int new_factor;
- /*
* Try to figure out how much time passed between entry to low
* power state and occurrence of the wakeup event.
*
* If the entered idle state didn't support residency measurements,
* we are basically lost in the dark how much time passed.
* As a compromise, assume we slept for the whole expected time.
*
* Any measured amount of time will include the exit latency.
* Since we are interested in when the wakeup begun, not when it
* was completed, we must subtract the exit latency. However, if
* the measured amount of time is less than the exit latency,
* assume the state was never reached and the exit latency is 0.
*/
- if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) {
/* Use timer value as is */
measured_us = data->next_timer_us;
- } else {
/* Use measured value */
measured_us = cpuidle_get_last_residency(dev);
/* Deduct exit latency */
if (measured_us > target->exit_latency)
measured_us -= target->exit_latency;
/* Make sure our coefficients do not exceed unity */
if (measured_us > data->next_timer_us)
measured_us = data->next_timer_us;
- }
- /* Update our correction ratio */
- new_factor = data->correction_factor[data->bucket];
- new_factor -= new_factor / DECAY;
- if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
new_factor += RESOLUTION * measured_us / data->next_timer_us;
- else
/*
* we were idle so long that we count it as a perfect
* prediction
*/
new_factor += RESOLUTION;
- /*
* We don't want 0 as factor; we always want at least
* a tiny bit of estimated time. Fortunately, due to rounding,
* new_factor will stay nonzero regardless of measured_us values
* and the compiler can eliminate this test as long as DECAY > 1.
*/
- if (DECAY == 1 && unlikely(new_factor == 0))
new_factor = 1;
- data->correction_factor[data->bucket] = new_factor;
- /* update the repeating-pattern data */
- data->intervals[data->interval_ptr++] = measured_us;
- if (data->interval_ptr >= INTERVALS)
data->interval_ptr = 0;
+}
+/**
- cpuidle_sched_enable_device - scans a CPU's states and does setup
- @drv: cpuidle driver
- @dev: the CPU
- */
+int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu);
- int i;
- memset(data, 0, sizeof(struct sched_cpuidle_info));
- /*
* if the correction factor is 0 (eg first time init or cpu hotplug
* etc), we actually want to start out with a unity factor.
*/
- for(i = 0; i < BUCKETS; i++)
data->correction_factor[i] = RESOLUTION * DECAY;
- return 0;
+}
On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
The goal of the power aware scheduling design is to integrate all policy, metrics and averaging into the scheduler. Today the cpu power management is fragmented and hence inconsistent.
As a first step towards this integration, rid the cpuidle state management of the governors. Retain only the cpuidle driver in the cpu idle susbsystem which acts as an interface between the scheduler and low level platform specific cpuidle drivers. For all decision making around selection of idle states,the cpuidle driver falls back to the scheduler.
The current algorithm for idle state selection is the same as the logic used by the menu governor. However going ahead the heuristics will be tuned and improved upon with metrics better known to the scheduler.
I'd strongly suggest a different approach here. Instead of copying the menu governor code and tweaking it afterwards, it would be cleaner to literally start from scratch with a new governor. Said new governor would grow inside the scheduler with more design freedom instead of being strapped on the side.
By copying existing code, the chance for cruft to remain for a long time is close to 100%. We already have one copy of it, let's keep it working and start afresh instead.
By starting clean it is way easier to explain and justify additions to a new design than convincing ourselves about the removal of no longer needed pieces from a legacy design.
Ok. The reason I did it this way was that I did not find anything grossly wrong in the current cpuidle governor algorithm. Of course this can be improved but I did not see strong reasons to completely wipe it away. I see good scope to improve upon the existing algorithm with additional knowledge of *the idle states being mapped to scheduling domains*. This will in itself give us a better algorithm and does not mandate significant changes from the current algorithm. So I really don't see why we need to start from scratch.
The primary issue that I found was that with the goal being power aware scheduler we must ensure that the possibility of a governor getting registered with cpuidle to choose idle states no longer will exist. The reason being there is just *one entity who will take this decision and there is no option about it*. This patch intends to bring the focus to this specific detail.
Regards Preeti U Murthy
Note: cpufrequency is still left disabled when CONFIG_SCHED_POWER is selected.
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com
drivers/cpuidle/Kconfig | 12 + drivers/cpuidle/cpuidle-powernv.c | 2 drivers/cpuidle/cpuidle.c | 65 ++++- include/linux/sched.h | 9 + kernel/sched/Makefile | 1 kernel/sched/power.c | 480 +++++++++++++++++++++++++++++++++++++ 6 files changed, 554 insertions(+), 15 deletions(-) create mode 100644 kernel/sched/power.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 2c4ac79..4fa4cb1 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -3,16 +3,14 @@ menu "CPU Idle" config CPU_IDLE bool "CPU idle PM support" default y if ACPI || PPC_PSERIES
- depends on !SCHED_POWER
- select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
- select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
- select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER)
- select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER) help CPU idle is a generic framework for supporting software-controlled idle processor power management. It includes modular cross-platform governors that can be swapped during runtime.
If you're using an ACPI-enabled platform, you should say Y here.
This feature will turn off if power aware scheduling is enabled.
if CPU_IDLE @@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS config CPU_IDLE_GOV_LADDER bool "Ladder governor (for periodic timer tick)" default y
- depends on !SCHED_POWER
- help
This feature will turn off if power aware scheduling is enabled.
config CPU_IDLE_GOV_MENU bool "Menu governor (for tickless system)" default y
- depends on !SCHED_POWER
- help
This feature will turn off if power aware scheduling is enabled.
menu "ARM CPU Idle Drivers" depends on ARM diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index fa79392..95ef533 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, unsigned long new_lpcr; if (powersave_nap < 2)
return;
if (unlikely(system_state < SYSTEM_RUNNING)) return index;return 0;
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index ee9df5e..38fb213 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, return entered_state; } +#ifdef CONFIG_SCHED_POWER +static int __cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- return cpuidle_sched_select(drv, dev);
+} +#else +static int __cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- return cpuidle_curr_governor->select(drv, dev);
+} +#endif /**
- cpuidle_select - ask the cpuidle framework to choose an idle state
@@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) if (unlikely(use_deepest_state)) return cpuidle_find_deepest_state(drv, dev);
- return cpuidle_curr_governor->select(drv, dev);
- return __cpuidle_select(drv, dev);
} /** @@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, return cpuidle_enter_state(dev, drv, index); } +#ifdef CONFIG_SCHED_POWER +static void __cpuidle_reflect(struct cpuidle_device *dev, int index) +{
- cpuidle_sched_reflect(dev, index);
+} +#else +static void __cpuidle_reflect(struct cpuidle_device *dev, int index) +{
- if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
cpuidle_curr_governor->reflect(dev, index);
+} +#endif /**
- cpuidle_reflect - tell the underlying governor what was the state
- we were in
@@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, */ void cpuidle_reflect(struct cpuidle_device *dev, int index) {
- if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
cpuidle_curr_governor->reflect(dev, index);
- __cpuidle_reflect(dev, index);
} /** @@ -265,6 +289,28 @@ void cpuidle_resume(void) mutex_unlock(&cpuidle_lock); } +#ifdef CONFIG_SCHED_POWER +static int cpuidle_check_governor(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int enable)
+{
- if (enable)
return cpuidle_sched_enable_device(drv, dev);
- else
return 0;
+} +#else +static int cpuidle_check_governor(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int enable)
+{
- if (!cpuidle_curr_governor)
return -EIO;
- if (enable && cpuidle_curr_governor->enable)
return cpuidle_curr_governor->enable(drv, dev);
- else if (cpuidle_curr_governor->disable)
cpuidle_curr_governor->disable(drv, dev);
+} +#endif /**
- cpuidle_enable_device - enables idle PM for a CPU
- @dev: the CPU
@@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev) drv = cpuidle_get_cpu_driver(dev);
- if (!drv || !cpuidle_curr_governor)
- if (!drv) return -EIO;
if (!dev->registered) @@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev) if (ret) return ret;
- if (cpuidle_curr_governor->enable &&
(ret = cpuidle_curr_governor->enable(drv, dev)))
- ret = cpuidle_check_governor(drv, dev, 1);
- if (ret) goto fail_sysfs;
smp_wmb(); @@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev) if (!dev || !dev->enabled) return;
- if (!drv || !cpuidle_curr_governor)
- if (!drv) return;
- dev->enabled = 0;
- if (cpuidle_curr_governor->disable)
cpuidle_curr_governor->disable(drv, dev);
- cpuidle_check_governor(drv, dev, 0);
cpuidle_remove_device_sysfs(dev); enabled_devices--; diff --git a/include/linux/sched.h b/include/linux/sched.h index 7c19d55..5dd99b5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -26,6 +26,7 @@ struct sched_param { #include <linux/nodemask.h> #include <linux/mm_types.h> #include <linux/preempt_mask.h> +#include <linux/cpuidle.h> #include <asm/page.h> #include <asm/ptrace.h> @@ -846,6 +847,14 @@ enum cpu_idle_type { CPU_MAX_IDLE_TYPES }; +#ifdef CONFIG_SCHED_POWER +extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index); +extern int cpuidle_sched_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+#endif
/*
- Increase resolution of cpu_capacity calculations
*/ diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index ab32b7b..5b8e469 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_SCHED_POWER) += power.o diff --git a/kernel/sched/power.c b/kernel/sched/power.c new file mode 100644 index 0000000..63c9276 --- /dev/null +++ b/kernel/sched/power.c @@ -0,0 +1,480 @@ +/*
- power.c - the power aware scheduler
- Author:
Preeti U. Murthy <preeti@linux.vnet.ibm.com>
- This code is a replica of drivers/cpuidle/governors/menu.c
- To make the transition to power aware scheduler away from
- the cpuidle governor model easy, we do exactly what the
- governors do for now. Going ahead the heuristics will be
- tuned and improved upon.
- This code is licenced under the GPL version 2 as described
- in the COPYING file that acompanies the Linux Kernel.
- */
+#include <linux/kernel.h> +#include <linux/cpuidle.h> +#include <linux/pm_qos.h> +#include <linux/time.h> +#include <linux/ktime.h> +#include <linux/hrtimer.h> +#include <linux/tick.h> +#include <linux/sched.h> +#include <linux/math64.h> +#include <linux/module.h>
+/*
- Please note when changing the tuning values:
- If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of
- a scaling operation multiplication may overflow on 32 bit platforms.
- In that case, #define RESOLUTION as ULL to get 64 bit result:
- #define RESOLUTION 1024ULL
- The default values do not overflow.
- */
+#define BUCKETS 12 +#define INTERVALS 8 +#define RESOLUTION 1024 +#define DECAY 8 +#define MAX_INTERESTING 50000
+/*
- Concepts and ideas behind the power aware scheduler
- For the power aware scheduler, there are 3 decision factors for picking a C
- state:
- Energy break even point
- Performance impact
- Latency tolerance (from pmqos infrastructure)
- These these three factors are treated independently.
- Energy break even point
- C state entry and exit have an energy cost, and a certain amount of time in
- the C state is required to actually break even on this cost. CPUIDLE
- provides us this duration in the "target_residency" field. So all that we
- need is a good prediction of how long we'll be idle. Like the traditional
- governors, we start with the actual known "next timer event" time.
- Since there are other source of wakeups (interrupts for example) than
- the next timer event, this estimation is rather optimistic. To get a
- more realistic estimate, a correction factor is applied to the estimate,
- that is based on historic behavior. For example, if in the past the actual
- duration always was 50% of the next timer tick, the correction factor will
- be 0.5.
- power aware scheduler uses a running average for this correction factor,
- however it uses a set of factors, not just a single factor. This stems from
- the realization that the ratio is dependent on the order of magnitude of the
- expected duration; if we expect 500 milliseconds of idle time the likelihood of
- getting an interrupt very early is much higher than if we expect 50 micro
- seconds of idle time. A second independent factor that has big impact on
- the actual factor is if there is (disk) IO outstanding or not.
- (as a special twist, we consider every sleep longer than 50 milliseconds
- as perfect; there are no power gains for sleeping longer than this)
- For these two reasons we keep an array of 12 independent factors, that gets
- indexed based on the magnitude of the expected duration as well as the
- "is IO outstanding" property.
- Repeatable-interval-detector
- There are some cases where "next timer" is a completely unusable predictor:
- Those cases where the interval is fixed, for example due to hardware
- interrupt mitigation, but also due to fixed transfer rate devices such as
- mice.
- For this, we use a different predictor: We track the duration of the last 8
- intervals and if the stand deviation of these 8 intervals is below a
- threshold value, we use the average of these intervals as prediction.
- Limiting Performance Impact
- C states, especially those with large exit latencies, can have a real
- noticeable impact on workloads, which is not acceptable for most sysadmins,
- and in addition, less performance has a power price of its own.
- As a general rule of thumb, power aware sched assumes that the following
- heuristic holds:
The busier the system, the less impact of C states is acceptable
- This rule-of-thumb is implemented using a performance-multiplier:
- If the exit latency times the performance multiplier is longer than
- the predicted duration, the C state is not considered a candidate
- for selection due to a too high performance impact. So the higher
- this multiplier is, the longer we need to be idle to pick a deep C
- state, and thus the less likely a busy CPU will hit such a deep
- C state.
- Two factors are used in determing this multiplier:
- a value of 10 is added for each point of "per cpu load average" we have.
- a value of 5 points is added for each process that is waiting for
- IO on this CPU.
- (these values are experimentally determined)
- The load average factor gives a longer term (few seconds) input to the
- decision, while the iowait value gives a cpu local instantanious input.
- The iowait factor may look low, but realize that this is also already
- represented in the system load average.
- */
+struct sched_cpuidle_info {
- int last_state_idx;
- int needs_update;
- unsigned int next_timer_us;
- unsigned int predicted_us;
- unsigned int bucket;
- unsigned int correction_factor[BUCKETS];
- unsigned int intervals[INTERVALS];
- int interval_ptr;
+};
+#define LOAD_INT(x) ((x) >> FSHIFT) +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+static int get_loadavg(void) +{
- unsigned long this = this_cpu_load();
- return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
+}
+static inline int which_bucket(unsigned int duration) +{
- int bucket = 0;
- /*
* We keep two groups of stats; one with no
* IO pending, one without.
* This allows us to calculate
* E(duration)|iowait
*/
- if (nr_iowait_cpu(smp_processor_id()))
bucket = BUCKETS/2;
- if (duration < 10)
return bucket;
- if (duration < 100)
return bucket + 1;
- if (duration < 1000)
return bucket + 2;
- if (duration < 10000)
return bucket + 3;
- if (duration < 100000)
return bucket + 4;
- return bucket + 5;
+}
+/*
- Return a multiplier for the exit latency that is intended
- to take performance requirements into account.
- The more performance critical we estimate the system
- to be, the higher this multiplier, and thus the higher
- the barrier to go to an expensive C state.
- */
+static inline int performance_multiplier(void) +{
- int mult = 1;
- /* for higher loadavg, we are more reluctant */
- mult += 2 * get_loadavg();
- /* for IO wait tasks (per cpu!) we add 5x each */
- mult += 10 * nr_iowait_cpu(smp_processor_id());
- return mult;
+}
+static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info );
+static void cpuidle_sched_update(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
+/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */ +static u64 div_round64(u64 dividend, u32 divisor) +{
- return div_u64(dividend + (divisor / 2), divisor);
+}
+/*
- Try detecting repeating patterns by keeping track of the last 8
- intervals, and checking if the standard deviation of that set
- of points is below a threshold. If it is... then use the
- average of these 8 points as the estimated value.
- */
+static void get_typical_interval(struct sched_cpuidle_info *data) +{
- int i, divisor;
- unsigned int max, thresh;
- uint64_t avg, stddev;
- thresh = UINT_MAX; /* Discard outliers above this value */
+again:
- /* First calculate the average of past intervals */
- max = 0;
- avg = 0;
- divisor = 0;
- for (i = 0; i < INTERVALS; i++) {
unsigned int value = data->intervals[i];
if (value <= thresh) {
avg += value;
divisor++;
if (value > max)
max = value;
}
- }
- do_div(avg, divisor);
- /* Then try to determine standard deviation */
- stddev = 0;
- for (i = 0; i < INTERVALS; i++) {
unsigned int value = data->intervals[i];
if (value <= thresh) {
int64_t diff = value - avg;
stddev += diff * diff;
}
- }
- do_div(stddev, divisor);
- /*
* The typical interval is obtained when standard deviation is small
* or standard deviation is small compared to the average interval.
*
* int_sqrt() formal parameter type is unsigned long. When the
* greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
* the resulting squared standard deviation exceeds the input domain
* of int_sqrt on platforms where unsigned long is 32 bits in size.
* In such case reject the candidate average.
*
* Use this result only if there is no timer to wake us up sooner.
*/
- if (likely(stddev <= ULONG_MAX)) {
stddev = int_sqrt(stddev);
if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
|| stddev <= 20) {
if (data->next_timer_us > avg)
data->predicted_us = avg;
return;
}
- }
- /*
* If we have outliers to the upside in our distribution, discard
* those by setting the threshold to exclude these outliers, then
* calculate the average and standard deviation again. Once we get
* down to the bottom 3/4 of our samples, stop excluding samples.
*
* This can deal with workloads that have long pauses interspersed
* with sporadic activity with a bunch of short pauses.
*/
- if ((divisor * 4) <= INTERVALS * 3)
return;
- thresh = max - 1;
- goto again;
+}
+/**
- cpuidle_sched_select - selects the next idle state to enter
- @drv: cpuidle driver containing state data
- @dev: the CPU
- */
+int cpuidle_sched_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
- int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
- int i;
- unsigned int interactivity_req;
- struct timespec t;
- if (data->needs_update) {
cpuidle_sched_update(drv, dev);
data->needs_update = 0;
- }
- data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
- /* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
return 0;
- /* determine the expected residency time, round up */
- t = ktime_to_timespec(tick_nohz_get_sleep_length());
- data->next_timer_us =
t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
- data->bucket = which_bucket(data->next_timer_us);
- /*
* Force the result of multiplication to be 64 bits even if both
* operands are 32 bits.
* Make sure to round up for half microseconds.
*/
- data->predicted_us = div_round64((uint64_t)data->next_timer_us *
data->correction_factor[data->bucket],
RESOLUTION * DECAY);
- get_typical_interval(data);
- /*
* Performance multiplier defines a minimum predicted idle
* duration / latency ratio. Adjust the latency limit if
* necessary.
*/
- interactivity_req = data->predicted_us / performance_multiplier();
- if (latency_req > interactivity_req)
latency_req = interactivity_req;
- /*
* We want to default to C1 (hlt), not to busy polling
* unless the timer is happening really really soon.
*/
- if (data->next_timer_us > 5 &&
!drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
- /*
* Find the idle state with the lowest power while satisfying
* our constraints.
*/
- for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
struct cpuidle_state *s = &drv->states[i];
struct cpuidle_state_usage *su = &dev->states_usage[i];
if (s->disabled || su->disable)
continue;
if (s->target_residency > data->predicted_us)
continue;
if (s->exit_latency > latency_req)
continue;
data->last_state_idx = i;
- }
- return data->last_state_idx;
+}
+/**
- cpuidle_sched_reflect - records that data structures need update
- @dev: the CPU
- @index: the index of actual entered state
- NOTE: it's important to be fast here because this operation will add to
the overall exit latency.
- */
+void cpuidle_sched_reflect(struct cpuidle_device *dev, int index) +{
- struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
- data->last_state_idx = index;
- if (index >= 0)
data->needs_update = 1;
+}
+/**
- cpuidle_sched_update - attempts to guess what happened after entry
- @drv: cpuidle driver containing state data
- @dev: the CPU
- */
+static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) +{
- struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
- int last_idx = data->last_state_idx;
- struct cpuidle_state *target = &drv->states[last_idx];
- unsigned int measured_us;
- unsigned int new_factor;
- /*
* Try to figure out how much time passed between entry to low
* power state and occurrence of the wakeup event.
*
* If the entered idle state didn't support residency measurements,
* we are basically lost in the dark how much time passed.
* As a compromise, assume we slept for the whole expected time.
*
* Any measured amount of time will include the exit latency.
* Since we are interested in when the wakeup begun, not when it
* was completed, we must subtract the exit latency. However, if
* the measured amount of time is less than the exit latency,
* assume the state was never reached and the exit latency is 0.
*/
- if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) {
/* Use timer value as is */
measured_us = data->next_timer_us;
- } else {
/* Use measured value */
measured_us = cpuidle_get_last_residency(dev);
/* Deduct exit latency */
if (measured_us > target->exit_latency)
measured_us -= target->exit_latency;
/* Make sure our coefficients do not exceed unity */
if (measured_us > data->next_timer_us)
measured_us = data->next_timer_us;
- }
- /* Update our correction ratio */
- new_factor = data->correction_factor[data->bucket];
- new_factor -= new_factor / DECAY;
- if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
new_factor += RESOLUTION * measured_us / data->next_timer_us;
- else
/*
* we were idle so long that we count it as a perfect
* prediction
*/
new_factor += RESOLUTION;
- /*
* We don't want 0 as factor; we always want at least
* a tiny bit of estimated time. Fortunately, due to rounding,
* new_factor will stay nonzero regardless of measured_us values
* and the compiler can eliminate this test as long as DECAY > 1.
*/
- if (DECAY == 1 && unlikely(new_factor == 0))
new_factor = 1;
- data->correction_factor[data->bucket] = new_factor;
- /* update the repeating-pattern data */
- data->intervals[data->interval_ptr++] = measured_us;
- if (data->interval_ptr >= INTERVALS)
data->interval_ptr = 0;
+}
+/**
- cpuidle_sched_enable_device - scans a CPU's states and does setup
- @drv: cpuidle driver
- @dev: the CPU
- */
+int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
+{
- struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu);
- int i;
- memset(data, 0, sizeof(struct sched_cpuidle_info));
- /*
* if the correction factor is 0 (eg first time init or cpu hotplug
* etc), we actually want to start out with a unity factor.
*/
- for(i = 0; i < BUCKETS; i++)
data->correction_factor[i] = RESOLUTION * DECAY;
- return 0;
+}
On Mon, 18 Aug 2014, Preeti U Murthy wrote:
On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
The goal of the power aware scheduling design is to integrate all policy, metrics and averaging into the scheduler. Today the cpu power management is fragmented and hence inconsistent.
As a first step towards this integration, rid the cpuidle state management of the governors. Retain only the cpuidle driver in the cpu idle susbsystem which acts as an interface between the scheduler and low level platform specific cpuidle drivers. For all decision making around selection of idle states,the cpuidle driver falls back to the scheduler.
The current algorithm for idle state selection is the same as the logic used by the menu governor. However going ahead the heuristics will be tuned and improved upon with metrics better known to the scheduler.
I'd strongly suggest a different approach here. Instead of copying the menu governor code and tweaking it afterwards, it would be cleaner to literally start from scratch with a new governor. Said new governor would grow inside the scheduler with more design freedom instead of being strapped on the side.
By copying existing code, the chance for cruft to remain for a long time is close to 100%. We already have one copy of it, let's keep it working and start afresh instead.
By starting clean it is way easier to explain and justify additions to a new design than convincing ourselves about the removal of no longer needed pieces from a legacy design.
Ok. The reason I did it this way was that I did not find anything grossly wrong in the current cpuidle governor algorithm. Of course this can be improved but I did not see strong reasons to completely wipe it away. I see good scope to improve upon the existing algorithm with additional knowledge of *the idle states being mapped to scheduling domains*. This will in itself give us a better algorithm and does not mandate significant changes from the current algorithm. So I really don't see why we need to start from scratch.
Sure the current algorithm can be improved. But it has its limitations by design. And simply making it more topology aware wouldn't justify moving it into the scheduler.
What we're contemplating is something completely integrated with the scheduler where cpuidle and cpufreq (and eventually thermal management) together are part of the same "governor" to provide global decisions on all fronts.
Not only should the next wake-up event be predicted, but also the anticipated system load, etc. The scheduler may know that a given CPU is unlikely to be used for a while and could call for the deepest C-state right away without waiting for the current menu heuristic to converge.
There is also Daniel's I/O latency tracking that could replace the menu governor latency guessing, the later based on heuristics that could be described as black magic.
And all this has to eventually be policed by a global performance/power concern that should weight C-states, P-states and task placement together and select the best combination (Morten's work).
Therefore the current menu algorithm won't do it. It simply wasn't designed for that.
We'll have the opportunity to discuss this further tomorrow anyway.
The primary issue that I found was that with the goal being power aware scheduler we must ensure that the possibility of a governor getting registered with cpuidle to choose idle states no longer will exist. The reason being there is just *one entity who will take this decision and there is no option about it*. This patch intends to bring the focus to this specific detail.
I think there is nothing wrong with having multiple governors being registered. We simply decide at runtime via sysfs which one has control over the low-level cpuidle drivers.
Nicolas
The goal of the power aware scheduler design is to integrate all cpu power management in the scheduler. As a first step the idle state selection was moved into the scheduler. Doing this helps better decide which idle state to enter into using metrics known by the scheduler. However the cost of entering and exiting an idle state can help the scheduler do load balancing better.It would be even better if the idle states can let the scheduler know about the impact on the cache contents when the cpu enters that state. The scheduler can make use of this data while waking up tasks or scheduling new tasks. To make way for such information to be propogated to the scheduler, enumerate idle states in the scheduler topology levels.
Doing so will also let the scheduler know the idle states that a *sched_group* can enter into at a given level of scheduling domain. This means the scheduler is implicitly made aware of the fact that idle state is not necessarily a per-cpu state, it can be a per-core state or a state shared by a group of cpus that is specified by the sched_group. The knowledge of this higher level cpuidle information is missing today too.
The low level platform cpuidle drivers must expose to the scheduler the idle states at the different topology levels. This patch takes up the powernv cpuidle driver to illustrate this. The scheduling topology is left to the arch to decide. Commit 143e1e28cb40bed836 introduced this. The platform idle drivers are thus in a better position to fill up the topology levels with appropriate cpuidle state information while they discover it themselves.
Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
drivers/cpuidle/cpuidle-powernv.c | 8 ++++++++ include/linux/sched.h | 3 +++ 2 files changed, 11 insertions(+)
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 95ef533..4232fbc 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -184,6 +184,11 @@ static int powernv_add_idle_states(void)
dt_idle_states = len_flags / sizeof(u32);
+#ifdef CONFIG_SCHED_POWER + /* Snooze is a thread level idle state; the rest are core level idle states */ + sched_domain_topology[0].states[0] = powernv_states[0]; +#endif + for (i = 0; i < dt_idle_states; i++) {
flags = be32_to_cpu(idle_state_flags[i]); @@ -209,6 +214,9 @@ static int powernv_add_idle_states(void) powernv_states[nr_idle_states].enter = &fastsleep_loop; nr_idle_states++; } +#ifdef CONFIG_SCHED_POWER + sched_domain_topology[1].states[i] = powernv_states[nr_idle_states]; +#endif }
return nr_idle_states; diff --git a/include/linux/sched.h b/include/linux/sched.h index 5dd99b5..009da6a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1027,6 +1027,9 @@ struct sched_domain_topology_level { #ifdef CONFIG_SCHED_DEBUG char *name; #endif +#ifdef CONFIG_SCHED_POWER + struct cpuidle_state states[CPUIDLE_STATE_MAX]; +#endif };
extern struct sched_domain_topology_level *sched_domain_topology;
From: Alex Shi alex.shi@intel.com
Current scheduler behavior is just consider for larger performance of system. So it try to spread tasks on more cpu sockets and cpu cores
To adding the consideration of power awareness, the patchset adds a powersaving scheduler policy. It will use runnable load util in scheduler balancing. The current scheduling is taken as performance policy.
performance: the current scheduling behaviour, try to spread tasks on more CPU sockets or cores. performance oriented. powersaving: will pack tasks into few sched group until all LCPU in the group is full, power oriented.
The incoming patches will enable powersaving scheduling in CFS.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 5 +++++ kernel/sched/sched.h | 7 +++++++ 2 files changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bfa3c86..77da534 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7800,6 +7800,11 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task return rr_interval; }
+#ifdef CONFIG_SCHED_POWER +/* The default scheduler policy is 'performance'. */ +int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE; +#endif + /* * All the scheduling class methods: */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 579712f..95fc013 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -23,6 +23,13 @@ extern atomic_long_t calc_load_tasks; extern long calc_load_fold_active(struct rq *this_rq); extern void update_cpu_load_active(struct rq *this_rq);
+#ifdef CONFIG_SCHED_POWER +#define SCHED_POLICY_PERFORMANCE (0x1) +#define SCHED_POLICY_POWERSAVING (0x2) + +extern int __read_mostly sched_balance_policy; +#endif + /* * Helpers for converting nanosecond timing to jiffy resolution */
From: Alex Shi alex.shi@intel.com
This patch add the power aware scheduler knob into sysfs:
$cat /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy performance powersaving $cat /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy powersaving
This means the using sched balance policy is 'powersaving'.
User can change the policy by commend 'echo': echo performance > /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
Documentation/ABI/testing/sysfs-devices-system-cpu | 23 +++++++ kernel/sched/fair.c | 69 ++++++++++++++++++++ 2 files changed, 92 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index acb9bfc..c7ed5c1 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -53,6 +53,29 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug the system. Information writtento the file to remove CPU's is architecture specific.
+What: /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy + /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy +Date: Oct 2012 +Contact: Linux kernel mailing list linux-kernel@vger.kernel.org +Description: CFS balance policy show and set interface. + This will get activated only when CONFIG_SCHED_POWER is set. + + available_sched_balance_policy: shows there are 2 kinds of + policies: + performance powersaving. + current_sched_balance_policy: shows current scheduler policy. + User can change the policy by writing it. + + Policy decides the CFS scheduler how to balance tasks onto + different CPU unit. + + performance: try to spread tasks onto more CPU sockets, + more CPU cores. performance oriented. + + powersaving: try to pack tasks onto same core or same CPU + until every LCPUs are busy in the core or CPU socket. + powersaving oriented. + What: /sys/devices/system/cpu/cpu#/node Date: October 2009 Contact: Linux memory management mailing list linux-mm@kvack.org diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 77da534..f1b0a33 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5863,6 +5863,75 @@ static inline int sg_imbalanced(struct sched_group *group) return group->sgc->imbalance; }
+#if defined(CONFIG_SYSFS) && defined(CONFIG_SCHED_POWER) +static ssize_t show_available_sched_balance_policy(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "performance powersaving\n"); +} + +static ssize_t show_current_sched_balance_policy(struct device *dev, + struct device_attribute *attr, char *buf) +{ + if (sched_balance_policy == SCHED_POLICY_PERFORMANCE) + return sprintf(buf, "performance\n"); + else if (sched_balance_policy == SCHED_POLICY_POWERSAVING) + return sprintf(buf, "powersaving\n"); + return 0; +} + +static ssize_t set_sched_balance_policy(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + unsigned int ret = -EINVAL; + char str_policy[16]; + + ret = sscanf(buf, "%15s", str_policy); + if (ret != 1) + return -EINVAL; + + if (!strcmp(str_policy, "performance")) + sched_balance_policy = SCHED_POLICY_PERFORMANCE; + else if (!strcmp(str_policy, "powersaving")) + sched_balance_policy = SCHED_POLICY_POWERSAVING; + else + return -EINVAL; + + return count; +} + +/* + * * Sysfs setup bits: + * */ +static DEVICE_ATTR(current_sched_balance_policy, 0644, + show_current_sched_balance_policy, set_sched_balance_policy); + +static DEVICE_ATTR(available_sched_balance_policy, 0444, + show_available_sched_balance_policy, NULL); + +static struct attribute *sched_balance_policy_default_attrs[] = { + &dev_attr_current_sched_balance_policy.attr, + &dev_attr_available_sched_balance_policy.attr, + NULL +}; +static struct attribute_group sched_balance_policy_attr_group = { + .attrs = sched_balance_policy_default_attrs, + .name = "sched_balance_policy", +}; + +int __init create_sysfs_sched_balance_policy_group(struct device *dev) +{ + return sysfs_create_group(&dev->kobj, &sched_balance_policy_attr_group); +} + +static int __init sched_balance_policy_sysfs_init(void) +{ + return create_sysfs_sched_balance_policy_group(cpu_subsys.dev_root); +} + +core_initcall(sched_balance_policy_sysfs_init); +#endif /* CONFIG_SYSFS && CONFIG_SCHED_POWER*/ + /* * Compute the group capacity factor. *
From: Alex Shi alex.shi@intel.com
The cpu's utilization is to measure how busy is the cpu. util = cpu_rq(cpu)->avg.runnable_avg_sum * SCHED_POEWR_SCALE / cpu_rq(cpu)->avg.runnable_avg_period;
Since the util is no more than 1, we scale its value with 1024, same as SCHED_POWER_SCALE and set the FULL_UTIL as 1024.
In later power aware scheduling, we are sensitive for how busy of the cpu. Since as to power consuming, it is tight related with cpu busy time.
BTW, rq->util can be used for any purposes if needed, not only power scheduling.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/debug.c | 3 +++ kernel/sched/fair.c | 15 +++++++++++++++ kernel/sched/sched.h | 9 +++++++++ 3 files changed, 27 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 627b3c3..395af7f 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -325,6 +325,9 @@ do { \
P(ttwu_count); P(ttwu_local); +#ifdef CONFIG_SCHED_POWER + P(util); +#endif
#undef P #undef P64 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f1b0a33..681ad06 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2446,10 +2446,25 @@ static inline void __update_group_entity_contrib(struct sched_entity *se) } }
+#ifdef CONFIG_SCHED_POWER +static void update_rq_util(struct rq *rq) +{ + if (rq->avg.runnable_avg_period) + rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT) + / rq->avg.runnable_avg_period; + else + rq->util = (u64)(rq->avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT); +} +#else +static void update_rq_util(struct rq *rq) {} +#endif + static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable); __update_tg_runnable_avg(&rq->avg, &rq->cfs); + + update_rq_util(rq); } #else /* CONFIG_FAIR_GROUP_SCHED */ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 95fc013..971b812 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -508,6 +508,11 @@ extern struct root_domain def_root_domain;
#endif /* CONFIG_SMP */
+/* full cpu utilization */ +#ifdef CONFIG_SCHED_POWER +#define FULL_UTIL SCHED_CAPACITY_SCALE +#endif + /* * This is the main, per-CPU runqueue data structure. * @@ -556,6 +561,10 @@ struct rq { struct sched_avg avg; #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_SCHED_POWER + unsigned int util; +#endif /* CONFIG_SCHED_POWER */ + /* * This is part of a global counter where only the total sum * over all CPUs matters. A task can increase this counter on
From: Alex Shi alex.shi@intel.com
For power aware balancing, we care the sched domain/group's utilization. So add: sd_lb_stats.sd_util and sg_lb_stats.group_util.
And want to know which group is busiest but still has capability to handle more tasks, so add: sd_lb_stats.group_leader
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 681ad06..3d6d081 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5593,6 +5593,9 @@ struct sg_lb_stats { unsigned int nr_numa_running; unsigned int nr_preferred_running; #endif +#ifdef CONFIG_SCHED_POWER + unsigned int group_util; /* sum utilization of group */ +#endif };
/* @@ -5608,6 +5611,12 @@ struct sd_lb_stats {
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ struct sg_lb_stats local_stat; /* Statistics of the local group */ + +#ifdef CONFIG_SCHED_POWER + /* Varibles of power aware scheduling */ + unsigned int sd_util; /* sum utilization of this domain */ + struct sched_group *group_leader; /* Group which relieves group_min */ +#endif };
static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
From: Alex Shi alex.shi@intel.com
Power aware fork/exec/wake balancing needs both of structs in incoming patches. So move ahead before it.
Signed-off-by: Alex Shi alex.shi@intel.com Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 89 ++++++++++++++++++++++++++------------------------- 1 file changed, 45 insertions(+), 44 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3d6d081..031d115 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4505,6 +4505,51 @@ done: }
/* + * sg_lb_stats - stats of a sched_group required for load_balancing + */ +struct sg_lb_stats { + unsigned long avg_load; /*Avg load across the CPUs of the group */ + unsigned long group_load; /* Total load over the CPUs of the group */ + unsigned long sum_weighted_load; /* Weighted load of group's tasks */ + unsigned long load_per_task; + unsigned long group_capacity; + unsigned int sum_nr_running; /* Nr tasks running in the group */ + unsigned int group_capacity_factor; + unsigned int idle_cpus; + unsigned int group_weight; + int group_imb; /* Is there an imbalance in the group ? */ + int group_has_free_capacity; +#ifdef CONFIG_NUMA_BALANCING + unsigned int nr_numa_running; + unsigned int nr_preferred_running; +#endif +#ifdef CONFIG_SCHED_POWER + unsigned int group_util; /* sum utilization of group */ +#endif +}; + +/* + * sd_lb_stats - Structure to store the statistics of a sched_domain + * during load balancing. + */ +struct sd_lb_stats { + struct sched_group *busiest; /* Busiest group in this sd */ + struct sched_group *local; /* Local group in this sd */ + unsigned long total_load; /* Total load of all groups in sd */ + unsigned long total_capacity; /* Total capacity of all groups in sd */ + unsigned long avg_load; /* Average load across all groups in sd */ + + struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ + struct sg_lb_stats local_stat; /* Statistics of the local group */ + +#ifdef CONFIG_SCHED_POWER + /* Varibles of power aware scheduling */ + unsigned int sd_util; /* sum utilization of this domain */ + struct sched_group *group_leader; /* Group which relieves group_min */ +#endif +}; + +/* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, * SD_BALANCE_FORK, or SD_BALANCE_EXEC. @@ -5574,50 +5619,6 @@ static unsigned long task_h_load(struct task_struct *p) #endif
/********** Helpers for find_busiest_group ************************/ -/* - * sg_lb_stats - stats of a sched_group required for load_balancing - */ -struct sg_lb_stats { - unsigned long avg_load; /*Avg load across the CPUs of the group */ - unsigned long group_load; /* Total load over the CPUs of the group */ - unsigned long sum_weighted_load; /* Weighted load of group's tasks */ - unsigned long load_per_task; - unsigned long group_capacity; - unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; - unsigned int idle_cpus; - unsigned int group_weight; - int group_imb; /* Is there an imbalance in the group ? */ - int group_has_free_capacity; -#ifdef CONFIG_NUMA_BALANCING - unsigned int nr_numa_running; - unsigned int nr_preferred_running; -#endif -#ifdef CONFIG_SCHED_POWER - unsigned int group_util; /* sum utilization of group */ -#endif -}; - -/* - * sd_lb_stats - Structure to store the statistics of a sched_domain - * during load balancing. - */ -struct sd_lb_stats { - struct sched_group *busiest; /* Busiest group in this sd */ - struct sched_group *local; /* Local group in this sd */ - unsigned long total_load; /* Total load of all groups in sd */ - unsigned long total_capacity; /* Total capacity of all groups in sd */ - unsigned long avg_load; /* Average load across all groups in sd */ - - struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ - struct sg_lb_stats local_stat; /* Statistics of the local group */ - -#ifdef CONFIG_SCHED_POWER - /* Varibles of power aware scheduling */ - unsigned int sd_util; /* sum utilization of this domain */ - struct sched_group *group_leader; /* Group which relieves group_min */ -#endif -};
static inline void init_sd_lb_stats(struct sd_lb_stats *sds) {
From: Alex Shi alex.shi@intel.com
Since the rt task priority is higher than fair tasks, cfs_rq utilization is just the left of rt utilization.
When there are some cfs tasks in queue, the potential utilization may be yielded, so mulitiplying cfs task number to get max potential utilization of cfs. Then the rq utilization is sum of rt util and cfs util.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 031d115..60abaf4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3308,6 +3308,53 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq) return cfs_bandwidth_used() && cfs_rq->throttle_count; }
+#ifdef CONFIG_SCHED_POWER +static unsigned long scale_rt_util(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + u64 total, age_stamp, avg; + s64 delta; + + /* + * Since we're reading these variables without serialization make sure + * we read them once before doing sanity checks on them. + */ + age_stamp = ACCESS_ONCE(rq->age_stamp); + avg = ACCESS_ONCE(rq->rt_avg); + + delta = rq_clock(rq) - age_stamp; + if (unlikely(delta < 0)) + delta = 0; + + total = sched_avg_period() + delta; + + if (unlikely(total < avg)) { + /* Ensures that capacity won't go beyond full scaled value */ + return SCHED_CAPACITY_SCALE; + } + + if (unlikely((s64)total < SCHED_CAPACITY_SCALE)) + total = SCHED_CAPACITY_SCALE; + + total >>= SCHED_CAPACITY_SHIFT; + + return div_u64(avg, total); +} + +static unsigned int max_rq_util(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + unsigned int rt_util = scale_rt_util(cpu); + unsigned int cfs_util; + unsigned int nr_running; + + cfs_util = (FULL_UTIL - rt_util) > rq->util ? rq->util + : (FULL_UTIL - rt_util); + nr_running = rq->nr_running ? rq->nr_running : 1; + + return rt_util + cfs_util * nr_running; +} +#endif /* * Ensure that neither of the group entities corresponding to src_cpu or * dest_cpu are members of a throttled hierarchy when performing group
From: Alex Shi alex.shi@intel.com
rq->avg_idle is 'to used to accommodate bursty loads in a dirt simple dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can find the wake up burst, and just use nr_running as instant utilization only.
The 'sysctl_sched_burst_threshold' used for wakeup burst, set it as double of sysctl_sched_migration_cost.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
include/linux/sched/sysctl.h | 3 +++ kernel/sched/fair.c | 4 ++++ kernel/sysctl.c | 9 +++++++++ 3 files changed, 16 insertions(+)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 596a0e0..0a3307b 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -55,6 +55,9 @@ extern unsigned int sysctl_numa_balancing_scan_size;
#ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; +#ifdef CONFIG_SCHED_POWER +extern unsigned int sysctl_sched_burst_threshold; +#endif /* CONFIG_SCHED_POWER */ extern unsigned int sysctl_sched_nr_migrate; extern unsigned int sysctl_sched_time_avg; extern unsigned int sysctl_timer_migration; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 60abaf4..20e2414 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -92,6 +92,10 @@ unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+#ifdef CONFIG_SCHED_POWER +const_debug unsigned int sysctl_sched_burst_threshold = 1000000UL; +#endif + /* * The exponential sliding window over which load is averaged for shares * distribution. diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 75875a7..8175d93 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -329,6 +329,15 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_SCHED_POWER + { + .procname = "sched_burst_threshold_ns", + .data = &sysctl_sched_burst_threshold, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif /* CONFIG_SCHED_POWER */ { .procname = "sched_nr_migrate", .data = &sysctl_sched_nr_migrate,
From: Alex Shi alex.shi@intel.com
This patch add power aware scheduling in fork/exec/wake. It try to select cpu from the busiest while still has utilization group. That's will save power since it leaves more groups idle in system.
The trade off is adding a power aware statistics collection in group seeking. But since the collection just happened in power scheduling eligible condition, the worst case of hackbench testing just drops about 2% with powersaving policy. No clear change for performance policy.
The main function in this patch is get_cpu_for_power_policy(), that will try to get a idlest cpu from the busiest while still has utilization group, if the system is using power aware policy and has such group.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 114 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 20e2414..e993f1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4600,6 +4600,103 @@ struct sd_lb_stats { #endif };
+#ifdef CONFIG_SCHED_POWER +/* + * Try to collect the task running number and capacity of the group. + */ +static void get_sg_power_stats(struct sched_group *group, + struct sched_domain *sd, struct sg_lb_stats *sgs) +{ + int i; + + for_each_cpu(i, sched_group_cpus(group)) + sgs->group_util += max_rq_util(i); + + sgs->group_weight = group->group_weight; +} + +/* + * Is this domain full of utilization with the task? + */ +static int is_sd_full(struct sched_domain *sd, + struct task_struct *p, struct sd_lb_stats *sds) +{ + struct sched_group *group; + struct sg_lb_stats sgs; + long sd_min_delta = LONG_MAX; + unsigned int putil; + + if (p->se.load.weight == p->se.avg.load_avg_contrib) + /* p maybe a new forked task */ + putil = FULL_UTIL; + else + putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT) + / (p->se.avg.runnable_avg_period + 1); + + /* Try to collect the domain's utilization */ + group = sd->groups; + do { + long g_delta; + + memset(&sgs, 0, sizeof(sgs)); + get_sg_power_stats(group, sd, &sgs); + + g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util; + + if (g_delta > 0 && g_delta < sd_min_delta) { + sd_min_delta = g_delta; + sds->group_leader = group; + } + + sds->sd_util += sgs.group_util; + } while (group = group->next, group != sd->groups); + + if (sds->sd_util + putil < sd->span_weight * FULL_UTIL) + return 0; + + /* can not hold one more task in this domain */ + return 1; +} + +/* + * Execute power policy if this domain is not full. + */ +static inline int get_sd_sched_balance_policy(struct sched_domain *sd, + int cpu, struct task_struct *p, struct sd_lb_stats *sds) +{ + if (sched_balance_policy == SCHED_POLICY_PERFORMANCE) + return SCHED_POLICY_PERFORMANCE; + + memset(sds, 0, sizeof(*sds)); + if (is_sd_full(sd, p, sds)) + return SCHED_POLICY_PERFORMANCE; + return sched_balance_policy; +} + +/* + * If power policy is eligible for this domain, and it has task allowed cpu. + * we will select CPU from this domain. + */ +static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu, + struct task_struct *p, struct sd_lb_stats *sds) +{ + int policy; + int new_cpu = -1; + + policy = get_sd_sched_balance_policy(sd, cpu, p, sds); + if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) + new_cpu = find_idlest_cpu(sds->group_leader, p, cpu); + + return new_cpu; +} +#else +static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu, + struct task_struct *p, struct sd_lb_stats *sds) +{ + return -1; +} +#endif /* CONFIG_SCHED_POWER */ + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -4608,6 +4705,9 @@ struct sd_lb_stats { * Balances load by selecting the idlest cpu in the idlest group, or under * certain conditions an idle sibling cpu if the domain has SD_WAKE_AFFINE set. * + * If CONFIG_SCHED_POWER is set and SCHED_POLICY_POWERSAVE is enabled, the power + * aware scheduler kicks in. It returns a cpu appropriate for power savings. + * * Returns the target cpu number. * * preempt must be disabled. @@ -4620,6 +4720,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f int new_cpu = cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; + struct sd_lb_stats sds;
if (p->nr_cpus_allowed == 1) return prev_cpu; @@ -4645,12 +4746,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f break; }
- if (tmp->flags & sd_flag) + if (tmp->flags & sd_flag) { sd = tmp; + + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds); + if (new_cpu != -1) + goto unlock; + } } + if (affine_sd) { + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds); + if (new_cpu != -1) + goto unlock;
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + prev_cpu = cpu; + }
if (sd_flag & SD_BALANCE_WAKE) { new_cpu = select_idle_sibling(p, prev_cpu);
From: Alex Shi alex.shi@intel.com
Sleeping task has no utiliation, when they were bursty waked up, the zero utilization make scheduler out of balance, like aim7 benchmark.
rq->avg_idle is 'to used to accommodate bursty loads in a dirt simple dirt cheap manner' -- Mike Galbraith.
With this cheap and smart bursty indicator, we can find the wake up burst, and use nr_running as instant utilization in this scenario.
For other scenarios, we still use the precise CPU utilization to judage if a domain is eligible for power scheduling.
Thanks for Mike Galbraith's idea!
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++------- 1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e993f1c..3db77e8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4605,12 +4605,19 @@ struct sd_lb_stats { * Try to collect the task running number and capacity of the group. */ static void get_sg_power_stats(struct sched_group *group, - struct sched_domain *sd, struct sg_lb_stats *sgs) + struct sched_domain *sd, struct sg_lb_stats *sgs, int burst) { int i;
- for_each_cpu(i, sched_group_cpus(group)) - sgs->group_util += max_rq_util(i); + for_each_cpu(i, sched_group_cpus(group)) { + struct rq *rq = cpu_rq(i); + + if (burst && rq->nr_running > 1) + /* use nr_running as instant utilization */ + sgs->group_util += rq->nr_running; + else + sgs->group_util += max_rq_util(i); + }
sgs->group_weight = group->group_weight; } @@ -4624,6 +4631,8 @@ static int is_sd_full(struct sched_domain *sd, struct sched_group *group; struct sg_lb_stats sgs; long sd_min_delta = LONG_MAX; + int cpu = task_cpu(p); + int burst = 0; unsigned int putil;
if (p->se.load.weight == p->se.avg.load_avg_contrib) @@ -4633,15 +4642,21 @@ static int is_sd_full(struct sched_domain *sd, putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT) / (p->se.avg.runnable_avg_period + 1);
+ if (cpu_rq(cpu)->avg_idle < sysctl_sched_burst_threshold) + burst = 1; + /* Try to collect the domain's utilization */ group = sd->groups; do { long g_delta;
memset(&sgs, 0, sizeof(sgs)); - get_sg_power_stats(group, sd, &sgs); + get_sg_power_stats(group, sd, &sgs, burst);
- g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util; + if (burst) + g_delta = sgs.group_weight - sgs.group_util; + else + g_delta = sgs.group_weight * FULL_UTIL - sgs.group_util;
if (g_delta > 0 && g_delta < sd_min_delta) { sd_min_delta = g_delta; @@ -4651,8 +4666,12 @@ static int is_sd_full(struct sched_domain *sd, sds->sd_util += sgs.group_util; } while (group = group->next, group != sd->groups);
- if (sds->sd_util + putil < sd->span_weight * FULL_UTIL) - return 0; + if (burst) { + if (sds->sd_util < sd->span_weight) + return 0; + } else + if (sds->sd_util + putil < sd->span_weight * FULL_UTIL) + return 0;
/* can not hold one more task in this domain */ return 1;
From: Alex Shi alex.shi@intel.com
If the waked task is transitory enough, it will has a chance to be packed into a cpu which is busy but still has time to care it.
For powersaving policy, only the history util < 25% task has chance to be packed. If there is no cpu eligible to handle it, will use a idlest cpu in leader group.
Morten Rasmussen catch a type bug. And PeterZ reminder to consider rt_util. thanks you!
Inspired-by: Vincent Guittot vincent.guittot@linaro.org Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3db77e8..e7a677e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4693,24 +4693,65 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd, }
/* + * find_leader_cpu - find the busiest but still has enough free time cpu + * among the cpus in group. + */ +static int +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu, + int policy) +{ + int vacancy, min_vacancy = INT_MAX; + int leader_cpu = -1; + int i; + /* percentage of the task's util */ + unsigned putil = (u64)(p->se.avg.runnable_avg_sum << SCHED_CAPACITY_SHIFT) + / (p->se.avg.runnable_avg_period + 1); + + /* bias toward local cpu */ + if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) && + FULL_UTIL - max_rq_util(this_cpu) - (putil << 2) > 0) + return this_cpu; + + /* Traverse only the allowed CPUs */ + for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { + if (i == this_cpu) + continue; + + /* only light task allowed, putil < 25% */ + vacancy = FULL_UTIL - max_rq_util(i) - (putil << 2); + + if (vacancy > 0 && vacancy < min_vacancy) { + min_vacancy = vacancy; + leader_cpu = i; + } + } + return leader_cpu; +} + +/* * If power policy is eligible for this domain, and it has task allowed cpu. * we will select CPU from this domain. */ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu, - struct task_struct *p, struct sd_lb_stats *sds) + struct task_struct *p, struct sd_lb_stats *sds, int wakeup) { int policy; int new_cpu = -1;
policy = get_sd_sched_balance_policy(sd, cpu, p, sds); - if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) - new_cpu = find_idlest_cpu(sds->group_leader, p, cpu); - + if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) { + if (wakeup) + new_cpu = find_leader_cpu(sds->group_leader, + p, cpu, policy); + /* for fork balancing and a little busy task */ + if (new_cpu == -1) + new_cpu = find_idlest_cpu(sds->group_leader, p, cpu); + } return new_cpu; } #else static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu, - struct task_struct *p, struct sd_lb_stats *sds) + struct task_struct *p, struct sd_lb_stats *sds, int wakeup) { return -1; } @@ -4768,13 +4809,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (tmp->flags & sd_flag) { sd = tmp;
- new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds); + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds, + sd_flag & SD_BALANCE_WAKE); if (new_cpu != -1) goto unlock; } } if (affine_sd) { - new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds); + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1); if (new_cpu != -1) goto unlock;
From: Alex Shi alex.shi@intel.com
If a sched domain is idle enough for regular power balance, LBF_POWER_BAL will be set, LBF_PERF_BAL will be clean. If a sched domain is busy, their value will be set oppositely.
If the domain is suitable for power balance, but balance should not be down by this cpu(this cpu is already idle or full), both of flags are cleared to wait a suitable cpu to do power balance. That mean no any balance, neither power balance nor performance balance will be done on this cpu.
Above logical will be implemented by incoming patches.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e7a677e..f9b2a21 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5372,6 +5372,11 @@ enum fbq_type { regular, remote, all }; #define LBF_DST_PINNED 0x04 #define LBF_SOME_PINNED 0x08
+#ifdef CONFIG_SCHED_POWER +#define LBF_POWER_BAL 0x08 /* if power balance allowed */ +#define LBF_PERF_BAL 0x10 /* if performance balance allowed */ +#endif + struct lb_env { struct sched_domain *sd;
@@ -6866,6 +6871,9 @@ static int load_balance(int this_cpu, struct rq *this_rq, .idle = idle, .loop_break = sched_nr_migrate_break, .cpus = cpus, +#ifdef CONFIG_SCHED_POWER + .flags = LBF_PERF_BAL, +#endif .fbq_type = all, };
From: Alex Shi alex.shi@intel.com
In power balance, we hope some sched groups are fully empty to save CPU power of them. So, we want to move any tasks from them.
Also in power aware scheduling, we don't want to balance 'prefer_sibling' groups just because local group has capacity. If the local group has no tasks at the time, that is the power balance hope so.
Signed-off-by: Alex Shi alex.shi@intel.com [Added CONFIG_SCHED_POWER switch to enable this patch] Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f9b2a21..fd93eaf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6346,6 +6346,21 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) } #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_SCHED_POWER +static int get_power_policy(struct lb_env *env) +{ + if (env->flags & LBF_PERF_BAL) + return 0; + else + return 1; +} +#else +static int get_power_policy(struct lb_env *env) +{ + return 0; +} +#endif /* CONFIG_SCHED_POWER */ + /** * update_sd_lb_stats - Update sched_domain's statistics for load balancing. * @env: The load balancing environment. @@ -6358,6 +6373,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0; bool overload = false; + int powersave = 0;
if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1; @@ -6393,9 +6409,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * extra check prevents the case where you always pull from the * heaviest group when it is already under-utilized (possible * with a large weight task outweighs the tasks on the system). + * + * In power aware scheduling, we don't care load weight and + * want not to pull tasks just because local group has capacity. */ + powersave = get_power_policy(env); + if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) + sds->local_stat.group_has_free_capacity && !powersave) sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
if (update_sd_pick_busiest(env, sds, sg, sgs)) { @@ -6761,8 +6782,15 @@ static struct rq *find_busiest_queue(struct lb_env *env, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu capacity. */ +#ifdef CONFIG_SCHED_POWER + if (rq->nr_running == 0 || + (!(env->flags & LBF_POWER_BAL) && capacity_factor && + rq->nr_running == 1 && wl > env->imbalance)) + continue; +#else if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance) continue; +#endif /* CONFIG_SCHED_POWER */
/* * For the load comparisons with the other cpu's, consider @@ -6848,6 +6876,25 @@ static int should_we_balance(struct lb_env *env) return balance_cpu == env->dst_cpu; }
+#ifdef CONFIG_SCHED_POWER +static int is_busiest_eligible(struct rq *rq, struct lb_env *env) +{ + if (rq->nr_running > 1 || + (rq->nr_running == 1 && env->flags & LBF_POWER_BAL)) + return 1; + else + return 0; +} +#else +static int is_busiest_eligible(struct rq *rq, struct lb_env *env) +{ + if (rq->nr_running > 1) + return 1; + else + return 0; +} +#endif /* CONFIG_SCHED_POWER */ + /* * Check this_cpu to ensure it is balanced within domain. Attempt to move * tasks if there is an imbalance. @@ -6911,7 +6958,7 @@ redo: schedstat_add(sd, lb_imbalance[idle], env.imbalance);
ld_moved = 0; - if (busiest->nr_running > 1) { + if (is_busiest_eligible(busiest, &env)) { /* * Attempt to move tasks. If find_busiest_group has found * an imbalance but busiest->nr_running <= 1, the group is
From: Alex Shi alex.shi@intel.com
Added 4 new members in sb_lb_stats, that will be used in incomming power aware balance.
group_min; //least utliszation group in domain min_load_per_task; //load_per_task in group_min leader_util; // sum utilizations of group_leader min_util; // sum utilizations of group_min
Signed-off-by: Alex Shi alex.shi@intel.com Signed-off-by: Preeti U Murthy preeti@linux.vnet.ibm.com ---
kernel/sched/fair.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fd93eaf..6d40aa3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4597,6 +4597,10 @@ struct sd_lb_stats { /* Varibles of power aware scheduling */ unsigned int sd_util; /* sum utilization of this domain */ struct sched_group *group_leader; /* Group which relieves group_min */ + struct sched_group *group_min; /* Least loaded group in sd */ + unsigned long min_load_per_task; /* load_per_task in group_min */ + unsigned int leader_util; /* sum utilizations of group_leader */ + unsigned int min_util; /* sum utilizations of group_min */ #endif };
linaro-kernel@lists.linaro.org