This series is posted for posterity. It has been NAK'd by the community since CPU hotplug has been deemed an inappropriate mechanism for power capping.
CPUoffline is a framework for taking CPUs offline via the hotplug mechanism. The framework itself is quite straightforward: a driver arranges the CPUs into partitions. Each partition is associated with a governor thread, and that thread implements a policy for taking CPUs in that partition offline or online based on some heuristic.
The CPUoffline core code includes a default driver that places all possible CPUs into a single partition, requiring no code to be written for a new platform. There is also a single governor named "avgload" which looks at the average load of all of the *online* CPUs in a partition and makes a hotplug decision based on defined thresholds.
This framework owes a lot to CPUfreq and CPUidle, from which CPUoffline stole^H^H^H^H^H borrowed lots of code.
Note: since development was cut short by the community response, some infrastructure bits are missing, such as module unregistration and dynamic governor switching. The code does work fine as-is for the curious-minded who want to test on an SMP system that supports hotplug.
Mike Turquette (6):
  ARM: do not mark CPU 0 as hotpluggable
  cpumask: introduce cpumask for hotpluggable CPUs
  cpu: update cpu_hotpluggable_mask in register_cpu
  cpuoffline core
  governors
  arm kconfig
 arch/arm/Kconfig                       |    2 +
 arch/arm/kernel/setup.c                |    3 +-
 drivers/Makefile                       |    1 +
 drivers/base/cpu.c                     |    4 +-
 drivers/cpuoffline/Kconfig             |   26 ++
 drivers/cpuoffline/Makefile            |    2 +
 drivers/cpuoffline/cpuoffline.c        |  488 ++++++++++++++++++++++++++++++++
 drivers/cpuoffline/governors/Kconfig   |    9 +
 drivers/cpuoffline/governors/Makefile  |    2 +
 drivers/cpuoffline/governors/avgload.c |  255 +++++++++++++++++
 include/linux/cpumask.h                |   27 ++-
 include/linux/cpuoffline.h             |   82 ++++++
 kernel/cpu.c                           |   18 ++
 13 files changed, 912 insertions(+), 7 deletions(-)
 create mode 100644 drivers/cpuoffline/Kconfig
 create mode 100644 drivers/cpuoffline/Makefile
 create mode 100644 drivers/cpuoffline/cpuoffline.c
 create mode 100644 drivers/cpuoffline/governors/Kconfig
 create mode 100644 drivers/cpuoffline/governors/Makefile
 create mode 100644 drivers/cpuoffline/governors/avgload.c
 create mode 100644 include/linux/cpuoffline.h
A quick poll of the ARM platforms that implement CPU Hotplug support shows that every platform treats CPU 0 as a special case that cannot be hotplugged. In fact every platform has identical code for platform_cpu_die which returns -EPERM in the case of CPU 0.
The user-facing sysfs interfaces should reflect this by not populating an 'online' entry for CPU 0 at all. This better reflects reality by making it clear to users that CPU 0 cannot be hotplugged.
This patch prevents CPU 0 from being marked as hotpluggable on all ARM platforms during CPU registration. This in turn prevents the creation of an 'online' sysfs interface for that CPU.
Signed-off-by: Mike Turquette <mturquette@ti.com>
---
 arch/arm/kernel/setup.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 70bca64..5f3f4bb 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -949,7 +949,8 @@ static int __init topology_init(void)

 	for_each_possible_cpu(cpu) {
 		struct cpuinfo_arm *cpuinfo = &per_cpu(cpu_data, cpu);
-		cpuinfo->cpu.hotpluggable = 1;
+		if (cpu)
+			cpuinfo->cpu.hotpluggable = 1;
 		register_cpu(&cpuinfo->cpu, cpu);
 	}
On some platforms it is possible to have some CPUs which support CPU hotplug and some which do not. Currently the presence of an 'online' sysfs entry is adequate for userspace applications to know that a CPU supports hotplug, but there is no convenient way to make the same determination in the kernel.
To better model this relationship this patch introduces a new cpumask to track CPUs that support CPU hotplug operations.
This new cpumask is populated at boot-time and remains static for the life of the machine. Bits set in the mask indicate a CPU which supports hotplug, but make no guarantees about whether that CPU is currently online or not. Likewise a cleared bit in the mask indicates either a CPU which cannot hotplug or a lack of a populated CPU.
The purpose of this new cpumask is to aid kernel code which uses CPU hotplug to take CPUs online and offline. Possible uses are as a thermal event mitigation technique or as a power capping mechanism.
Signed-off-by: Mike Turquette <mturquette@ti.com>
---
 include/linux/cpumask.h |   27 ++++++++++++++++++++++-----
 kernel/cpu.c            |   18 ++++++++++++++++++
 2 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 4f7a632..3569cd3 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -39,10 +39,11 @@ extern int nr_cpu_ids;
  * The following particular system cpumasks and operations manage
  * possible, present, active and online cpus.
  *
- * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
- * cpu_present_mask - has bit 'cpu' set iff cpu is populated
- * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
- * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
+ * cpu_possible_mask - has bit 'cpu' set iff cpu is populatable
+ * cpu_hotpluggable_mask - has bit 'cpu' set iff cpu is hotpluggable
+ * cpu_present_mask - has bit 'cpu' set iff cpu is populated
+ * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
+ * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
  *
  * If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
  *
@@ -51,7 +52,11 @@ extern int nr_cpu_ids;
  *  life of that system boot.  The cpu_present_mask is dynamic(*),
  *  representing which CPUs are currently plugged in.  And
  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
- *  indicating those CPUs available for scheduling.
+ *  indicating those CPUs available for scheduling.  The
+ *  cpu_hotpluggable_mask is also fixed at boot time as the set of CPU
+ *  id's which are possible AND can hotplug.  Cleared bits in this mask
+ *  mean that either the CPU is not possible, or it is possible but does
+ *  not support CPU hotplug operations.
  *
  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
@@ -61,6 +66,9 @@ extern int nr_cpu_ids;
  *  depending on what ACPI reports as currently plugged in, otherwise
  *  cpu_present_mask is just a copy of cpu_possible_mask.
  *
+ *  If HOTPLUG is not enabled then cpu_hotpluggable_mask is the empty
+ *  set.
+ *
  * (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
  *     hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
  *
@@ -76,6 +84,7 @@ extern int nr_cpu_ids;
  */

 extern const struct cpumask *const cpu_possible_mask;
+extern const struct cpumask *const cpu_hotpluggable_mask;
 extern const struct cpumask *const cpu_online_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
@@ -85,19 +94,23 @@ extern const struct cpumask *const cpu_active_mask;
 #define num_possible_cpus() cpumask_weight(cpu_possible_mask)
 #define num_present_cpus() cpumask_weight(cpu_present_mask)
 #define num_active_cpus() cpumask_weight(cpu_active_mask)
+#define num_hotpluggable_cpus() cpumask_weight(cpu_hotpluggable_mask)
 #define cpu_online(cpu) cpumask_test_cpu((cpu), cpu_online_mask)
 #define cpu_possible(cpu) cpumask_test_cpu((cpu), cpu_possible_mask)
 #define cpu_present(cpu) cpumask_test_cpu((cpu), cpu_present_mask)
 #define cpu_active(cpu) cpumask_test_cpu((cpu), cpu_active_mask)
+#define cpu_hotpluggable(cpu) cpumask_test_cpu((cpu), cpu_hotpluggable_mask)
 #else
 #define num_online_cpus() 1U
 #define num_possible_cpus() 1U
 #define num_present_cpus() 1U
 #define num_active_cpus() 1U
+#define num_hotpluggable_cpus() 0
 #define cpu_online(cpu) ((cpu) == 0)
 #define cpu_possible(cpu) ((cpu) == 0)
 #define cpu_present(cpu) ((cpu) == 0)
 #define cpu_active(cpu) ((cpu) == 0)
+#define cpu_hotpluggable(cpu) 0
 #endif

 /* verify cpu argument to cpumask_* operators */
@@ -692,16 +705,20 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
 #define cpu_none_mask to_cpumask(cpu_bit_bitmap[0])

 #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
+#define for_each_hotpluggable_cpu(cpu) \
+	for_each_cpu((cpu), cpu_hotpluggable_mask)
 #define for_each_online_cpu(cpu) for_each_cpu((cpu), cpu_online_mask)
 #define for_each_present_cpu(cpu) for_each_cpu((cpu), cpu_present_mask)

 /* Wrappers for arch boot code to manipulate normally-constant masks */
 void set_cpu_possible(unsigned int cpu, bool possible);
+void set_cpu_hotpluggable(unsigned int cpu, bool hotpluggable);
 void set_cpu_present(unsigned int cpu, bool present);
 void set_cpu_online(unsigned int cpu, bool online);
 void set_cpu_active(unsigned int cpu, bool active);
 void init_cpu_present(const struct cpumask *src);
 void init_cpu_possible(const struct cpumask *src);
+void init_cpu_hotpluggable(const struct cpumask *src);
 void init_cpu_online(const struct cpumask *src);

 /**
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 12b7458..8c397c9 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -536,6 +536,11 @@ static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
 EXPORT_SYMBOL(cpu_possible_mask);

+static DECLARE_BITMAP(cpu_hotpluggable_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_hotpluggable_mask =
+	to_cpumask(cpu_hotpluggable_bits);
+EXPORT_SYMBOL(cpu_hotpluggable_mask);
+
 static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_online_mask = to_cpumask(cpu_online_bits);
 EXPORT_SYMBOL(cpu_online_mask);
@@ -556,6 +561,14 @@ void set_cpu_possible(unsigned int cpu, bool possible)
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_possible_bits));
 }

+void set_cpu_hotpluggable(unsigned int cpu, bool hotpluggable)
+{
+	if (hotpluggable)
+		cpumask_set_cpu(cpu, to_cpumask(cpu_hotpluggable_bits));
+	else
+		cpumask_clear_cpu(cpu, to_cpumask(cpu_hotpluggable_bits));
+}
+
 void set_cpu_present(unsigned int cpu, bool present)
 {
 	if (present)
@@ -590,6 +603,11 @@ void init_cpu_possible(const struct cpumask *src)
 	cpumask_copy(to_cpumask(cpu_possible_bits), src);
 }

+void init_cpu_hotpluggable(const struct cpumask *src)
+{
+	cpumask_copy(to_cpumask(cpu_hotpluggable_bits), src);
+}
+
 void init_cpu_online(const struct cpumask *src)
 {
 	cpumask_copy(to_cpumask(cpu_online_bits), src);
Update the cpu_hotpluggable_mask for each registered CPU which supports hotplug. This makes it trivial for kernel code to know which CPUs support hotplug operations.
Signed-off-by: Mike Turquette <mturquette@ti.com>
---
 drivers/base/cpu.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 251acea..91ddcf8 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -224,8 +224,10 @@ int __cpuinit register_cpu(struct cpu *cpu, int num)

 	error = sysdev_register(&cpu->sysdev);

-	if (!error && cpu->hotpluggable)
+	if (!error && cpu->hotpluggable) {
 		register_cpu_control(cpu);
+		set_cpu_hotpluggable(num, true);
+	}
 	if (!error)
 		per_cpu(cpu_sys_devices, num) = &cpu->sysdev;
 	if (!error)
---
 drivers/Makefile                |    1 +
 drivers/cpuoffline/Kconfig      |   26 ++
 drivers/cpuoffline/Makefile     |    2 +
 drivers/cpuoffline/cpuoffline.c |  488 +++++++++++++++++++++++++++++++++++++++
 include/linux/cpuoffline.h      |   82 +++++++
 5 files changed, 599 insertions(+), 0 deletions(-)
 create mode 100644 drivers/cpuoffline/Kconfig
 create mode 100644 drivers/cpuoffline/Makefile
 create mode 100644 drivers/cpuoffline/cpuoffline.c
 create mode 100644 include/linux/cpuoffline.h
diff --git a/drivers/Makefile b/drivers/Makefile
index dde8076..d41e183 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_EISA)		+= eisa/
 obj-y				+= lguest/
 obj-$(CONFIG_CPU_FREQ)		+= cpufreq/
 obj-$(CONFIG_CPU_IDLE)		+= cpuidle/
+obj-$(CONFIG_CPU_OFFLINE)	+= cpuoffline/
 obj-$(CONFIG_MMC)		+= mmc/
 obj-$(CONFIG_MEMSTICK)		+= memstick/
 obj-y				+= leds/
diff --git a/drivers/cpuoffline/Kconfig b/drivers/cpuoffline/Kconfig
new file mode 100644
index 0000000..57057d4
--- /dev/null
+++ b/drivers/cpuoffline/Kconfig
@@ -0,0 +1,26 @@
+config CPU_OFFLINE
+	bool "CPUoffline framework"
+	help
+	  CPUoffline provides a framework that allows for taking CPUs
+	  offline via an in-kernel governor.  The governor itself can
+	  implement any number of policies for deciding to offline a
+	  core.  Though primarily used for power capping, CPUoffline can
+	  also be used to implement a thermal duty cycle to prevent core
+	  over-heating, etc.
+
+	  For details please see file:Documentation/cpuoffline.
+
+	  If in doubt, say N.
+
+config CPU_OFFLINE_DEFAULT_DRIVER
+	bool "CPUoffline default driver"
+	depends on CPU_OFFLINE
+	help
+	  A default driver that creates a single partition containing
+	  all possible CPUs.  The benefit of this driver is that a
+	  platform does not need any new code to make use of the
+	  CPUoffline framework.  Do not select this if your platform
+	  implements its own driver for registering partitions and CPUs
+	  with the CPUoffline framework.
+
+	  If in doubt, say N.
diff --git a/drivers/cpuoffline/Makefile b/drivers/cpuoffline/Makefile new file mode 100644 index 0000000..0b5aa59 --- /dev/null +++ b/drivers/cpuoffline/Makefile @@ -0,0 +1,2 @@ +# CPUoffline core +obj-$(CONFIG_CPU_OFFLINE) += cpuoffline.o diff --git a/drivers/cpuoffline/cpuoffline.c b/drivers/cpuoffline/cpuoffline.c new file mode 100644 index 0000000..0427df3 --- /dev/null +++ b/drivers/cpuoffline/cpuoffline.c @@ -0,0 +1,488 @@ +/* + * CPU Offline framework core + * + * Copyright (C) 2011 Texas Instruments, Inc. + * Mike Turquette mturquette@ti.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/mutex.h> +#include <linux/cpuoffline.h> +#include <linux/slab.h> +//#include <linux/kobject.h> +#include <linux/sysfs.h> +#include <linux/err.h> + +#define MAX_CPU_LEN 8 + +static int nr_partitions = 0; + +static struct cpuoffline_driver *cpuoffline_driver; +DEFINE_MUTEX(cpuoffline_driver_mutex); + +static LIST_HEAD(cpuoffline_governor_list); +static DEFINE_MUTEX(cpuoffline_governor_mutex); + +static DEFINE_PER_CPU(struct cpuoffline_partition *, cpuoffline_partition); + +struct kobject *cpuoffline_global_kobject; +EXPORT_SYMBOL(cpuoffline_global_kobject); + +/* sysfs interfaces */ + +static struct cpuoffline_governor *__find_governor(const char *str_governor) +{ + struct cpuoffline_governor *gov; + + list_for_each_entry(gov, &cpuoffline_governor_list, governor_list) + if (!strnicmp(str_governor, gov->name, MAX_NAME_LEN)) + return gov; + + return NULL; +} + +static ssize_t current_governor_show(struct cpuoffline_partition *partition, + char *buf) +{ + struct cpuoffline_governor *gov; + + gov = partition->governor; + + if (!gov) + return 0; + + return snprintf(buf, MAX_NAME_LEN, "%s\n", gov->name); +} + +static ssize_t current_governor_store(struct cpuoffline_partition *partition, + const char *buf, size_t 
count) +{ + int ret; + char govstring[MAX_NAME_LEN]; + struct cpuoffline_governor *gov, *tempgov; + + gov = partition->governor; + + ret = sscanf(buf, "%15s", govstring); + + if (ret != 1) + return -EINVAL; + + tempgov = __find_governor(govstring); + + if (!tempgov) + return -EINVAL; + + if (!try_module_get(tempgov->owner)) + return -EINVAL; + + /* XXX should gov->stop handle the module put? probably not */ + if (gov) { + gov->stop(partition); + module_put(gov->owner); + } + + /* XXX kfree the governor? is this a memleak? */ + partition->governor = gov = tempgov; + + gov->start(partition); + + return count; +} + +static ssize_t available_governors_show(struct cpuoffline_partition *partition, + char *buf) +{ + ssize_t ret = 0; + struct cpuoffline_governor *gov; + + list_for_each_entry(gov, &cpuoffline_governor_list, governor_list) + ret += snprintf(buf, MAX_NAME_LEN, "%s\n", gov->name); + + return ret; +} + +static ssize_t partition_show(struct kobject *kobj, struct attribute *attr, + char *buf) +{ + struct cpuoffline_partition *partition; + struct cpuoffline_attribute *c_attr; + ssize_t ret; + + partition = container_of(kobj, struct cpuoffline_partition, kobj); + c_attr = container_of(attr, struct cpuoffline_attribute, attr); + + if (!partition || !c_attr) + return -EINVAL; + + mutex_lock(&partition->mutex); + /* refcount++ */ + kobject_get(&partition->kobj); + + if (c_attr->show) + ret = c_attr->show(partition, buf); + else + ret = -EIO; + + /* refcount-- */ + kobject_put(&partition->kobj); + + mutex_unlock(&partition->mutex); + return ret; +} + +static ssize_t partition_store(struct kobject *kobj, struct attribute *attr, + const char *buf, size_t count) +{ + struct cpuoffline_partition *partition; + struct cpuoffline_attribute *c_attr; + ssize_t ret = -EINVAL; + + partition = container_of(kobj, struct cpuoffline_partition, kobj); + c_attr = container_of(attr, struct cpuoffline_attribute, attr); + + if (!partition || !c_attr) + goto out; + + 
mutex_lock(&partition->mutex); + /* refcount++ */ + kobject_get(&partition->kobj); + + if(c_attr->store) + ret = c_attr->store(partition, buf, count); + else + ret = -EIO; + + /* refcount-- */ + kobject_put(&partition->kobj); + +out: + mutex_unlock(&partition->mutex); + return ret; +} + + +static struct cpuoffline_attribute current_governor = + __ATTR(current_governor, (S_IRUGO | S_IWUSR), current_governor_show, + current_governor_store); + +static struct cpuoffline_attribute available_governors = + __ATTR_RO(available_governors); + +static struct attribute *partition_default_attrs[] = { + ¤t_governor.attr, + &available_governors.attr, + NULL, +}; + +static const struct sysfs_ops partition_ops = { + .show = partition_show, + .store = partition_store, +}; + +static void cpuoffline_partition_release(struct kobject *kobj) +{ + struct cpuoffline_partition *partition; + + partition = container_of(kobj, struct cpuoffline_partition, kobj); + + complete(&partition->kobj_unregister); +} + +static struct kobj_type partition_ktype = { + .sysfs_ops = &partition_ops, + .default_attrs = partition_default_attrs, + .release = cpuoffline_partition_release, +}; + +/* cpu class sysdev device registration */ + +static int cpuoffline_add_dev_interface(struct cpuoffline_partition *partition, + struct sys_device *sys_dev) +{ + int ret = 0; + char name[MAX_CPU_LEN]; + struct kobject *kobj; + + /* create cpuoffline directory for this CPU */ + /*ret = kobject_init_and_add(&kobj, &ktype_device, + &sys_dev->kobj, "%s", "cpuoffline");*/ + kobj = kobject_create_and_add("cpuoffline", &sys_dev->kobj); + + if (!kobj) { + pr_warning("%s: failed to create cpuoffline dir for cpu %d\n", + __func__, sys_dev->id); + return -ENOMEM; + } + +#ifdef CONFIG_CPU_OFFLINE_STATISTICS + /* XXX set up per-CPU statistics here, which is ktype_device */ + /* create directory for cpuoffline stats */ +#endif + + /* create a symlink from this cpu to its partition */ + ret = sysfs_create_link(kobj, &partition->kobj, 
"partition"); + + if (ret) + pr_warning("%s: failed to create symlink from cpu %d to partition %d\n", + __func__, sys_dev->id, partition->id); + + /* create a symlink from this cpu's partition to itself */ + snprintf(name, MAX_CPU_LEN, "cpu%d", sys_dev->id); + ret = sysfs_create_link(&partition->kobj, kobj, name); + + if (ret) + pr_warning("%s: failed to create symlink from partition %d to cpu %d\n", + __func__, partition->id, sys_dev->id); + + return 0; +} + +static int cpuoffline_add_partition_interface( + struct cpuoffline_partition *partition) +{ + return kobject_init_and_add(&partition->kobj, &partition_ktype, + cpuoffline_global_kobject, "%s%d", "partition", + partition->id); +} + +struct cpuoffline_partition *cpuoffline_partition_init(unsigned int cpu) +{ + int ret = -ENOMEM; + struct cpuoffline_partition *partition; + + partition = kzalloc(sizeof(struct cpuoffline_partition), + GFP_KERNEL); + if (!partition) + goto out; + + if (!zalloc_cpumask_var(&partition->cpus, GFP_KERNEL)) + goto err_free_partition; + + /* start populating ->cpus with this cpu first */ + cpumask_copy(partition->cpus, cpumask_of(cpu)); + + mutex_init(&partition->mutex); + + /* helps sysfs look pretty */ + partition->id = nr_partitions++; + + ret = cpuoffline_driver->init(partition); + + if (ret) { + pr_err("%s: failed to init driver\n", __func__); + goto err_free_cpus; + } + + /* create directory in sysfs for this partition */ + ret = cpuoffline_add_partition_interface(partition); + + /* decrement partition->kobj if the above returns error */ + if (ret) { + pr_warn("%s: failed to create partition interface\n", __func__); + kobject_put(&partition->kobj); + } + + return partition; + +err_free_cpus: + nr_partitions--; + free_cpumask_var(partition->cpus); +err_free_partition: + kfree(partition); +out: + return (void *)ret; +} + +/* does not need locking because sequence is synchronous and orderly */ +static int cpuoffline_add_dev(struct sys_device *sys_dev) +{ + unsigned int cpu = 
sys_dev->id; + int ret = 0; + struct cpuoffline_partition *partition; + + /* sanity checks */ + if (cpu_is_offline(cpu)) + pr_notice("%s: CPU%d is offline\n", __func__, cpu); + + if (!cpuoffline_driver) + return -EINVAL; + + partition = per_cpu(cpuoffline_partition, cpu); + + /* + * The first cpu in each partition to hit this function will allocate + * partition and populate partition's address into the per-cpu data for + * each of the CPUs in the same. It is up to the driver->init function + * to do this since only the CPUoffline platform driver knows the + * desired topology. + * + * When the other CPUs in a partition hit this path, their partition + * wll have already been allocated. Only thing left to do is set up + * sysfs entries. + */ + if (!partition) { + partition = cpuoffline_partition_init(cpu); + + if (IS_ERR(partition)) { + pr_warn("%s: failed to create partition\n", __func__); + return -ENOMEM; + } + } + + ret = cpuoffline_add_dev_interface(partition, sys_dev); + + return ret; +} + +static int cpuoffline_remove_dev(struct sys_device *sys_dev) +{ + pr_err("%s: GETTING REMOVED!\n", __func__); + return 0; +} + +static struct sysdev_driver cpuoffline_sysdev_driver = { + .add = cpuoffline_add_dev, + .remove = cpuoffline_remove_dev, +}; + +/* driver registration API */ + +int cpuoffline_register_driver(struct cpuoffline_driver *driver) +{ + int ret = 0; + + pr_info("CPUoffline: registering %s driver", driver->name); + + if (!driver) + return -EINVAL; + + mutex_lock(&cpuoffline_driver_mutex); + + /* there can only be one */ + if (cpuoffline_driver) + ret = -EBUSY; + else + cpuoffline_driver = driver; + + mutex_unlock(&cpuoffline_driver_mutex); + + if (ret) + goto out; + + /* register every CPUoffline device */ + ret = sysdev_driver_register(&cpu_sysdev_class, + &cpuoffline_sysdev_driver); + +out: + return ret; +} +EXPORT_SYMBOL_GPL(cpuoffline_register_driver); + +/* FIXME - should this be allowed? 
*/ +int cpuoffline_unregister_driver(struct cpuoffline_driver *driver) +{ + pr_info("CPUoffline: unregistering %s driver\n", driver->name); + + return 0; +} +EXPORT_SYMBOL_GPL(cpuoffline_unregister_driver); + +/* default driver - single partition containing all CPUs */ + +#ifdef CONFIG_CPU_OFFLINE_DEFAULT_DRIVER +/** + * cpuoffline_default_driver_init - create a single partition with all CPUs + * @partition: CPUoffline partition that is yet to be populated + * + * A CPUoffline driver's init function is responsible for two pieces of data. + * First, for every CPU that should be in @partition, the driver init function + * must populate a per-cpu pointer to that partition. Second, for every CPU + * that should be in @partition, the driver init function must set that bit in + * the @partition->cpus cpumask. + */ +int cpuoffline_default_driver_init(struct cpuoffline_partition *partition) +{ + unsigned int cpu; + + /* sanity checks */ + if (!partition) + return -EINVAL; + + cpu = cpumask_first(partition->cpus); + + /* CPU0 should be the only CPU in the mask */ + if (cpu) + return -EINVAL; + + for_each_possible_cpu(cpu) { + per_cpu(cpuoffline_partition, cpu) = partition; + cpumask_set_cpu(cpu, partition->cpus); + } + + return 0; +} + +int cpuoffline_default_driver_exit(struct cpuoffline_partition *partition) +{ + return 0; +} + +static struct cpuoffline_driver cpuoffline_default_driver = { + .name = "default", + .init = cpuoffline_default_driver_init, + .exit = cpuoffline_default_driver_exit, +}; + +static int __init cpuoffline_register_default_driver(void) +{ + return cpuoffline_register_driver(&cpuoffline_default_driver); +} +late_initcall(cpuoffline_register_default_driver); +#endif + +/* CPUoffline governor registration */ +int cpuoffline_register_governor(struct cpuoffline_governor *governor) +{ + int ret; + + if (!governor) + return -EINVAL; + + mutex_lock(&cpuoffline_governor_mutex); + + ret = -EBUSY; + if (__find_governor(governor->name) == NULL) { + ret = 0; + 
list_add(&governor->governor_list, &cpuoffline_governor_list); + } + + mutex_unlock(&cpuoffline_governor_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(cpuoffline_register_governor); + + +/* CPUoffline core initialization */ + +static int __init cpuoffline_core_init(void) +{ + int cpu; + + pr_info("%s\n", __func__); + for_each_possible_cpu(cpu) { + per_cpu(cpuoffline_partition, cpu) = NULL; + } + + cpuoffline_global_kobject = kobject_create_and_add("cpuoffline", + &cpu_sysdev_class.kset.kobj); + + WARN_ON(!cpuoffline_global_kobject); + /*register_syscore_ops(&cpuoffline_syscore_ops);*/ + + return 0; +} +core_initcall(cpuoffline_core_init); diff --git a/include/linux/cpuoffline.h b/include/linux/cpuoffline.h new file mode 100644 index 0000000..0c5b9a5 --- /dev/null +++ b/include/linux/cpuoffline.h @@ -0,0 +1,82 @@ +/* + * cpuoffline.h + * + * Copyright (C) 2011 Texas Instruments, Inc. + * Mike Turquette mturquette@ti.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include <linux/cpu.h> +#include <linux/mutex.h> + +#ifndef _LINUX_CPUOFFLINE_H +#define _LINUX_CPUOFFLINE_H + +#define MAX_NAME_LEN 16 + +//DECLARE_PER_CPU(struct cpuoffline_partition *, cpuoffline_partition); +//DECLARE_PER_CPU(int, cpuoffline_can_offline); + +struct cpuoffline_partition; + +struct cpuoffline_governor { + char name[MAX_NAME_LEN]; + struct list_head governor_list; + /*struct mutex mutex;*/ + struct module *owner; + int (*start)(struct cpuoffline_partition *partition); + int (*stop)(struct cpuoffline_partition *partition); + struct kobject kobj; +}; + +/** + * cpuoffline_parition - set of CPUs affected by a CPUoffline governor + * + * @cpus - bitmask of CPUs managed by this partition + * @cpus_can_offline - bitmask of CPUs in this partition that can go offline + * @min_cpus_online - limit how many CPUs are offline for performance + * @max_cpus_online - limits how many CPUs are online for power capping + * @cpuoffline_governor - governor policy for hotplugging CPUs + */ +struct cpuoffline_partition { + int id; + char name[MAX_NAME_LEN]; + cpumask_var_t cpus; + /*cpumask_var_t cpus_can_offline;*/ + int min_cpus_online; + /*int max_cpus_online;*/ + struct cpuoffline_governor *governor; + + struct kobject kobj; + struct completion kobj_unregister; + + struct mutex mutex; + + void * private_data; +}; + +struct cpuoffline_driver { + char name[MAX_NAME_LEN]; + int (*init)(struct cpuoffline_partition *partition); + int (*exit)(struct cpuoffline_partition *partition); +}; + +/* kobject/sysfs definitions */ +struct cpuoffline_attribute { + struct attribute attr; + ssize_t (*show)(struct cpuoffline_partition *partition, char *buf); + ssize_t (*store)(struct cpuoffline_partition *partition, + const char *buf, size_t count); +}; + +/* registration functions */ + +int cpuoffline_register_governor(struct cpuoffline_governor *governor); +void cpuoffline_unregister_governor(struct cpuoffline_governor *governor); + +int cpuoffline_register_driver(struct 
cpuoffline_driver *driver); +int cpuoffline_unregister_driver(struct cpuoffline_driver *driver); +#endif
---
 drivers/cpuoffline/Makefile            |    2 +-
 drivers/cpuoffline/governors/Kconfig   |    9 +
 drivers/cpuoffline/governors/Makefile  |    2 +
 drivers/cpuoffline/governors/avgload.c |  255 ++++++++++++++++++++++++++++++++
 4 files changed, 267 insertions(+), 1 deletions(-)
 create mode 100644 drivers/cpuoffline/governors/Kconfig
 create mode 100644 drivers/cpuoffline/governors/Makefile
 create mode 100644 drivers/cpuoffline/governors/avgload.c
diff --git a/drivers/cpuoffline/Makefile b/drivers/cpuoffline/Makefile
index 0b5aa59..ca3277a 100644
--- a/drivers/cpuoffline/Makefile
+++ b/drivers/cpuoffline/Makefile
@@ -1,2 +1,2 @@
 # CPUoffline core
-obj-$(CONFIG_CPU_OFFLINE) += cpuoffline.o
+obj-$(CONFIG_CPU_OFFLINE) += cpuoffline.o governors/
diff --git a/drivers/cpuoffline/governors/Kconfig b/drivers/cpuoffline/governors/Kconfig
new file mode 100644
index 0000000..5ec9d64
--- /dev/null
+++ b/drivers/cpuoffline/governors/Kconfig
@@ -0,0 +1,9 @@
+config CPU_OFFLINE_GOVERNOR_AVGLOAD
+	bool "CPUoffline Avgload governor"
+	depends on CPU_OFFLINE
+	help
+	  A simple governor that puts CPUs online or offline based on
+	  CPU load statistics.  It will always leave one CPU online in a
+	  partition.
+
+	  If in doubt, say N.
diff --git a/drivers/cpuoffline/governors/Makefile b/drivers/cpuoffline/governors/Makefile
new file mode 100644
index 0000000..5d990a0
--- /dev/null
+++ b/drivers/cpuoffline/governors/Makefile
@@ -0,0 +1,2 @@
+# CPUoffline governors
+obj-$(CONFIG_CPU_OFFLINE) += avgload.o
diff --git a/drivers/cpuoffline/governors/avgload.c b/drivers/cpuoffline/governors/avgload.c
new file mode 100644
index 0000000..0185d45
--- /dev/null
+++ b/drivers/cpuoffline/governors/avgload.c
@@ -0,0 +1,255 @@
+/*
+ * CPU Offline Average Load governor
+ *
+ * Copyright (C) 2011 Texas Instruments, Inc.
+ * Mike Turquette <mturquette@ti.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/cpuoffline.h>
+#include <linux/slab.h>
+#include <linux/hrtimer.h>
+#include <linux/tick.h>
+#include <linux/cpumask.h>
+
+#include <asm/cputime.h>
+
+#define AVGLOAD_DEFAULT_SAMPLING_RATE		1000000
+#define AVGLOAD_DEFAULT_ONLINE_THRESHOLD	80
+#define AVGLOAD_DEFAULT_OFFLINE_THRESHOLD	20
+
+static DEFINE_MUTEX(avgload_mutex);
+
+struct avgload_instance {
+	struct cpuoffline_partition *partition;
+	cputime64_t prev_time_wall;
+	struct delayed_work work;
+	struct mutex timer_mutex;
+	int sampling_rate;
+	int online_threshold;
+	int offline_threshold;
+};
+
+struct avgload_cpu_data {
+	cputime64_t prev_time_idle;
+	bool offline;
+};
+
+/* XXX this seems pretty inefficient... */
+static DEFINE_PER_CPU(struct avgload_cpu_data, avgload_data);
+
+static void avgload_do_work(struct avgload_instance *instance)
+{
+	unsigned int cpu;
+	cputime64_t cur_time_wall, cur_time_idle;
+	cputime64_t delta_wall, delta_idle;
+	u64 load = 0;
+	struct cpuoffline_partition *partition;
+	struct cpumask mask;
+
+	/* validate before dereferencing instance->partition */
+	if (!instance || !instance->partition) {
+		pr_warning("%s: data does not exist\n", __func__);
+		return;
+	}
+	partition = instance->partition;
+
+	/* find CPUs in this partition that are online */
+	cpumask_and(&mask, cpu_online_mask, partition->cpus);
+
+	/* this should only happen if CPUs are offlined from userspace */
+	if (!cpumask_weight(&mask)) {
+		pr_err("%s: no cpus are online in this partition, aborting\n",
+				__func__);
+		return;
+	}
+
+	/* determine load for all online CPUs in the partition */
+	for_each_cpu(cpu, &mask) {
+		cur_time_idle = get_cpu_idle_time_us(cpu, &cur_time_wall);
+
+		delta_wall = cputime64_sub(cur_time_wall,
+				instance->prev_time_wall);
+		delta_idle = cputime64_sub(cur_time_idle,
+				per_cpu(avgload_data, cpu).prev_time_idle);
+
+		per_cpu(avgload_data, cpu).prev_time_idle = cur_time_idle;
+
+		/* rollover happens often when bringing a CPU back online */
+		if (!delta_wall || delta_wall < delta_idle)
+			continue;
+
+		/* aggregate load */
+		delta_idle = 100 * (delta_wall - delta_idle);
+		do_div(delta_idle, delta_wall);
+		load += delta_idle;
+	}
+
+	/* save last timestamp for next iteration */
+	instance->prev_time_wall = cur_time_wall;
+
+	/* average the load */
+	do_div(load, cpumask_weight(&mask));
+
+	/* bring a cpu back online */
+	if (load > instance->online_threshold) {
+		/* which CPUs are offline? */
+		cpumask_complement(&mask, cpu_online_mask);
+
+		/* which offline CPUs are in this partition? */
+		cpumask_and(&mask, &mask, partition->cpus);
+
+		/* which offline CPUs in this partition can hotplug? */
+		cpumask_and(&mask, &mask, cpu_hotpluggable_mask);
+
+		/* bail out if all CPUs are online */
+		if (!cpumask_weight(&mask))
+			return;
+
+		/* pick a "random" CPU to bring online */
+		cpu = cpumask_any(&mask);
+
+		cpu_up(cpu);
+
+		return;
+	}
+
+	/* take a cpu offline */
+	if (load < instance->offline_threshold) {
+		/* can any of the online CPUs in this partition hotplug? */
+		cpumask_and(&mask, &mask, cpu_hotpluggable_mask);
+
+		if (!cpumask_weight(&mask))
+			return;
+
+		/* pick a "random" CPU to go offline */
+		cpu = cpumask_any(&mask);
+
+		cpu_down(cpu);
+
+		return;
+	}
+}
+
+static void do_avgload_timer(struct work_struct *work)
+{
+	int delay;
+	struct avgload_instance *instance =
+		container_of(work, struct avgload_instance, work.work);
+
+	mutex_lock(&instance->timer_mutex);
+
+	/* do the work */
+	avgload_do_work(instance);
+
+	delay = usecs_to_jiffies(instance->sampling_rate);
+	schedule_delayed_work(&instance->work, delay);
+
+	mutex_unlock(&instance->timer_mutex);
+}
+
+static void avgload_timer_init(struct avgload_instance *instance)
+{
+	int delay = usecs_to_jiffies(instance->sampling_rate);
+
+	INIT_DELAYED_WORK_DEFERRABLE(&instance->work, do_avgload_timer);
+	schedule_delayed_work(&instance->work, delay);
+}
+
+static void avgload_timer_exit(struct avgload_instance *instance)
+{
+	cancel_delayed_work_sync(&instance->work);
+}
+
+static int cpuoffline_avgload_start(struct cpuoffline_partition *partition)
+{
+	struct cpuoffline_governor *gov;
+	struct avgload_instance *instance;
+	struct avgload_cpu_data *cpu_data;
+	int cpu;
+
+	gov = partition->governor;
+	if (!gov) {
+		pr_err("%s: no governor\n", __func__);
+		return -EINVAL;
+	}
+
+	instance = kmalloc(sizeof(*instance), GFP_KERNEL);
+	if (!instance)
+		return -ENOMEM;
+
+	mutex_lock(&avgload_mutex);
+
+	/* initialize defaults */
+	instance->sampling_rate = AVGLOAD_DEFAULT_SAMPLING_RATE;
+	instance->online_threshold = AVGLOAD_DEFAULT_ONLINE_THRESHOLD;
+	instance->offline_threshold = AVGLOAD_DEFAULT_OFFLINE_THRESHOLD;
+
+	/* remember who we are */
+	instance->partition = partition;
+	partition->private_data = instance;
+
+	/* populate idle times before kicking off the workqueue */
+	for_each_cpu(cpu, partition->cpus) {
+		cpu_data = &per_cpu(avgload_data, cpu);
+
+		cpu_data->prev_time_idle = (cputime_t) get_cpu_idle_time_us(cpu,
+				&instance->prev_time_wall);
+	}
+
+	/* XXX initialize sysfs stuff here */
+
+	mutex_unlock(&avgload_mutex);
+
+	mutex_init(&instance->timer_mutex);
+	avgload_timer_init(instance);
+
+	return 0;
+}
+
+static int cpuoffline_avgload_stop(struct cpuoffline_partition *partition)
+{
+	struct avgload_instance *instance = partition->private_data;
+
+	if (!instance)
+		return -EINVAL;
+
+	/*
+	 * do not hold timer_mutex here: the work callback takes it, so
+	 * cancel_delayed_work_sync() would deadlock against a running work
+	 */
+	avgload_timer_exit(instance);
+
+	mutex_lock(&partition->mutex);
+	partition->private_data = NULL;
+	mutex_unlock(&partition->mutex);
+
+	kfree(instance);
+
+	return 0;
+}
+
+struct cpuoffline_governor cpuoffline_governor_avgload = {
+	.name = "avgload",
+	.owner = THIS_MODULE,
+	.start = cpuoffline_avgload_start,
+	.stop = cpuoffline_avgload_stop,
+};
+
+static int __init cpuoffline_avgload_init(void)
+{
+	pr_notice("%s: registering avgload\n", __func__);
+	return cpuoffline_register_governor(&cpuoffline_governor_avgload);
+}
+
+static void __exit cpuoffline_avgload_exit(void)
+{
+	pr_notice("%s: unregistering avgload\n", __func__);
+	cpuoffline_unregister_governor(&cpuoffline_governor_avgload);
+}
+
+MODULE_AUTHOR("Mike Turquette <mturquette@ti.com>");
+MODULE_DESCRIPTION("cpuoffline_avgload - offline CPUs based on their load");
+MODULE_LICENSE("GPL");
+
+module_init(cpuoffline_avgload_init);
+module_exit(cpuoffline_avgload_exit);
---
 arch/arm/Kconfig |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2c71a8f..5804b21 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1980,6 +1980,8 @@ endif
source "drivers/cpuidle/Kconfig"
+source "drivers/cpuoffline/Kconfig"
+
 endmenu
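For context, the drivers/cpuoffline/Kconfig being sourced here is added by the "cpuoffline core" patch and is not shown in this hunk. A plausible minimal shape for it, with purely illustrative symbol names (the actual options in the series may differ), would be:

```kconfig
# Illustrative sketch only -- the real file ships in the cpuoffline core patch
menuconfig CPU_OFFLINE
	bool "CPU offline framework"
	depends on HOTPLUG_CPU
	help
	  Framework for taking CPUs offline and back online via the hotplug
	  mechanism, driven by a per-partition governor.

if CPU_OFFLINE
source "drivers/cpuoffline/governors/Kconfig"
endif
```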
menu "Floating point emulation"
On Fri, Aug 19, 2011 at 12:38:13PM -0700, Mike Turquette wrote:
> This series is posted for posterity. It has been NAK'd by the community
> since CPU hotplug has been deemed an inappropriate mechanism for power
> capping.
I spoke with Amit about this last week. What's the plan going forward on really saving power when a CPU isn't being used -- just ensuring the lowest possible idle state is really off?
And are there situations where the hardware requires something like hotplug to actually maximize savings?
On Mon, Aug 22, 2011 at 11:18:52AM -0300, Christian Robottom Reis wrote:
> On Fri, Aug 19, 2011 at 12:38:13PM -0700, Mike Turquette wrote:
> > This series is posted for posterity. It has been NAK'd by the community
> > since CPU hotplug has been deemed an inappropriate mechanism for power
> > capping.
>
> I spoke with Amit about this last week. What's the plan going forward on
> really saving power when a CPU isn't being used -- just ensuring the
> lowest possible idle state is really off?
>
> And are there situations where the hardware requires something like
> hotplug to actually maximize savings?
Oh, and finally, it seems that patches 1-3 might be useful, regardless of the NAKing of the actual patchset, right?
On Mon, Aug 22, 2011 at 5:21 PM, Christian Robottom Reis <kiko@linaro.org> wrote:
> On Mon, Aug 22, 2011 at 11:18:52AM -0300, Christian Robottom Reis wrote:
> > On Fri, Aug 19, 2011 at 12:38:13PM -0700, Mike Turquette wrote:
> > > This series is posted for posterity. It has been NAK'd by the community
> > > since CPU hotplug has been deemed an inappropriate mechanism for power
> > > capping.
> >
> > I spoke with Amit about this last week. What's the plan going forward on
> > really saving power when a CPU isn't being used -- just ensuring the
> > lowest possible idle state is really off?
> >
> > And are there situations where the hardware requires something like
> > hotplug to actually maximize savings?
>
> Oh, and finally, it seems that patches 1-3 might be useful, regardless
> of the NAKing of the actual patchset, right?
No, those were NACK'ed as well. The interface is not to be 'enhanced' in any way that might promote its use for power savings.
Peter's actual message below:
"Nacked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
the kernel really shouldn't be using hotplug for this (nor should userspace really). hot-unplugging random cpus wrecks things like cpusets. Furthermore hotplug does way too much work to use as a simple means to idle a cpu.
Even the availability of this mask is wrong, since that implies the information is useful, which per the above it is not, the kernel shouldn't care about this full-stop.
The only reason for the OS to unplug a CPU is imminent and unavoidable hardware failure. Thermal capping is not that (and yes ACPI-4.0 is a broken piece of shit)."
On Mon, Aug 22, 2011 at 06:52:33PM +0300, Amit Kucheria wrote:
> "Nacked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> the kernel really shouldn't be using hotplug for this (nor should
> userspace really). hot-unplugging random cpus wrecks things like
> cpusets. Furthermore hotplug does way too much work to use as a simple
> means to idle a cpu.
>
> Even the availability of this mask is wrong, since that implies the
> information is useful, which per the above it is not, the kernel
> shouldn't care about this full-stop.
>
> The only reason for the OS to unplug a CPU is imminent and unavoidable
> hardware failure. Thermal capping is not that (and yes ACPI-4.0 is a
> broken piece of shit)."
Thanks! This is a good opener for the TSC session I suggested. :-)