It is often useful to compute statistics on a set of data, especially when the values arrive as a stream (measured over time, for instance).
Instead of duplicating the statistics formulas inside each subsystem, this small library makes the basic statistics functions available to the whole kernel.
The library is designed to do the minimum amount of computation when a new value is added: only value storage, array shifting and accumulation are done.
The average, variance and standard deviation are computed on demand via the corresponding functions.
The statistics library can handle up to 65536 values in the range -2^24 to 2^24 - 1. These limits are large and it does not really make sense for kernel code to use the statistics at them, so it is up to the developer to use the library wisely.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org --- include/linux/stats.h | 29 +++++++ lib/Makefile | 3 +- lib/stats.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 266 insertions(+), 1 deletion(-) create mode 100644 include/linux/stats.h create mode 100644 lib/stats.c
diff --git a/include/linux/stats.h b/include/linux/stats.h new file mode 100644 index 0000000..664eb30 --- /dev/null +++ b/include/linux/stats.h @@ -0,0 +1,29 @@ +/* + * include/linux/stats.h + */ +#ifndef _LINUX_STATS_H +#define _LINUX_STATS_H + +struct stats { + s64 sum; /* sum of values */ + u64 sq_sum; /* sum of the square values */ + s32 *values; /* array of values */ + s32 min; /* minimal value of the entire series */ + s32 max; /* maximal value of the entire series */ + unsigned int n; /* current number of values */ + unsigned int w_ptr; /* current window pointer */ + unsigned short len; /* size of the value array */ +}; + +extern s32 stats_max(struct stats *s); +extern s32 stats_min(struct stats *s); +extern s32 stats_mean(struct stats *s); +extern u32 stats_variance(struct stats *s); +extern u32 stats_stddev(struct stats *s); +extern void stats_add(struct stats *s, s32 val); +extern void stats_reset(struct stats *s); +extern void stats_free(struct stats *s); +extern struct stats *stats_alloc(unsigned short len); +extern unsigned short stats_n(struct stats *s); + +#endif /* _LINUX_STATS_H */ diff --git a/lib/Makefile b/lib/Makefile index 13a7c6a..18460d2 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -26,7 +26,8 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \ bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \ - percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o + percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \ + stats.o obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/stats.c b/lib/stats.c new file mode 100644 index 0000000..f5425d1 --- /dev/null +++ b/lib/stats.c @@ -0,0 +1,235 @@ +/* + * Implementation of basics statistics functions useful to compute on + * a stream of data. 
+ * + * Copyright: (C) 2015-2016 Linaro Limited + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ +#include <linux/export.h> +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/stats.h> +#include <linux/types.h> + +/** + * stats_n - number of elements present in the statistics + * + * @s: the statistic structure + * + * Returns an unsigned short representing the number of values in the + * statistics series + */ +unsigned short stats_n(struct stats *s) +{ + return s->n; +} + +/** + * stats_max - maximal value of the series + * + * @s: the statistic structure + * + * Returns a s32 representing the maximum value of the series + */ +s32 stats_max(struct stats *s) +{ + return s->max; +} +EXPORT_SYMBOL_GPL(stats_max); + +/** + * stats_min - minimal value of the series + * + * @s: the statistic structure + * + * Returns a s32 representing the minimal value of the series + */ +s32 stats_min(struct stats *s) +{ + return s->min; +} +EXPORT_SYMBOL_GPL(stats_min); + +/** + * stats_mean - compute the average + * + * @s: the statistics structure + * + * Returns an s32 corresponding to the mean value, or zero if there is + * no data + */ +s32 stats_mean(struct stats *s) +{ + return s->n ? s->sum / s->n : 0; +} +EXPORT_SYMBOL_GPL(stats_mean); + +/** + * stats_variance - compute the variance + * + * @s: the statistic structure + * + * Returns an u32 corresponding to the variance, or zero if there is no + * data + */ +u32 stats_variance(struct stats *s) +{ + s32 mean = stats_mean(s); + return s->n ? 
(s->sq_sum / s->n) - (mean * mean) : 0; +} +EXPORT_SYMBOL_GPL(stats_variance); + +/** + * stats_stddev - compute the standard deviation + * + * @s: the statistic structure + * + * Returns an u32 corresponding to the standard deviation, or zero if + * there is no data + */ +u32 stats_stddev(struct stats *s) +{ + return int_sqrt(stats_variance(s)); +} +EXPORT_SYMBOL_GPL(stats_stddev); + +/** + * stats_add - add a new value in the statistic structure + * + * @s: the statistic structure + * @value: the new value to be added, max 2^24 - 1 + * + * Adds the value to the array, if the array is full, then the array + * is shifted. + * + */ +void stats_add(struct stats *s, s32 value) +{ + /* + * In order to prevent an overflow in the statistic code, we + * limit the value to 2^24 - 1, if it is greater we just + * ignore it with a WARN_ON_ONCE letting know to userspace we + * are dealing with out-of-range values. + */ + if (WARN_ON_ONCE(value >= ((2<<24) - 1))) + return; + + /* + * Insert the value in the array. If the array is already + * full, shift the values and add the value at the end of the + * series, otherwise add the value directly. + */ + if (likely(s->len == s->n)) { + s->sum -= s->values[s->w_ptr]; + s->sq_sum -= s->values[s->w_ptr] * s->values[s->w_ptr]; + s->values[s->w_ptr] = value; + s->w_ptr = (s->w_ptr + 1) % s->len; + } else { + s->values[s->n] = value; + s->n++; + } + + /* + * Keep track of the min and max. + */ + s->min = min(s->min, value); + s->max = max(s->max, value); + + /* + * In order to reduce the overhead and to prevent value + * derivation due to the integer computation, we just sum the + * value and do the division when the average and the variance + * are requested. 
+ */ + s->sum += value; + s->sq_sum += value * value; +} +EXPORT_SYMBOL_GPL(stats_add); + +/** + * stats_reset - reset the stats + * + * @s: the statistic structure + * + * Reset the statistics and reset the values + * + */ +void stats_reset(struct stats *s) +{ + s->sum = s->sq_sum = s->n = s->w_ptr = 0; + s->max = S32_MIN; + s->min = S32_MAX; +} +EXPORT_SYMBOL_GPL(stats_reset); + +/** + * stats_alloc - allocate a structure for statistics + * + * @len: The number of items in the array which is limited by the + * unsigned short type. That allows to prevent overflow in all + * the statistics code. + * + * Allocates memory to store the different values and initialize the + * structure. + * + * In order to prevent an overflow in the computation, the maximum + * allowed number of values is 65536 and each value max is 2^24 - 1. + * + * The variance is the sum of the square value of the difference of + * each value to the average. The variance is a u64 (square values are + * always positives), so that gives a maximum of 18446744073709551615. + * We can store 65536 values, so: + * + * 18446744073709551615 / 65536 = 281474976710655 + * + * ... is the square max value we can have, hence the difference to + * the mean is max sqrt(281474976710655) = 16777215 (2^24 -1) + * + * Even if these values are not realistics (statistic in the kernel is + * for a few hundred values, large dispersion in the integer limits is + * very very rare so the sum won't be very high, or high integers + * series means low variance), we prevent any overflow in the code and + * we are safe. + * + * Returns a valid pointer a struct stats, NULL if the memory + * allocation fails. 
+ */ +struct stats *stats_alloc(unsigned short len) +{ + struct stats *s; + s32 *values; + + s = kzalloc(sizeof(*s), GFP_KERNEL); + if (s) { + values = kzalloc(sizeof(*values) * len, GFP_KERNEL); + if (!values) { + kfree(s); + return NULL; + } + + s->values = values; + s->len = len; + s->min = S32_MAX; + s->max = S32_MIN; + } + + return s; +} +EXPORT_SYMBOL_GPL(stats_alloc); + +/** + * stats_free - free the statistics structure + * + * @s : the statistics structure + * + * Frees the memory allocated by the function stats_alloc. + */ +void stats_free(struct stats *s) +{ + kfree(s->values); + kfree(s); +} +EXPORT_SYMBOL_GPL(stats_free); -- 1.9.1
The interrupt framework provides a lot of information and statistics about each interrupt (time accounting, counters, ...).
Unfortunately there is no way to measure when interrupts occur and build a mathematical model of their behavior that could help predict their next occurrence.
This framework allows for registering a callback function that is invoked when an interrupt occurs.
Each time an interrupt occurs, the callback will be called with the interval corresponding to the duration between two interrupts.
This framework allows a subsystem to register a handler in order to receive the timing information for the registered interrupt. That gives other subsystems the ability to compute predictions for the next interrupt occurrence.
The main objective is to track and detect the periodic interrupts in order to predict the next event on a cpu and anticipate the sleeping time when entering idle. This fine-grained approach makes it possible to simplify and rationalize wake-up event prediction without IPI interference, letting the scheduler be smarter about wakeup IPIs during idle periods.
In the proof of concept, the irq timings tracking showed an improvement in the predictions. The approach seems correct, but my knowledge of the irq subsystem is limited: I am not sure this way of measuring irq time intervals is correct or acceptable, so the patch is at the RFC stage (minus some polishing).
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org --- include/linux/interrupt.h | 45 ++++++++++++++++++++++++++++++++ include/linux/irqdesc.h | 3 +++ kernel/irq/Kconfig | 4 +++ kernel/irq/handle.c | 12 +++++++++ kernel/irq/manage.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 128 insertions(+), 1 deletion(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index be7e75c..f48e8ff 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -123,6 +123,51 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+#ifdef CONFIG_IRQ_TIMINGS +/** + * timing handler to be called when an interrupt happens + */ +typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *, void *); + +/** + * struct irqtimings - per interrupt irq timings descriptor + * @handler: interrupt handler timings function + * @data: pointer to the private data to be passed to the handler + * @timestamp: latest interruption occurence + */ +struct irqtimings { + irqt_handler_t handler; + void *data; +} ____cacheline_internodealigned_in_smp; + +/** + * struct irqt_ops - structure to be used by the subsystem to call the + * register and unregister ops when an irq is setup or freed. + * @setup: registering callback + * @free: unregistering callback + * + * The callbacks assumes the lock is held on the irq desc + */ +struct irqtimings_ops { + int (*setup)(unsigned int, struct irqaction *); + void (*free)(unsigned int, void *); +}; + +extern int register_irq_timings(struct irqtimings_ops *ops); +extern int setup_irq_timings(unsigned int irq, struct irqaction *act); +extern void free_irq_timings(unsigned int irq, void *dev_id); +#else +static inline int setup_irq_timings(unsigned int irq, struct irqaction *act) +{ + return 0; +} + +static inline void free_irq_timings(unsigned int irq, void *dev_id) +{ + ; +} +#endif + extern int __must_check request_threaded_irq(unsigned int irq, irq_handler_t handler, irq_handler_t thread_fn, diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h index a587a33..e0d4263 100644 --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -51,6 +51,9 @@ struct irq_desc { #ifdef CONFIG_IRQ_PREFLOW_FASTEOI irq_preflow_handler_t preflow_handler; #endif +#ifdef CONFIG_IRQ_TIMINGS + struct irqtimings *timings; +#endif struct irqaction *action; /* IRQ action list */ unsigned int status_use_accessors; unsigned int core_internal_state__do_not_mess_with_it; diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index 9a76e3b..1275fd1 100644 --- a/kernel/irq/Kconfig +++ 
b/kernel/irq/Kconfig @@ -73,6 +73,10 @@ config GENERIC_MSI_IRQ_DOMAIN config HANDLE_DOMAIN_IRQ bool
+config IRQ_TIMINGS + bool + default y + config IRQ_DOMAIN_DEBUG bool "Expose hardware/virtual IRQ mapping via debugfs" depends on IRQ_DOMAIN && DEBUG_FS diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c index e25a83b..ca8b0c5 100644 --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -132,6 +132,17 @@ void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action) wake_up_process(action->thread); }
+#ifdef CONFIG_IRQ_TIMINGS +void handle_irqt_event(struct irqtimings *irqt, struct irqaction *action) +{ + if (irqt) + irqt->handler(action->irq, ktime_get(), + action->dev_id, irqt->data); +} +#else +#define handle_irqt_event(a, b) +#endif + irqreturn_t handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action) { @@ -165,6 +176,7 @@ handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action) /* Fall through to add to randomness */ case IRQ_HANDLED: flags |= action->flags; + handle_irqt_event(desc->timings, action); break;
default: diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index f9a59f6..21cc7bf 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1017,6 +1017,60 @@ static void irq_release_resources(struct irq_desc *desc) c->irq_release_resources(d); }
+#ifdef CONFIG_IRQ_TIMINGS +/* + * Global variable, only used by accessor functions, currently only + * one user is allowed and it is up to the caller to make sure to + * setup the irq timings which are already setup. + */ +static struct irqtimings_ops *irqtimings_ops; + +/** + * register_irq_timings - register the ops when an irq is setup or freed + * + * @ops: the register/unregister ops to be called when at setup or + * free time + * + * Returns -EBUSY if the slot is already in use, zero on success. + */ +int register_irq_timings(struct irqtimings_ops *ops) +{ + if (irqtimings_ops) + return -EBUSY; + + irqtimings_ops = ops; + + return 0; +} + +/** + * setup_irq_timings - call the timing register callback + * + * @desc: an irq desc structure + * + * Returns -EINVAL in case of error, zero on success. + */ +int setup_irq_timings(unsigned int irq, struct irqaction *act) +{ + if (irqtimings_ops && irqtimings_ops->setup) + return irqtimings_ops->setup(irq, act); + return 0; +} + +/** + * free_irq_timings - call the timing unregister callback + * + * @irq: the interrupt number + * @dev_id: the device id + * + */ +void free_irq_timings(unsigned int irq, void *dev_id) +{ + if (irqtimings_ops && irqtimings_ops->free) + irqtimings_ops->free(irq, dev_id); +} +#endif /* CONFIG_IRQ_TIMINGS */ + /* * Internal function to register an irqaction - typically used to * allocate special interrupts that are part of the architecture. @@ -1037,6 +1091,9 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) if (!try_module_get(desc->owner)) return -ENODEV;
+ ret = setup_irq_timings(irq, new); + if (ret) + goto out_mput; /* * Check whether the interrupt nests into another interrupt * thread. @@ -1045,7 +1102,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) if (nested) { if (!new->thread_fn) { ret = -EINVAL; - goto out_mput; + goto out_free_timings; } /* * Replace the primary handler which was provided from @@ -1323,6 +1380,10 @@ out_thread: kthread_stop(t); put_task_struct(t); } + +out_free_timings: + free_irq_timings(irq, new->dev_id); + out_mput: module_put(desc->owner); return ret; @@ -1408,6 +1469,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
unregister_handler_proc(irq, action);
+ free_irq_timings(irq, dev_id); + /* Make sure it's not being used on another CPU: */ synchronize_irq(irq);
-- 1.9.1
Many IRQs are quiet most of the time, or they tend to come in bursts of fairly equal time intervals within each burst. It is therefore possible to detect those IRQs with stable intervals and guesstimate when the next IRQ event is most likely to happen.
Examples of such IRQs may include audio related IRQs where the FIFO size and/or DMA descriptor size with the sample rate create stable intervals, block devices during large data transfers, etc. Even network streaming of multimedia content creates patterns of periodic network interface IRQs in some cases.
This patch adds code to track the mean interval and variance for each IRQ over a window of time intervals between IRQ events. Those statistics can be used to assist cpuidle in selecting the most appropriate sleep state by predicting the most likely time for the next interrupt.
Because the stats are gathered in interrupt context, the core computation is as light as possible.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org --- drivers/cpuidle/Kconfig | 5 + kernel/irq/Kconfig | 1 - kernel/sched/Makefile | 1 + kernel/sched/idle-sched.c | 409 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 415 insertions(+), 1 deletion(-) create mode 100644 kernel/sched/idle-sched.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 8c7930b..dd17215 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -25,6 +25,11 @@ config CPU_IDLE_GOV_MENU bool "Menu governor (for tickless system)" default y
+config CPU_IDLE_GOV_SCHED + bool "Sched idle governor" + select IRQ_TIMINGS + default y + config DT_IDLE_STATES bool
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index 1275fd1..81557ae 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -75,7 +75,6 @@ config HANDLE_DOMAIN_IRQ
config IRQ_TIMINGS bool - default y
config IRQ_DOMAIN_DEBUG bool "Expose hardware/virtual IRQ mapping via debugfs" diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 6768797..f7d5a35 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c new file mode 100644 index 0000000..bfbd07f --- /dev/null +++ b/kernel/sched/idle-sched.c @@ -0,0 +1,409 @@ +/* + * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano daniel.lezcano@linaro.org + * Nicolas Pitre nicolas.pitre@linaro.org + * + */ +#include <linux/cpuidle.h> +#include <linux/interrupt.h> +#include <linux/irqdesc.h> +#include <linux/ktime.h> +#include <linux/slab.h> +#include <linux/stats.h> +#include <linux/tick.h> +#include <linux/time64.h> + +/* + * Define the number of values we will dealing with to the monitor the + * interruptions. + */ +#define STATS_MAX_VALUE 8 + +struct wakeup { + struct stats *stats; + ktime_t timestamp; + int predictable; + int good; + int bad; +}; + +/* + * Per cpu and irq statistics. Each cpu receives interrupts and those + * ones can be distributed following an irq chip specific + * algorithm. Random irq distribution is the worst case to predict + * interruption behavior but usually that does not happen or could be + * fixed from userspace by setting the irq affinity. + */ +static DEFINE_PER_CPU(struct wakeup, *wakesup[NR_IRQS]); + +/** + * sched_idle_irq - irq timestamp callback + * + * @irq: the irq number + * @timestamp: when the interrupt occured + * @dev_id: device id for shared interrupt (not yet used) + * @data: private data given when registering this callback + * + * Interrupt callback called when an interrupt happens. 
This function + * is critical as it is called under an interrupt section: minimum + * operations as possible are done here. + * + */ +static void sched_idle_irq(unsigned int irq, ktime_t timestamp, + void *dev_id, void *data) +{ + s32 diff; + u32 stddev, mean; + unsigned int cpu = raw_smp_processor_id(); + struct wakeup *w = per_cpu(wakesup[irq], cpu); + + if (!w) + return; + + /* + * It is the first time the interrupt occurs, we can't do any stats, + * just store the timestamp and exit. + */ + if (!ktime_to_us(w->timestamp)) { + w->timestamp = timestamp; + return; + } + + /* + * Microsec resolution is enough for our purpose. + */ + diff = ktime_us_delta(timestamp, w->timestamp); + w->timestamp = timestamp; + + /* + * If ~one second elapsed since the last interrupt, we can + * just drop the statistics and set the wake up source as non + * predictable. We have to wait for the next interval to + * continue filling the stats. + */ + if (diff > (1 << 20)) { + stats_reset(w->stats); + w->predictable = 0; + return; + } + + mean = stats_mean(w->stats); + stddev = stats_stddev(w->stats); + stats_add(w->stats, diff); + + /* + * We check the value is between the mean - stddev and mean + stddev + * Until the value is in this interval, we are in a repeating pattern and + * the wakeup is considered predictable. + */ + if (diff > (mean - stddev) && diff < (mean + stddev)) { + w->predictable = 1; + w->good++; + } else { + w->bad++; + } +} + +/* + * Callback to be called when an interrupt happens. + */ +static struct irqtimings irq_timings = { + .handler = sched_idle_irq, +}; + +static ktime_t next_irq_event(void) +{ + unsigned int irq, cpu = raw_smp_processor_id(); + ktime_t diff, next, min = (ktime_t){ .tv64 = KTIME_MAX }; + struct wakeup *w; + + /* + * Lookup the interrupt array for this cpu and search for the earlier + * expected interruption. 
+ */ + for (irq = 0; irq < NR_IRQS; irq++) { + w = per_cpu(wakesup[irq], cpu); + + /* + * The interrupt was not setup as a source of a wakeup + * or the wakeup source is not considered at this + * moment stable enough to do a prediction. + */ + if (!w || !w->predictable) + continue; + + /* + * Let's compute the next irq: the wakeup source is + * considered predictable, we add the average interval + * time added to the latest interruption event time. + */ + next = ktime_add_us(w->timestamp, stats_mean(w->stats)); + + /* + * If the interrupt is supposed to happen before the + * minimum time, then it becomse the minimum itself. + */ + if (ktime_before(next, min)) + min = next; + } + + /* + * At this point, we have our prediction but the caller is + * expecting the remaining time before the next event, so + * compute the diff. + */ + diff = ktime_sub(min, ktime_get()); + + /* + * The result could be negative for different reasons: + * - the prediction is incorrect + * - the prediction was too near now and expired while we were + * in this function + * + * In both cases, we return KTIME_MAX as a failure to do a + * prediction + */ + if (ktime_compare(diff, ktime_set(0, 0)) <= 0) + return (ktime_t){ .tv64 = KTIME_MAX }; + + return diff; +} + +/** + * sched_idle_next_wakeup - Predict the next wakeup on the current cpu + * + * The next event on the cpu is based on a statistic approach of the + * interrupt events and the timer deterministic value. From the timer + * or the irqs, we return the one expected to occur first. + * + * Returns the expected remaining idle time before being woken up by + * an interruption. 
+ */ +s64 sched_idle_next_wakeup(void) +{ + s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length()); + s64 next_irq = ktime_to_us(next_irq_event()); + + return min(next_irq, next_timer); +} + +/** + * sched_idle - go to idle for a specified amount of time + * + * @duration: the idle duration time + * @latency: the latency constraint + * + * Returns 0 on success, < 0 otherwise. + */ +int sched_idle(s64 duration, unsigned int latency) +{ + struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev); + struct cpuidle_state_usage *su; + struct cpuidle_state *s; + int i, ret = 0, index = -1; + + rcu_idle_enter(); + + /* + * No cpuidle driver is available, let's use the default arch + * idle function. + */ + if (cpuidle_not_available(drv, dev)) + goto default_idle; + + /* + * Find the idle state with the lowest power while satisfying + * our constraints. We will save energy if the duration of the + * idle time is bigger than the target residency which is the + * break even point. The choice will be modulated by the + * latency. + */ + for (i = 0; i < drv->state_count; i++) { + + s = &drv->states[i]; + + su = &dev->states_usage[i]; + + if (s->disabled || su->disable) + continue; + if (s->target_residency > duration) + continue; + if (s->exit_latency > latency) + continue; + + index = i; + } + + /* + * The idle task must be scheduled, it is pointless to go to + * idle, just re-enable the interrupt and return. + */ + if (current_clr_polling_and_test()) { + local_irq_enable(); + goto out; + } + + if (index < 0) { + /* + * No idle callbacks fulfilled the constraints, jump + * to the default function like there wasn't any + * cpuidle driver. + */ + goto default_idle; + } else { + /* + * Enter the idle state previously returned by the + * governor decision. 
This function will block until + * an interrupt occurs and will take care of + * re-enabling the local interrupts + */ + return cpuidle_enter(drv, dev, index); + } + +default_idle: + default_idle_call(); +out: + rcu_idle_exit(); + return ret; +} + +/** + * sched_irq_timing_free - free_irq callback + * + * @irq: the irq number to stop tracking + * @dev_id: not used at the moment + * + * This function will remove from the wakeup source prediction table. + */ +static void sched_irq_timing_free(unsigned int irq, void *dev_id) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct wakeup *w; + unsigned int cpu; + unsigned long flags; + + for_each_possible_cpu(cpu) { + w = per_cpu(wakesup[irq], cpu); + + if (!w) + continue; + + if (w->stats) + stats_free(w->stats); + + kfree(w); + + per_cpu(wakesup[irq], cpu) = NULL; + } + + raw_spin_lock_irqsave(&desc->lock, flags); + desc->timings = NULL; + raw_spin_unlock_irqrestore(&desc->lock, flags); +} + + +/** + * sched_irq_timing_setup - setup_irq callback + * + * @irq: the interrupt numbe to be tracked + * @act: the new irq action to be set to this interrupt + * + * Update the irq table to be tracked in order to predict the next event. + * + * Returns zero on success. On error it returns -ENOMEM. + */ +static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct wakeup *w; + unsigned int cpu; + unsigned long flags; + int ret = -ENOMEM; + + /* + * No interrupt set for this descriptor or related to a timer. + * Timers are deterministic, so no need to try to do any + * prediction on them. No error for both cases, we are just not + * interested. + */ + if (!act || (act->flags & __IRQF_TIMER)) + return 0; + + /* + * Allocates the wakeup structure and the stats structure. As + * the interrupt can occur on any cpu, allocate the wakeup + * structure per cpu basis. 
+ */ + for_each_possible_cpu(cpu) { + + w = kzalloc(sizeof(*w), GFP_KERNEL); + if (!w) + goto undo; + + w->stats = stats_alloc(STATS_MAX_VALUE); + if (!w->stats) { + kfree(w); + goto undo; + } + + per_cpu(wakesup[irq], cpu) = w; + } + + raw_spin_lock_irqsave(&desc->lock, flags); + desc->timings = &irq_timings; + raw_spin_unlock_irqrestore(&desc->lock, flags); + + ret = 0; +undo: + /* + * Rollback all allocations on failure + */ + if (ret) + sched_irq_timing_free(irq, act->dev_id); + + return ret; +} + +/* + * Setup/free irq callbacks + */ +static struct irqtimings_ops irqt_ops = { + .setup = sched_irq_timing_setup, + .free = sched_irq_timing_free, +}; + +/** + * sched_idle_init - setup the interrupt tracking table + * + * At init time, some interrupts could have been setup and in the + * system life time, some devices could be setup. In order to track + * all interrupts we are interested in, we first register a couple of + * callback to keep up-to-date the interrupt tracking table and then + * we setup the table with the interrupt which were already set up. + */ +int __init sched_idle_init(void) +{ + struct irq_desc *desc; + unsigned int irq; + int ret; + + /* + * Register the setup/free irq callbacks, so new interrupt or + * freed interrupt will update their tracking. + */ + ret = register_irq_timings(&irqt_ops); + if (ret) { + pr_err("Failed to register timings ops\n"); + return ret; + } + + /* + * For all the irq already setup, assign the timing callback. + * All interrupts with their desc NULL will be discarded. + */ + for_each_irq_desc(irq, desc) + sched_irq_timing_setup(irq, desc->action); + + return 0; +} +late_initcall(sched_idle_init); -- 1.9.1
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
Many IRQs are quiet most of the time, or they tend to come in bursts of fairly equal time intervals within each burst. It is therefore possible to detect those IRQs with stable intervals and guesstimate when the next IRQ event is most likely to happen.
Examples of such IRQs may include audio related IRQs where the FIFO size and/or DMA descriptor size with the sample rate create stable intervals, block devices during large data transfers, etc. Even network streaming of multimedia content creates patterns of periodic network interface IRQs in some cases.
This patch adds code to track the mean interval and variance for each IRQ over a window of time intervals between IRQ events. Those statistics can be used to assist cpuidle in selecting the most appropriate sleep state by predicting the most likely time for the next interrupt.
Because the stats are gathered in interrupt context, the core computation is as light as possible.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
I'm rather disappointed by this patch.
Comments below.
drivers/cpuidle/Kconfig | 5 + kernel/irq/Kconfig | 1 - kernel/sched/Makefile | 1 + kernel/sched/idle-sched.c | 409 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 415 insertions(+), 1 deletion(-) create mode 100644 kernel/sched/idle-sched.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index 8c7930b..dd17215 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -25,6 +25,11 @@ config CPU_IDLE_GOV_MENU bool "Menu governor (for tickless system)" default y
+config CPU_IDLE_GOV_SCHED
+	bool "Sched idle governor"
+	select IRQ_TIMINGS
+	default y
config DT_IDLE_STATES bool
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index 1275fd1..81557ae 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -75,7 +75,6 @@ config HANDLE_DOMAIN_IRQ
 config IRQ_TIMINGS
 	bool
- default y
config IRQ_DOMAIN_DEBUG bool "Expose hardware/virtual IRQ mapping via debugfs" diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 6768797..f7d5a35 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c new file mode 100644 index 0000000..bfbd07f --- /dev/null +++ b/kernel/sched/idle-sched.c @@ -0,0 +1,409 @@ +/*
+ * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano <daniel.lezcano@linaro.org>
+ *                                Nicolas Pitre <nicolas.pitre@linaro.org>
+ */
+#include <linux/cpuidle.h> +#include <linux/interrupt.h> +#include <linux/irqdesc.h> +#include <linux/ktime.h> +#include <linux/slab.h> +#include <linux/stats.h> +#include <linux/tick.h> +#include <linux/time64.h>
+/*
+ * Define the number of values we will be dealing with to monitor
+ * the interrupts.
+ */
+#define STATS_MAX_VALUE 8
+struct wakeup {
+	struct stats *stats;
+	ktime_t timestamp;
+	int predictable;
+	int good;
+	int bad;
+};
+/*
+ * Per cpu and irq statistics. Each cpu receives interrupts and those
+ * ones can be distributed following an irq chip specific
+ * algorithm. Random irq distribution is the worst case to predict
+ * interrupt behavior but usually that does not happen or could be
+ * fixed from userspace by setting the irq affinity.
+ */
+static DEFINE_PER_CPU(struct wakeup, *wakesup[NR_IRQS]);
+/**
+ * sched_idle_irq - irq timestamp callback
+ * @irq: the irq number
+ * @timestamp: when the interrupt occurred
+ * @dev_id: device id for shared interrupt (not yet used)
+ * @data: private data given when registering this callback
+ *
+ * Interrupt callback called when an interrupt happens. This function
+ * is critical as it runs in interrupt context: as few operations as
+ * possible are done here.
+ */
+static void sched_idle_irq(unsigned int irq, ktime_t timestamp,
+			   void *dev_id, void *data)
+{
+	s32 diff;
+	u32 stddev, mean;
+	unsigned int cpu = raw_smp_processor_id();
+	struct wakeup *w = per_cpu(wakesup[irq], cpu);
+
+	if (!w)
+		return;
+
+	/*
+	 * It is the first time the interrupt occurs, we can't do any
+	 * stats, just store the timestamp and exit.
+	 */
+	if (!ktime_to_us(w->timestamp)) {
+		w->timestamp = timestamp;
+		return;
+	}
+
+	/*
+	 * Microsec resolution is enough for our purpose.
+	 */
+	diff = ktime_us_delta(timestamp, w->timestamp);
+	w->timestamp = timestamp;
+
+	/*
+	 * If ~one second elapsed since the last interrupt, we can
+	 * just drop the statistics and set the wake up source as non
+	 * predictable. We have to wait for the next interval to
+	 * continue filling the stats.
+	 */
+	if (diff > (1 << 20)) {
+		stats_reset(w->stats);
+		w->predictable = 0;
+		return;
+	}
+
+	mean = stats_mean(w->stats);
+	stddev = stats_stddev(w->stats);
After my review of your stat lib code, you must realize that you've inserted the two most expensive operations of that lib right there with those two calls. And if you fix the flaws I highlighted earlier, they will become even more expensive.
Yet, I already provided you with equivalent code that was highly optimized to be acceptable in interrupt context. Quoting the commit message that came with my patch:
Because the stats are gathered in interrupt context, the core computation is as light as possible, turning into 3 subs, 3 adds and 6 mults, where 4 of those mults involve a small compile-time constant.
Here, with your version, you have literally increased the cost of the same computations by an order of magnitude if not more, especially on ARM32. I don't think this is ever going to be accepted upstream as it is. Given tglx's propensity for flaming people who submit suboptimal code, I'd refrain from posting this if I were you.
+	stats_add(w->stats, diff);
+
+	/*
+	 * We check whether the value is between mean - stddev and
+	 * mean + stddev. As long as the value is in this interval, we
+	 * are in a repeating pattern and the wakeup is considered
+	 * predictable.
+	 */
+	if (diff > (mean - stddev) && diff < (mean + stddev)) {
+		w->predictable = 1;
+		w->good++;
+	} else {
+		w->bad++;
+	}
+}
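As an aside for readers without the companion stats patch at hand, the reason stats_add() can be tolerated in this hot path at all is that a fixed-size window lets the accumulators be maintained in O(1) per sample. The following is a stand-alone sketch of that idea (hypothetical names and layout, not the actual lib/stats.c implementation):

```c
#include <stdint.h>

#define WINDOW 8

struct win_stats {
	int32_t values[WINDOW]; /* circular buffer of samples          */
	int64_t sum;            /* running sum of the window           */
	uint64_t sq_sum;        /* running sum of the squared samples  */
	unsigned int n;         /* number of valid samples (<= WINDOW) */
	unsigned int w;         /* next slot to overwrite              */
};

/*
 * O(1) insertion: replace the oldest sample and adjust the two
 * accumulators; no rescan of the array is ever needed, so mean and
 * variance remain cheap to derive later.
 */
static void win_stats_add(struct win_stats *s, int32_t val)
{
	int32_t old = s->values[s->w];

	if (s->n < WINDOW) {
		s->n++;
	} else {
		s->sum -= old;
		s->sq_sum -= (uint64_t)((int64_t)old * old);
	}

	s->values[s->w] = val;
	s->sum += val;
	s->sq_sum += (uint64_t)((int64_t)val * val);
	s->w = (s->w + 1) % WINDOW;
}
```

With the accumulators in place, the mean is sum / n and the variance is sq_sum / n - (sum / n)^2, both derivable without touching the array again.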
+/*
+ * Callback to be called when an interrupt happens.
+ */
+static struct irqtimings irq_timings = {
+	.handler = sched_idle_irq,
+};
+static ktime_t next_irq_event(void)
+{
+	unsigned int irq, cpu = raw_smp_processor_id();
+	ktime_t diff, next, min = (ktime_t){ .tv64 = KTIME_MAX };
+	struct wakeup *w;
+
+	/*
+	 * Look up the interrupt array for this cpu and search for the
+	 * earliest expected interrupt.
+	 */
+	for (irq = 0; irq < NR_IRQS; irq++) {
+
+		w = per_cpu(wakesup[irq], cpu);
+
+		/*
+		 * The interrupt was not setup as a source of a wakeup
+		 * or the wakeup source is not considered at this
+		 * moment stable enough to do a prediction.
+		 */
+		if (!w || !w->predictable)
+			continue;
+
+		/*
+		 * Let's compute the next irq: the wakeup source is
+		 * considered predictable, so add the average interval
+		 * time to the latest interrupt event time.
+		 */
+		next = ktime_add_us(w->timestamp, stats_mean(w->stats));
+
+		/*
+		 * If the interrupt is supposed to happen before the
+		 * minimum time, then it becomes the minimum itself.
+		 */
+		if (ktime_before(next, min))
+			min = next;
+	}
+
+	/*
+	 * At this point, we have our prediction but the caller is
+	 * expecting the remaining time before the next event, so
+	 * compute the diff.
+	 */
+	diff = ktime_sub(min, ktime_get());
+
+	/*
+	 * The result could be negative for different reasons:
+	 *  - the prediction is incorrect
+	 *  - the prediction was too near now and expired while we
+	 *    were in this function
+	 *
+	 * In both cases, we return KTIME_MAX as a failure to do a
+	 * prediction.
+	 */
+	if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
+		return (ktime_t){ .tv64 = KTIME_MAX };
+
+	return diff;
+}
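Stripped of the ktime_t plumbing, the per-irq arithmetic in next_irq_event() boils down to one addition and one subtraction. A plain C sketch of that step (hypothetical helper with microsecond integers, INT64_MAX standing in for KTIME_MAX):

```c
#include <stdint.h>

#define PREDICTION_FAILED INT64_MAX /* stand-in for KTIME_MAX */

/*
 * The next event of a predictable irq is its last timestamp plus the
 * mean interval. The caller wants the remaining time from "now"; a
 * non-positive result means the prediction already expired (or was
 * wrong), which is reported as a failed prediction.
 */
static int64_t remaining_us(int64_t last_timestamp_us, int64_t mean_us,
			    int64_t now_us)
{
	int64_t diff = (last_timestamp_us + mean_us) - now_us;

	return diff <= 0 ? PREDICTION_FAILED : diff;
}
```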
+/**
+ * sched_idle_next_wakeup - Predict the next wakeup on the current cpu
+ *
+ * The next event on the cpu is based on a statistic approach of the
+ * interrupt events and on the timer deterministic value. From the
+ * timer or the irqs, we return the one expected to occur first.
+ *
+ * Returns the expected remaining idle time before being woken up by
+ * an interrupt.
+ */
+s64 sched_idle_next_wakeup(void)
+{
+	s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
+	s64 next_irq = ktime_to_us(next_irq_event());
+
+	return min(next_irq, next_timer);
+}
+/**
+ * sched_idle - go to idle for a specified amount of time
+ * @duration: the idle duration time
+ * @latency: the latency constraint
+ *
+ * Returns 0 on success, < 0 otherwise.
+ */
+int sched_idle(s64 duration, unsigned int latency)
+{
+	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
+	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+	struct cpuidle_state_usage *su;
+	struct cpuidle_state *s;
+	int i, ret = 0, index = -1;
+
+	rcu_idle_enter();
+
+	/*
+	 * No cpuidle driver is available, let's use the default arch
+	 * idle function.
+	 */
+	if (cpuidle_not_available(drv, dev))
+		goto default_idle;
+
+	/*
+	 * Find the idle state with the lowest power while satisfying
+	 * our constraints. We will save energy if the duration of the
+	 * idle time is bigger than the target residency, which is the
+	 * break even point. The choice will be modulated by the
+	 * latency.
+	 */
+	for (i = 0; i < drv->state_count; i++) {
+
+		s = &drv->states[i];
+		su = &dev->states_usage[i];
+
+		if (s->disabled || su->disable)
+			continue;
+		if (s->target_residency > duration)
+			continue;
+		if (s->exit_latency > latency)
+			continue;
+
+		index = i;
+	}
+
+	/*
+	 * The idle task must be scheduled, it is pointless to go to
+	 * idle, just re-enable the interrupts and return.
+	 */
+	if (current_clr_polling_and_test()) {
+		local_irq_enable();
+		goto out;
+	}
+
+	if (index < 0) {
+		/*
+		 * No idle callbacks fulfilled the constraints, jump
+		 * to the default function like there wasn't any
+		 * cpuidle driver.
+		 */
+		goto default_idle;
+	} else {
+		/*
+		 * Enter the idle state previously returned by the
+		 * governor decision. This function will block until
+		 * an interrupt occurs and will take care of
+		 * re-enabling the local interrupts.
+		 */
+		return cpuidle_enter(drv, dev, index);
+	}
+
+default_idle:
+	default_idle_call();
+out:
+	rcu_idle_exit();
+	return ret;
+}
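The state selection loop above can be exercised in isolation. A user-space sketch (hypothetical struct, not the real struct cpuidle_state) showing that the loop naturally keeps the deepest usable state, since deeper states come later in the driver's array:

```c
#include <stddef.h>

struct idle_state {
	unsigned int target_residency; /* us: break-even sleep length */
	unsigned int exit_latency;     /* us: worst-case wakeup cost  */
	int disabled;
};

/*
 * Return the index of the deepest enabled state whose break-even
 * residency fits the predicted sleep duration and whose exit latency
 * respects the QoS constraint, or -1 if none qualifies.
 */
static int select_state(const struct idle_state *states, size_t count,
			long long duration, unsigned int latency)
{
	int index = -1;
	size_t i;

	for (i = 0; i < count; i++) {
		if (states[i].disabled)
			continue;
		if (states[i].target_residency > duration)
			continue;
		if (states[i].exit_latency > latency)
			continue;
		index = (int)i; /* later (deeper) states overwrite earlier ones */
	}

	return index;
}
```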
+/**
+ * sched_irq_timing_free - free_irq callback
+ * @irq: the irq number to stop tracking
+ * @dev_id: not used at the moment
+ *
+ * This function will remove the irq from the wakeup source
+ * prediction table.
+ */
+static void sched_irq_timing_free(unsigned int irq, void *dev_id)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct wakeup *w;
+	unsigned int cpu;
+	unsigned long flags;
+
+	for_each_possible_cpu(cpu) {
+
+		w = per_cpu(wakesup[irq], cpu);
+		if (!w)
+			continue;
+
+		if (w->stats)
+			stats_free(w->stats);
+
+		kfree(w);
+		per_cpu(wakesup[irq], cpu) = NULL;
+	}
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	desc->timings = NULL;
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+/**
+ * sched_irq_timing_setup - setup_irq callback
+ * @irq: the interrupt number to be tracked
+ * @act: the new irq action to be set to this interrupt
+ *
+ * Update the irq table to be tracked in order to predict the next event.
+ *
+ * Returns zero on success. On error it returns -ENOMEM.
+ */
+static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct wakeup *w;
+	unsigned int cpu;
+	unsigned long flags;
+	int ret = -ENOMEM;
+
+	/*
+	 * No interrupt set for this descriptor or related to a timer.
+	 * Timers are deterministic, so no need to try to do any
+	 * prediction on them. No error for both cases, we are just
+	 * not interested.
+	 */
+	if (!act || (act->flags & __IRQF_TIMER))
+		return 0;
+
+	/*
+	 * Allocate the wakeup structure and the stats structure. As
+	 * the interrupt can occur on any cpu, allocate the wakeup
+	 * structure on a per cpu basis.
+	 */
+	for_each_possible_cpu(cpu) {
+
+		w = kzalloc(sizeof(*w), GFP_KERNEL);
+		if (!w)
+			goto undo;
+
+		w->stats = stats_alloc(STATS_MAX_VALUE);
+		if (!w->stats) {
+			kfree(w);
+			goto undo;
+		}
+
+		per_cpu(wakesup[irq], cpu) = w;
+	}
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	desc->timings = &irq_timings;
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	ret = 0;
+undo:
+	/*
+	 * Roll back all allocations on failure.
+	 */
+	if (ret)
+		sched_irq_timing_free(irq, act->dev_id);
+
+	return ret;
+}
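The allocate-or-rollback shape of sched_irq_timing_setup() is a common kernel pattern: either every CPU gets its tracking structure or none does. A minimal user-space sketch of the same pattern, with malloc/free standing in for the per-cpu allocations (the caller zero-initializes the slot array so the rollback can free unconditionally):

```c
#include <stdlib.h>

#define NCPUS 4

/*
 * Allocate one tracking slot per CPU; on any failure, undo every
 * allocation made so far and report the error. free(NULL) being a
 * no-op makes the rollback loop safe for untouched slots.
 */
static int setup_all(void *slots[NCPUS])
{
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		slots[cpu] = malloc(64);
		if (!slots[cpu])
			goto undo;
	}
	return 0;

undo:
	for (cpu = 0; cpu < NCPUS; cpu++) {
		free(slots[cpu]);
		slots[cpu] = NULL;
	}
	return -1;
}
```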
+/*
+ * Setup/free irq callbacks.
+ */
+static struct irqtimings_ops irqt_ops = {
+	.setup = sched_irq_timing_setup,
+	.free = sched_irq_timing_free,
+};
+/**
+ * sched_idle_init - setup the interrupt tracking table
+ *
+ * At init time, some interrupts could already have been set up and,
+ * during the system's lifetime, more devices could be set up. In
+ * order to track all interrupts we are interested in, we first
+ * register a couple of callbacks to keep the interrupt tracking
+ * table up to date, and then we fill the table with the interrupts
+ * which were already set up.
+ */
+int __init sched_idle_init(void)
+{
+	struct irq_desc *desc;
+	unsigned int irq;
+	int ret;
+
+	/*
+	 * Register the setup/free irq callbacks, so newly set up or
+	 * freed interrupts will update their tracking.
+	 */
+	ret = register_irq_timings(&irqt_ops);
+	if (ret) {
+		pr_err("Failed to register timings ops\n");
+		return ret;
+	}
+
+	/*
+	 * For all the irqs already set up, assign the timing callback.
+	 * All interrupts with a NULL desc will be discarded.
+	 */
+	for_each_irq_desc(irq, desc)
+		sched_irq_timing_setup(irq, desc->action);
+
+	return 0;
+}
+late_initcall(sched_idle_init);
1.9.1
On 12/10/2015 06:34 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
Many IRQs are quiet most of the time, or they tend to come in bursts of fairly equal time intervals within each burst. It is therefore possible to detect those IRQs with stable intervals and guesstimate when the next IRQ event is most likely to happen.
Examples of such IRQs may include audio related IRQs where the FIFO size and/or DMA descriptor size with the sample rate create stable intervals, block devices during large data transfers, etc. Even network streaming of multimedia content creates patterns of periodic network interface IRQs in some cases.
This patch adds code to track the mean interval and variance for each IRQ over a window of time intervals between IRQ events. Those statistics can be used to assist cpuidle in selecting the most appropriate sleep state by predicting the most likely time for the next interrupt.
Because the stats are gathered in interrupt context, the core computation is as light as possible.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
I'm rather disappointed by this patch.
So sorry for that.
Comments below.
[ ... ]
+	if (diff > (1 << 20)) {
+		stats_reset(w->stats);
+		w->predictable = 0;
+		return;
+	}
+
+	mean = stats_mean(w->stats);
+	stddev = stats_stddev(w->stats);
After my review of your stat lib code, you must realize that you've inserted the two most expensive operations of that lib right there with those two calls. And if you fix the flaws I highlighted earlier, they will become even more expensive.
Yet, I already provided you with equivalent code that was highly optimized to be acceptable in interrupt context. Quoting the commit message that came with my patch:
Because the stats are gathered in interrupt context, the core computation is as light as possible, turning into 3 subs, 3 adds and 6 mults, where 4 of those mults involve a small compile-time constant.
Here, with your version, you have literally increased the cost of the same computations by an order of magnitude if not more, especially on ARM32. I don't think this is ever going to be accepted upstream as it is. Given tglx's propensity for flaming people who submit suboptimal code, I'd refrain from posting this if I were you.
I agree.
I will change the code to do this computation when searching for the next wakeup event, and probably replace the stddev check with the optimization you did: comparing the squared deviation from the mean with the variance, instead of comparing the deviation with the square root of the variance. That will save us the int_sqrt and reduce the number of operations in the irq handler.
Thanks for the review.
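The optimization being discussed can be sketched outside the kernel. Since |diff - mean| < stddev is equivalent to (diff - mean)^2 < variance (up to the integer rounding that int_sqrt introduces), the predictability test needs no square root at all. An illustrative stand-alone function (hypothetical name, not the lib/stats.c API):

```c
#include <stdint.h>

/*
 * Decide whether a new interval lies within one standard deviation of
 * the mean by comparing squared quantities, so no int_sqrt() is ever
 * needed in the hot path.
 */
static int within_one_stddev(int32_t diff, int32_t mean, uint64_t variance)
{
	int64_t d = (int64_t)diff - (int64_t)mean;

	/* |d| < stddev  <=>  d * d < variance */
	return (uint64_t)(d * d) < variance;
}
```

The 64-bit intermediate keeps d * d from overflowing for any pair of 32-bit inputs.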
-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:34 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
[ ... ]
I agree.
I will change the code to have this computation done when searching the next wakeup event and probably replace the stddev by the optimization you did by comparing squared mean values with the variance instead of the mean with the square root of the variance. That will save us the int_sqrt and will reduce the number of operations in the irq handler.
Thanks for the review.
I still have to ask: why did you not use my code?
I do see that you borrowed some of its structure, but you completely ripped out all the code that made this computation extremely efficient.
Nicolas
On 12/10/2015 11:39 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:34 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
[ ... ]
I still have to ask: why did you not use my code?
I do see that you borrowed some of its structure, but you completely ripped out all the code that made this computation extremely efficient.
That comes from the "librarification" of the code: I just reworked the code to move it into the library.
On Fri, 11 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 11:39 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:34 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
[ ... ]
I still have to ask: why did you not use my code?
I do see that you borrowed some of its structure, but you completely ripped out all the code that made this computation extremely efficient.
That comes from the "librarification" of the code: I just reworked the code to move it into the library.
IMHO this is a bad tradeoff.
Not only this prevents all the optimizations I implemented, but this is also premature to create a lib if there is only one user.
Nicolas
On 12/11/2015 02:31 PM, Nicolas Pitre wrote:
On Fri, 11 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 11:39 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:34 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
[ ... ]
I still have to ask: why did you not use my code?
I do see that you borrowed some of its structure, but you completely ripped out all the code that made this computation extremely efficient.
That comes from the "librarification" of the code: I just reworked the code to move it into the library.
IMHO this is a bad tradeoff.
Not only this prevents all the optimizations I implemented, but this is also premature to create a lib if there is only one user.
Ok, I will follow your opinion and rework the code along the lines of your original version, with the optimizations.
Thanks -- Daniel
From: Nicolas Pitre nicolas.pitre@linaro.org
Signed-off-by: Nicolas Pitre nico@linaro.org Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org --- include/trace/events/irq.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/idle-sched.c | 4 ++++ 2 files changed, 48 insertions(+)
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h index ff8f6c0..fd25da5 100644 --- a/include/trace/events/irq.h +++ b/include/trace/events/irq.h @@ -99,6 +99,50 @@ TRACE_EVENT(irq_handler_exit, __entry->irq, __entry->ret ? "handled" : "unhandled") );
+#ifdef CONFIG_IRQ_TIMINGS
+/**
+ * irq_timings - provide updated IRQ timing statistics
+ * @irq: irq number
+ * @interval: time interval since last irq
+ * @stddev: time interval stddev
+ * @mean: mean interval
+ * @good: current count of predictable irqs
+ * @bad: current count of unpredictable irqs
+ */
+TRACE_EVENT(irq_timings,
+
+	TP_PROTO(int irq, int cpu, s32 interval, s32 mean,
+		 u32 stddev, u32 good, u32 bad),
+
+	TP_ARGS(irq, cpu, interval, mean, stddev, good, bad),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	irq	)
+		__field(	unsigned int,	cpu	)
+		__field(	s32,		interval	)
+		__field(	s32,		mean	)
+		__field(	u32,		stddev	)
+		__field(	u32,		good	)
+		__field(	u32,		bad	)
+	),
+
+	TP_fast_assign(
+		__entry->irq = irq;
+		__entry->cpu = cpu;
+		__entry->interval = interval;
+		__entry->mean = mean;
+		__entry->stddev = stddev;
+		__entry->good = good;
+		__entry->bad = bad;
+	),
+
+	TP_printk("irq=%u/cpu=%u intv=%d mean=%d stddev=%u (%u vs %u)",
+		  __entry->irq, __entry->cpu, __entry->interval,
+		  __entry->mean, __entry->stddev, __entry->good,
+		  __entry->bad)
+);
+#endif
+
 DECLARE_EVENT_CLASS(softirq,
TP_PROTO(unsigned int vec_nr), diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c index bfbd07f..f38ba67 100644 --- a/kernel/sched/idle-sched.c +++ b/kernel/sched/idle-sched.c @@ -12,6 +12,8 @@ #include <linux/tick.h> #include <linux/time64.h>
+#include <trace/events/irq.h>
+
 /*
  * Define the number of values we will be dealing with to monitor
  * the interrupts.
@@ -101,6 +103,8 @@ static void sched_idle_irq(unsigned int irq, ktime_t timestamp,
 	} else {
 		w->bad++;
 	}
+
+	trace_irq_timings(irq, cpu, diff, mean, stddev, w->good, w->bad);
 }
/* -- 1.9.1
This patch plugs the sched-idle code into the idle loop. It is experimental code and consequently it disables the suspend-to-idle path, as that one is not yet handled by sched-idle.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org --- kernel/sched/idle-sched.c | 30 ++++++++++++++++++++++++++++++ kernel/sched/idle.c | 11 +++++++++-- kernel/sched/sched.h | 20 ++++++++++++++++++++ 3 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c index f38ba67..97dc8a6 100644 --- a/kernel/sched/idle-sched.c +++ b/kernel/sched/idle-sched.c @@ -37,6 +37,36 @@ struct wakeup { */ static DEFINE_PER_CPU(struct wakeup, *wakesup[NR_IRQS]);
+/*
+ * Variable flag to switch from cpuidle to idle-sched.
+ */
+static int __read_mostly __sched_idle_enabled = 0;
+
+/*
+ * Enable the sched_idle subsystem.
+ */
+void sched_idle_enable(void)
+{
+	__sched_idle_enabled = 1;
+}
+
+/*
+ * Disable the sched_idle subsystem.
+ */
+void sched_idle_disable(void)
+{
+	__sched_idle_enabled = 0;
+}
+
+/*
+ * Accessor flag to check if this idle subsystem should be used
+ * instead of the old cpuidle framework.
+ */
+int sched_idle_enabled(void)
+{
+	return __sched_idle_enabled;
+}
+
 /**
  * sched_idle_irq - irq timestamp callback
  *
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 4a2ef5a..c3f9938 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -247,8 +247,15 @@ static void cpu_idle_loop(void)
 		 */
 		if (cpu_idle_force_poll || tick_check_broadcast_expired())
 			cpu_idle_poll();
-		else
-			cpuidle_idle_call();
+		else {
+			if (sched_idle_enabled()) {
+				int latency = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+				s64 duration = sched_idle_next_wakeup();
+				sched_idle(duration, latency);
+			} else {
+				cpuidle_idle_call();
+			}
+		}
 		arch_cpu_idle_exit();
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d2a119..4f80048 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -35,6 +35,26 @@ extern void update_cpu_load_active(struct rq *this_rq);
 static inline void update_cpu_load_active(struct rq *this_rq) { }
 #endif
+#ifdef CONFIG_CPU_IDLE_GOV_SCHED
+extern int sched_idle(s64 duration, unsigned int latency);
+extern s64 sched_idle_next_wakeup(void);
+extern int sched_idle_enabled(void);
+#else
+static inline int sched_idle(s64 duration, unsigned int latency)
+{
+	return 0;
+}
+
+static inline s64 sched_idle_next_wakeup(void)
+{
+	return 0;
+}
+
+static inline int sched_idle_enabled(void)
+{
+	return 0;
+}
+#endif
+
 /*
  * Helpers for converting nanosecond timing to jiffy resolution
  */
--
1.9.1
Despite the statistics around idle state usage already exposed in sysfs, some information is still missing to tune and check the cpuidle behavior.
Instead of polluting the sysfs directory with new debug information, let's create a debugfs file that can be extended to show interesting information about the cpuidle predictions.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org --- drivers/cpuidle/Kconfig | 7 ++ drivers/cpuidle/Makefile | 2 + drivers/cpuidle/cpuidle.c | 6 ++ drivers/cpuidle/debugfs.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++ drivers/cpuidle/debugfs.h | 19 +++++ include/linux/cpuidle.h | 15 ++++ kernel/sched/idle-sched.c | 8 +- 7 files changed, 246 insertions(+), 1 deletion(-) create mode 100644 drivers/cpuidle/debugfs.c create mode 100644 drivers/cpuidle/debugfs.h
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig index dd17215..8b0f6c5 100644 --- a/drivers/cpuidle/Kconfig +++ b/drivers/cpuidle/Kconfig @@ -30,6 +30,13 @@ config CPU_IDLE_GOV_SCHED select IRQ_TIMINGS default y
+config CPU_IDLE_DEBUG
+	bool "Debug information about cpuidle"
+	select DEBUG_FS
+	help
+	  Enables the debug filesystem showing information about the
+	  governor predictions.
+
 config DT_IDLE_STATES
 	bool
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile index 3ba81b1..cc2f750 100644 --- a/drivers/cpuidle/Makefile +++ b/drivers/cpuidle/Makefile @@ -6,6 +6,8 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
+obj-$(CONFIG_CPU_IDLE_DEBUG) += debugfs.o + ################################################################################## # ARM SoC drivers obj-$(CONFIG_ARM_MVEBU_V7_CPUIDLE) += cpuidle-mvebu-v7.o diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 17a6dc0..309aa0f 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -23,6 +23,7 @@ #include <linux/tick.h> #include <trace/events/power.h>
+#include "debugfs.h" #include "cpuidle.h"
DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices); @@ -230,6 +231,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, */ dev->states_usage[entered_state].time += dev->last_residency; dev->states_usage[entered_state].usage++; + cpuidle_debugfs_prediction(drv, dev, diff, entered_state); } else { dev->last_residency = 0; } @@ -646,6 +648,10 @@ static int __init cpuidle_init(void) { int ret;
+ ret = cpuidle_debugfs_init(); + if (ret) + return ret; + if (cpuidle_disabled()) return -ENODEV;
diff --git a/drivers/cpuidle/debugfs.c b/drivers/cpuidle/debugfs.c
new file mode 100644
index 0000000..c5aab98
--- /dev/null
+++ b/drivers/cpuidle/debugfs.c
@@ -0,0 +1,190 @@
+/*
+ * Debugfs support for cpuidle governors
+ *
+ * Copyright (C) 2015 Linaro : daniel.lezcano@linaro.org
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/cpu.h>
+#include <linux/cpuidle.h>
+#include <linux/debugfs.h>
+#include <linux/atomic.h>
+
+#include "cpuidle.h"
+
+static atomic_t correct;
+static atomic_t failed;
+static atomic_t early;
+static atomic_t late;
+static atomic_t total;
+
+static DEFINE_PER_CPU(atomic_t, _correct);
+static DEFINE_PER_CPU(atomic_t, _failed);
+static DEFINE_PER_CPU(atomic_t, _early);
+static DEFINE_PER_CPU(atomic_t, _late);
+static DEFINE_PER_CPU(atomic_t, _total);
+
+void cpuidle_debugfs_reset(void)
+{
+	int cpu;
+
+	atomic_set(&correct, 0);
+	atomic_set(&failed, 0);
+	atomic_set(&early, 0);
+	atomic_set(&late, 0);
+	atomic_set(&total, 0);
+
+	for_each_possible_cpu(cpu) {
+		atomic_set(&per_cpu(_correct, cpu), 0);
+		atomic_set(&per_cpu(_failed, cpu), 0);
+		atomic_set(&per_cpu(_early, cpu), 0);
+		atomic_set(&per_cpu(_late, cpu), 0);
+		atomic_set(&per_cpu(_total, cpu), 0);
+	}
+}
+
+void cpuidle_debugfs_correct_inc(int cpu)
+{
+	atomic_inc(&per_cpu(_total, cpu));
+	atomic_inc(&per_cpu(_correct, cpu));
+	atomic_inc(&correct);
+	atomic_inc(&total);
+}
+
+void cpuidle_debugfs_early_inc(int cpu)
+{
+	atomic_inc(&per_cpu(_total, cpu));
+	atomic_inc(&per_cpu(_early, cpu));
+	atomic_inc(&per_cpu(_failed, cpu));
+	atomic_inc(&early);
+	atomic_inc(&failed);
+	atomic_inc(&total);
+}
+
+void cpuidle_debugfs_late_inc(int cpu)
+{
+	atomic_inc(&per_cpu(_total, cpu));
+	atomic_inc(&per_cpu(_late, cpu));
+	atomic_inc(&per_cpu(_failed, cpu));
+	atomic_inc(&late);
+	atomic_inc(&failed);
+	atomic_inc(&total);
+}
+ssize_t cpuidle_debugfs_write_reset(struct file *file,
+				    const char __user *user_buf,
+				    size_t count, loff_t *ppos)
+{
+	ssize_t size;
+	bool reset = false;
+
+	file->private_data = &reset;
+
+	size = debugfs_write_file_bool(file, user_buf, count, ppos);
+	if (size > 0 && reset)
+		cpuidle_debugfs_reset();
+
+	return size;
+}
+
+void cpuidle_debugfs_prediction(struct cpuidle_driver *drv,
+				struct cpuidle_device *dev,
+				s64 duration, int state)
+{
+	if (drv->states[state].target_residency <= duration) {
+		/* It is not the last state, check against the next one */
+		if (state < drv->state_count - 1) {
+			/* The prediction was shorter than expected */
+			if (drv->states[state + 1].target_residency <= duration)
+				cpuidle_debugfs_early_inc(dev->cpu);
+			else
+				cpuidle_debugfs_correct_inc(dev->cpu);
+		} else {
+			cpuidle_debugfs_correct_inc(dev->cpu);
+		}
+	} else {
+		/* The prediction was longer than expected */
+		cpuidle_debugfs_late_inc(dev->cpu);
+	}
+}
+
+static const struct file_operations reset_fops = {
+	.open = simple_open,
+	.write = cpuidle_debugfs_write_reset,
+	.llseek = seq_lseek,
+};
+
+static int cpuidle_debugfs_stats_show(struct seq_file *m, void *v)
+{
+	int cpu;
+
+	seq_printf(m, "#cpu\tcorrect\tearly\tlate\tfailed\ttotal\n");
+
+	seq_printf(m, "all\t");
+	seq_printf(m, "%d\t", atomic_read(&correct));
+	seq_printf(m, "%d\t", atomic_read(&early));
+	seq_printf(m, "%d\t", atomic_read(&late));
+	seq_printf(m, "%d\t", atomic_read(&failed));
+	seq_printf(m, "%d\n", atomic_read(&total));
+
+	for_each_online_cpu(cpu) {
+		seq_printf(m, "%d\t", cpu);
+		seq_printf(m, "%d\t", atomic_read(&per_cpu(_correct, cpu)));
+		seq_printf(m, "%d\t", atomic_read(&per_cpu(_early, cpu)));
+		seq_printf(m, "%d\t", atomic_read(&per_cpu(_late, cpu)));
+		seq_printf(m, "%d\t", atomic_read(&per_cpu(_failed, cpu)));
+		seq_printf(m, "%d\n", atomic_read(&per_cpu(_total, cpu)));
+	}
+
+	return 0;
+}
+
+static int cpuidle_debugfs_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, cpuidle_debugfs_stats_show,
+			   &inode->i_private);
+}
+
+static const struct file_operations stats_fops = {
+	.open = cpuidle_debugfs_stats_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+};
+
+int __init cpuidle_debugfs_init(void)
+{
+	struct dentry *top_dentry;
+	int ret = -ENOMEM;
+
+	/*
+	 * Topmost directory. The reference is kept in order to do a
+	 * recursive remove in case of an error.
+	 */
+	top_dentry = debugfs_create_dir("cpuidle", NULL);
+	if (!top_dentry)
+		return -ENOMEM;
+
+	/*
+	 * Overall statistics for all cpus
+	 */
+	if (!debugfs_create_file("stats", 0400,
+				 top_dentry, NULL, &stats_fops))
+		goto out;
+
+	/*
+	 * Reset statistics file
+	 */
+	if (!debugfs_create_file("reset", 0200,
+				 top_dentry, NULL, &reset_fops))
+		goto out;
+
+	ret = 0;
+out:
+	if (ret)
+		debugfs_remove_recursive(top_dentry);
+
+	return ret;
+}
diff --git a/drivers/cpuidle/debugfs.h b/drivers/cpuidle/debugfs.h
new file mode 100644
index 0000000..20fd719
--- /dev/null
+++ b/drivers/cpuidle/debugfs.h
@@ -0,0 +1,19 @@
+#ifndef __CPUIDLE_DEBUGFS_H
+#define __CPUIDLE_DEBUGFS_H
+
+#ifdef CONFIG_CPU_IDLE_DEBUG
+extern int __init cpuidle_debugfs_init(void);
+extern void cpuidle_debugfs_reset(void);
+extern void cpuidle_debugfs_correct_inc(int cpu);
+extern void cpuidle_debugfs_early_inc(int cpu);
+extern void cpuidle_debugfs_late_inc(int cpu);
+#else
+static int inline cpuidle_debugfs_init(void) { return 0; }
+static inline void cpuidle_debugfs_reset(void) { return; }
+static inline void cpuidle_debugfs_correct_inc(int cpu) {}
+static inline void cpuidle_debugfs_early_inc(int cpu) {}
+static inline void cpuidle_debugfs_late_inc(int cpu) {}
+
+#endif /* CONFIG_CPU_IDLE_DEBUG */
+
+#endif
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 786ad32..0884d76 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -207,6 +207,21 @@ static inline int cpuidle_enter_freeze(struct cpuidle_driver *drv,
 extern void sched_idle_set_state(struct cpuidle_state *idle_state);
 extern void default_idle_call(void);
 
+#ifdef CONFIG_CPU_IDLE_DEBUG
+/* drivers/cpuidle/debugfs.c */
+extern void cpuidle_debugfs_prediction(struct cpuidle_driver *drv,
+				       struct cpuidle_device *dev,
+				       s64 diff, int state);
+#else
+static inline void cpuidle_debugfs_prediction(struct cpuidle_driver *,
+					      struct cpuidle_device *,
+					      s64, int)
+{
+	;
+}
+#endif
+
 #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
 void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a);
 #else
diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c
index 97dc8a6..d483ba3 100644
--- a/kernel/sched/idle-sched.c
+++ b/kernel/sched/idle-sched.c
@@ -291,7 +291,13 @@ int sched_idle(s64 duration, unsigned int latency)
 	 * an interrupt occurs and will take care of
 	 * re-enabling the local interrupts
 	 */
-	return cpuidle_enter(drv, dev, index);
+	ktime_t before = ktime_get(), after;
+
+	ret = cpuidle_enter(drv, dev, index);
+	after = ktime_get();
+	cpuidle_debugfs_prediction(drv, dev,
+				   ktime_us_delta(after, before),
+				   index);
+	return ret;
 	}
 default_idle:
--
1.9.1
The scheduler uses the idle timestamp stored in struct rq to retrieve the time when the cpu went idle, in order to find the idlest cpu. Unfortunately this information is wrong, as it does not have the same meaning from the cpuidle point of view. The idle_stamp in struct rq records when the idle task was scheduled, but the idle task may be interrupted several times, with the cpu going through multiple idle/wakeup cycles in between.
Add the idle start time to the idle state structure.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Conflicts:
	drivers/cpuidle/cpuidle.c
---
 drivers/cpuidle/cpuidle.c | 10 ++++++----
 include/linux/cpuidle.h   |  1 +
 2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 309aa0f..f2d6776 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -174,7 +174,6 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	struct cpuidle_state *target_state = &drv->states[index];
 	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
-	ktime_t time_start, time_end;
 	s64 diff;
 
 	/*
@@ -196,13 +195,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	sched_idle_set_state(target_state);
 
 	trace_cpu_idle_rcuidle(index, dev->cpu);
-	time_start = ktime_get();
+
+	target_state->idle_stamp = ktime_get_ns();
 
 	stop_critical_timings();
 	entered_state = target_state->enter(dev, drv, index);
 	start_critical_timings();
 
-	time_end = ktime_get();
+	diff = (ktime_get_ns() - target_state->idle_stamp) >> 10;
+
+	target_state->idle_stamp = 0;
+
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
 	/* The cpu is no longer idle or about to enter idle. */
@@ -218,7 +221,6 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	if (!cpuidle_state_is_coupled(drv, entered_state))
 		local_irq_enable();
 
-	diff = ktime_to_us(ktime_sub(time_end, time_start));
 	if (diff > INT_MAX)
 		diff = INT_MAX;
 
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 0884d76..3f8ccd5 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -44,6 +44,7 @@ struct cpuidle_state {
 	int		power_usage; /* in mW */
 	unsigned int	target_residency; /* in US */
 	bool		disabled; /* disabled on all CPUs */
+	u64		idle_stamp;
 
 	int (*enter)	(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
--
1.9.1
find_idlest_cpu() assumes the rq->idle_stamp information reflects when the cpu entered the idle state. This is wrong, as the cpu may exit and enter the idle state several times without rq->idle_stamp being updated.
We have two pieces of information here:
* rq->idle_stamp gives when the idle task was scheduled
* idle->idle_stamp gives when the cpu entered the idle state
The patch fixes that by using the latter information, falling back to the rq's timestamp when the idle state is not accessible.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..bcb51eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4742,21 +4742,35 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		if (idle_cpu(i)) {
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
-			if (idle && idle->exit_latency < min_exit_latency) {
-				/*
-				 * We give priority to a CPU whose idle state
-				 * has the smallest exit latency irrespective
-				 * of any idle timestamp.
-				 */
-				min_exit_latency = idle->exit_latency;
-				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
-			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
-				   rq->idle_stamp > latest_idle_timestamp) {
+
+			if (idle) {
+				if (idle->exit_latency < min_exit_latency) {
+					/*
+					 * We give priority to a CPU
+					 * whose idle state has the
+					 * smallest exit latency
+					 * irrespective of any idle
+					 * timestamp.
+					 */
+					min_exit_latency = idle->exit_latency;
+					latest_idle_timestamp = idle->idle_stamp;
+					shallowest_idle_cpu = i;
+				} else if (idle->exit_latency == min_exit_latency &&
+					   idle->idle_stamp > latest_idle_timestamp) {
+					/*
+					 * If the CPU is in the same
+					 * idle state, choose the more
+					 * recent one as it might have
+					 * a warmer cache
+					 */
+					latest_idle_timestamp = idle->idle_stamp;
+					shallowest_idle_cpu = i;
+				}
+			} else if (rq->idle_stamp > latest_idle_timestamp) {
 				/*
-				 * If equal or no active idle state, then
-				 * the most recently idled CPU might have
-				 * a warmer cache.
+				 * If no active idle state, then the
+				 * most recent idled CPU might have a
+				 * warmer cache
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
 				shallowest_idle_cpu = i;
--
1.9.1
This change aggressively tries to save energy by preventing the load balancer from choosing an idle cpu that has not yet reached its target residency.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcb51eb..d996544 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4742,8 +4742,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		if (idle_cpu(i)) {
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
+			u64 now = ktime_to_us(ktime_get());
+
+			if (idle && ((idle->idle_stamp + now) > idle->target_residency)) {
-			if (idle) {
 				if (idle->exit_latency < min_exit_latency) {
 					/*
 					 * We give priority to a CPU
--
1.9.1
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 drivers/cpuidle/debugfs.c | 54 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)
diff --git a/drivers/cpuidle/debugfs.c b/drivers/cpuidle/debugfs.c
index c5aab98..ecc552e 100644
--- a/drivers/cpuidle/debugfs.c
+++ b/drivers/cpuidle/debugfs.c
@@ -89,6 +89,47 @@ ssize_t cpuidle_debugfs_write_reset(struct file *file,
 	return size;
 }
 
+static const struct file_operations reset_fops = {
+	.open = simple_open,
+	.write = cpuidle_debugfs_write_reset,
+	.llseek = seq_lseek,
+};
+
+extern void sched_idle_enable(void);
+extern void sched_idle_disable(void);
+
+ssize_t cpuidle_debugfs_write_sched_idle(struct file *file,
+					 const char __user *user_buf,
+					 size_t count, loff_t *ppos)
+{
+	ssize_t size;
+	bool enable = false;
+
+	file->private_data = &enable;
+
+	size = debugfs_write_file_bool(file, user_buf, count, ppos);
+	if (size > 0)
+		enable ? sched_idle_enable() : sched_idle_disable();
+
+	return size;
+}
+
+ssize_t cpuidle_debugfs_read_sched_idle(struct file *file,
+					char __user *user_buf,
+					size_t count, loff_t *ppos)
+{
+	bool enable;
+
+	file->private_data = &enable;
+
+	return debugfs_read_file_bool(file, user_buf, count, ppos);
+}
+
+static const struct file_operations sched_idle_fops = {
+	.open = simple_open,
+	.write = cpuidle_debugfs_write_sched_idle,
+	.read = cpuidle_debugfs_read_sched_idle,
+	.llseek = seq_lseek,
+};
+
 void cpuidle_debugfs_prediction(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev,
 				s64 duration, int state)
@@ -111,12 +152,6 @@ void cpuidle_debugfs_prediction(struct cpuidle_driver *drv,
 	}
 }
 
-static const struct file_operations reset_fops = {
-	.open = simple_open,
-	.write = cpuidle_debugfs_write_reset,
-	.llseek = seq_lseek,
-};
-
 static int cpuidle_debugfs_stats_show(struct seq_file *m, void *v)
 {
 	int cpu;
@@ -181,6 +216,13 @@ int __init cpuidle_debugfs_init(void)
 				 top_dentry, NULL, &reset_fops))
 		goto out;
 
+	/*
+	 * Switch to sched-idle
+	 */
+	if (!debugfs_create_file("sched-idle", 0200,
+				 top_dentry, NULL, &sched_idle_fops))
+		goto out;
+
 	ret = 0;
 out:
 	if (ret)
--
1.9.1
On 10-Dec 16:48, Daniel Lezcano wrote:
It is usually interesting to do some statistics on a set of data, especially when the values are coming in a stream (measured time by time for instance).
Instead of having the statistics formula inside a specific subsystem, this small library provides the basic statistics functions available for all the kernel.
The library is designed to do the minimum computation when a new value is added. Only the basic value storage, array shifting and accumulation is done.
The average, variance and standard deviation is computed when requested via the corresponding functions.
The statistics library can handle up to 65536 values in the range -2^24 to 2^24 - 1. These limits are large, and it does not really make sense for kernel code to use the statistics at those extremes, so it is up to the developer to use the library wisely.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
include/linux/stats.h | 29 +++++++ lib/Makefile | 3 +- lib/stats.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 266 insertions(+), 1 deletion(-) create mode 100644 include/linux/stats.h create mode 100644 lib/stats.c
diff --git a/include/linux/stats.h b/include/linux/stats.h new file mode 100644 index 0000000..664eb30 --- /dev/null +++ b/include/linux/stats.h @@ -0,0 +1,29 @@ +/*
- include/linux/stats.h
- */
+#ifndef _LINUX_STATS_H +#define _LINUX_STATS_H
+struct stats {
- s64 sum; /* sum of values */
- u64 sq_sum; /* sum of the square values */
- s32 *values; /* array of values */
- s32 min; /* minimal value of the entire series */
- s32 max; /* maximal value of the entire series */
- unsigned int n; /* current number of values */
- unsigned int w_ptr; /* current window pointer */
- unsigned short len; /* size of the value array */
+};
+extern s32 stats_max(struct stats *s); +extern s32 stats_min(struct stats *s); +extern s32 stats_mean(struct stats *s); +extern u32 stats_variance(struct stats *s); +extern u32 stats_stddev(struct stats *s); +extern void stats_add(struct stats *s, s32 val); +extern void stats_reset(struct stats *s); +extern void stats_free(struct stats *s); +extern struct stats *stats_alloc(unsigned short len); +extern unsigned short stats_n(struct stats *s);
+#endif /* _LINUX_STATS_H */ diff --git a/lib/Makefile b/lib/Makefile index 13a7c6a..18460d2 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -26,7 +26,8 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \ bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
stats.o
obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/stats.c b/lib/stats.c new file mode 100644 index 0000000..f5425d1 --- /dev/null +++ b/lib/stats.c @@ -0,0 +1,235 @@ +/*
- Implementation of basics statistics functions useful to compute on
- a stream of data.
- Copyright: (C) 2015-2016 Linaro Limited
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/export.h> +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/stats.h> +#include <linux/types.h>
+/**
- stats_n - number of elements present in the statistics
- @s: the statistic structure
- Returns an unsigned short representing the number of values in the
- statistics series
- */
+unsigned short stats_n(struct stats *s) +{
- return s->n;
+}
+/**
- stats_max - maximal value of the series
- @s: the statistic structure
- Returns a s32 representing the maximum value of the series
- */
+s32 stats_max(struct stats *s) +{
- return s->max;
+} +EXPORT_SYMBOL_GPL(stats_max);
+/**
- stats_min - minimal value of the series
- @s: the statistic structure
- Returns a s32 representing the minimal value of the series
- */
+s32 stats_min(struct stats *s) +{
- return s->min;
+} +EXPORT_SYMBOL_GPL(stats_min);
+/**
- stats_mean - compute the average
- @s: the statistics structure
- Returns an s32 corresponding to the mean value, or zero if there is
- no data
- */
+s32 stats_mean(struct stats *s) +{
- return s->n ? s->sum / s->n : 0;
+} +EXPORT_SYMBOL_GPL(stats_mean);
+/**
- stats_variance - compute the variance
- @s: the statistic structure
- Returns an u32 corresponding to the variance, or zero if there is no
- data
- */
+u32 stats_variance(struct stats *s) +{
- s32 mean = stats_mean(s);
- return s->n ? (s->sq_sum / s->n) - (mean * mean) : 0;
+} +EXPORT_SYMBOL_GPL(stats_variance);
+/**
- stats_stddev - compute the standard deviation
- @s: the statistic structure
- Returns an u32 corresponding to the standard deviation, or zero if
- there is no data
- */
+u32 stats_stddev(struct stats *s) +{
- return int_sqrt(stats_variance(s));
+} +EXPORT_SYMBOL_GPL(stats_stddev);
+/**
- stats_add - add a new value in the statistic structure
- @s: the statistic structure
- @value: the new value to be added, max 2^24 - 1
- Adds the value to the array, if the array is full, then the array
- is shifted.
- */
+void stats_add(struct stats *s, s32 value)
+{
+	/*
+	 * In order to prevent an overflow in the statistic code, we
+	 * limit the value to 2^24 - 1, if it is greater we just
+	 * ignore it with a WARN_ON_ONCE letting know to userspace we
+	 * are dealing with out-of-range values.
+	 */
+	if (WARN_ON_ONCE(value >= ((2<<24) - 1)))
^^^^^^^^^^^ Should this not be ((1<<24) -1)?
return;
/*
* Insert the value in the array. If the array is already
* full, shift the values and add the value at the end of the
* series, otherwise add the value directly.
*/
- if (likely(s->len == s->n)) {
s->sum -= s->values[s->w_ptr];
s->sq_sum -= s->values[s->w_ptr] * s->values[s->w_ptr];
s->values[s->w_ptr] = value;
s->w_ptr = (s->w_ptr + 1) % s->len;
- } else {
s->values[s->n] = value;
s->n++;
- }
- /*
* Keep track of the min and max.
*/
- s->min = min(s->min, value);
- s->max = max(s->max, value);
/*
* In order to reduce the overhead and to prevent value
* derivation due to the integer computation, we just sum the
* value and do the division when the average and the variance
* are requested.
*/
- s->sum += value;
- s->sq_sum += value * value;
+} +EXPORT_SYMBOL_GPL(stats_add);
+/**
- stats_reset - reset the stats
- @s: the statistic structure
- Reset the statistics and reset the values
- */
+void stats_reset(struct stats *s) +{
- s->sum = s->sq_sum = s->n = s->w_ptr = 0;
- s->max = S32_MIN;
- s->min = S32_MAX;
+} +EXPORT_SYMBOL_GPL(stats_reset);
+/**
- stats_alloc - allocate a structure for statistics
- @len: The number of items in the array which is limited by the
unsigned short type. That allows to prevent overflow in all
the statistics code.
- Allocates memory to store the different values and initialize the
- structure.
- In order to prevent an overflow in the computation, the maximum
- allowed number of values is 65536 and each value max is 2^24 - 1.
- The variance is the sum of the square value of the difference of
- each value to the average. The variance is a u64 (square values are
- always positives), so that gives a maximum of 18446744073709551615.
- We can store 65536 values, so:
- 18446744073709551615 / 65536 = 281474976710655
- ... is the square max value we can have, hence the difference to
- the mean is max sqrt(281474976710655) = 16777215 (2^24 -1)
- Even if these values are not realistics (statistic in the kernel is
- for a few hundred values, large dispersion in the integer limits is
- very very rare so the sum won't be very high, or high integers
- series means low variance), we prevent any overflow in the code and
- we are safe.
- Returns a valid pointer a struct stats, NULL if the memory
- allocation fails.
- */
+struct stats *stats_alloc(unsigned short len) +{
- struct stats *s;
- s32 *values;
- s = kzalloc(sizeof(*s), GFP_KERNEL);
- if (s) {
values = kzalloc(sizeof(*values) * len, GFP_KERNEL);
if (!values) {
kfree(s);
return NULL;
}
s->values = values;
s->len = len;
s->min = S32_MAX;
s->max = S32_MIN;
- }
- return s;
+} +EXPORT_SYMBOL_GPL(stats_alloc);
+/**
- stats_free - free the statistics structure
- @s : the statistics structure
- Frees the memory allocated by the function stats_alloc.
- */
+void stats_free(struct stats *s) +{
- kfree(s->values);
- kfree(s);
+}
+EXPORT_SYMBOL_GPL(stats_free);
1.9.1
eas-dev mailing list eas-dev@lists.linaro.org https://lists.linaro.org/mailman/listinfo/eas-dev
-- #include <best/regards.h>
Patrick Bellasi
On 12/10/2015 05:29 PM, Patrick Bellasi wrote:
[ ... ]
+void stats_add(struct stats *s, s32 value)
+{
+	/*
+	 * In order to prevent an overflow in the statistic code, we
+	 * limit the value to 2^24 - 1, if it is greater we just
+	 * ignore it with a WARN_ON_ONCE letting know to userspace we
+	 * are dealing with out-of-range values.
+	 */
+	if (WARN_ON_ONCE(value >= ((2<<24) - 1)))
^^^^^^^^^^^
Should this not be ((1<<24) -1)?
Ah yes :)
-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
It is usually interesting to do some statistics on a set of data, especially when the values are coming in a stream (measured time by time for instance).
Instead of having the statistics formula inside a specific subsystem, this small library provides the basic statistics functions available for all the kernel.
The library is designed to do the minimum computation when a new value is added. Only the basic value storage, array shifting and accumulation is done.
The average, variance and standard deviation is computed when requested via the corresponding functions.
The statistics library can handle up to 65536 values in the range -2^24 to 2^24 - 1. These limits are large, and it does not really make sense for kernel code to use the statistics at those extremes, so it is up to the developer to use the library wisely.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
Comments below.
include/linux/stats.h | 29 +++++++ lib/Makefile | 3 +- lib/stats.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 266 insertions(+), 1 deletion(-) create mode 100644 include/linux/stats.h create mode 100644 lib/stats.c
diff --git a/include/linux/stats.h b/include/linux/stats.h new file mode 100644 index 0000000..664eb30 --- /dev/null +++ b/include/linux/stats.h @@ -0,0 +1,29 @@ +/*
- include/linux/stats.h
- */
+#ifndef _LINUX_STATS_H +#define _LINUX_STATS_H
+struct stats {
- s64 sum; /* sum of values */
- u64 sq_sum; /* sum of the square values */
- s32 *values; /* array of values */
- s32 min; /* minimal value of the entire series */
- s32 max; /* maximal value of the entire series */
- unsigned int n; /* current number of values */
- unsigned int w_ptr; /* current window pointer */
- unsigned short len; /* size of the value array */
+};
+extern s32 stats_max(struct stats *s); +extern s32 stats_min(struct stats *s); +extern s32 stats_mean(struct stats *s); +extern u32 stats_variance(struct stats *s); +extern u32 stats_stddev(struct stats *s); +extern void stats_add(struct stats *s, s32 val); +extern void stats_reset(struct stats *s); +extern void stats_free(struct stats *s); +extern struct stats *stats_alloc(unsigned short len); +extern unsigned short stats_n(struct stats *s);
+#endif /* _LINUX_STATS_H */ diff --git a/lib/Makefile b/lib/Makefile index 13a7c6a..18460d2 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -26,7 +26,8 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \ bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
stats.o
obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/stats.c b/lib/stats.c new file mode 100644 index 0000000..f5425d1 --- /dev/null +++ b/lib/stats.c @@ -0,0 +1,235 @@ +/*
- Implementation of basics statistics functions useful to compute on
- a stream of data.
- Copyright: (C) 2015-2016 Linaro Limited
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/export.h> +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/stats.h> +#include <linux/types.h>
+/**
- stats_n - number of elements present in the statistics
- @s: the statistic structure
- Returns an unsigned short representing the number of values in the
- statistics series
- */
+unsigned short stats_n(struct stats *s) +{
- return s->n;
+}
+/**
- stats_max - maximal value of the series
- @s: the statistic structure
- Returns a s32 representing the maximum value of the series
- */
+s32 stats_max(struct stats *s) +{
- return s->max;
+} +EXPORT_SYMBOL_GPL(stats_max);
+/**
- stats_min - minimal value of the series
- @s: the statistic structure
- Returns a s32 representing the minimal value of the series
- */
+s32 stats_min(struct stats *s) +{
- return s->min;
+} +EXPORT_SYMBOL_GPL(stats_min);
+/**
- stats_mean - compute the average
- @s: the statistics structure
- Returns an s32 corresponding to the mean value, or zero if there is
- no data
- */
+s32 stats_mean(struct stats *s) +{
- return s->n ? s->sum / s->n : 0;
+} +EXPORT_SYMBOL_GPL(stats_mean);
Those are all very simple functions. Why didn't you make them static inline in stats.h instead? The overhead of calling them out of line is likely to be more costly than the actual work the function does, and it produces bigger code.
+/**
- stats_variance - compute the variance
- @s: the statistic structure
- Returns an u32 corresponding to the variance, or zero if there is no
- data
- */
+u32 stats_variance(struct stats *s)
+{
+	s32 mean = stats_mean(s);
+
+	return s->n ? (s->sq_sum / s->n) - (mean * mean) : 0;
Here sq_sum is a u64 value. On 32-bit architectures you are not allowed to do a straight division with a 64-bit dividend in the kernel; you have to use do_div(). The do_div() helper makes the division much less costly than a plain division, where gcc promotes both the dividend and the divisor to 64 bits. Even so, do_div() remains a very expensive operation when the divisor isn't constant.
Furthermore, given you allow for values with a 24-bit magnitude, it is possible for the mean to have that magnitude. With mean being a s32 variable, any mean value with a magnitude greater than 15 bits (or actually 46340 to be precise) will overflow the product above.
+} +EXPORT_SYMBOL_GPL(stats_variance);
+/**
- stats_stddev - compute the standard deviation
- @s: the statistic structure
- Returns an u32 corresponding to the standard deviation, or zero if
- there is no data
- */
+u32 stats_stddev(struct stats *s) +{
- return int_sqrt(stats_variance(s));
+} +EXPORT_SYMBOL_GPL(stats_stddev);
You realize that int_sqrt() is also somewhat expensive.
+/**
- stats_add - add a new value in the statistic structure
- @s: the statistic structure
- @value: the new value to be added, max 2^24 - 1
- Adds the value to the array, if the array is full, then the array
- is shifted.
- */
+void stats_add(struct stats *s, s32 value)
+{
+	/*
+	 * In order to prevent an overflow in the statistic code, we
+	 * limit the value to 2^24 - 1, if it is greater we just
+	 * ignore it with a WARN_ON_ONCE letting know to userspace we
+	 * are dealing with out-of-range values.
+	 */
+	if (WARN_ON_ONCE(value >= ((2<<24) - 1)))
+		return;
You don't trap large negative values?
+	/*
+	 * Insert the value in the array. If the array is already
+	 * full, shift the values and add the value at the end of the
+	 * series, otherwise add the value directly.
+	 */
+	if (likely(s->len == s->n)) {
+		s->sum -= s->values[s->w_ptr];
+		s->sq_sum -= s->values[s->w_ptr] * s->values[s->w_ptr];
+		s->values[s->w_ptr] = value;
+		s->w_ptr = (s->w_ptr + 1) % s->len;
+	} else {
+		s->values[s->n] = value;
+		s->n++;
+	}
+
+	/*
+	 * Keep track of the min and max.
+	 */
+	s->min = min(s->min, value);
+	s->max = max(s->max, value);
+
+	/*
+	 * In order to reduce the overhead and to prevent value
+	 * derivation due to the integer computation, we just sum the
+	 * value and do the division when the average and the variance
+	 * are requested.
+	 */
+	s->sum += value;
+	s->sq_sum += value * value;
Same issue as in stats_variance(): given value is a s32 variable, any value greater than 46340 will overflow the value * value product. For example, if value = 46341, you'll actually add -2147479015 (yes, a large negative value) to s->sq_sum -- probably not what you intended. And because s->sq_sum is unsigned, then that negative value will then be interpreted as a very large positive value later on.
+} +EXPORT_SYMBOL_GPL(stats_add);
+/**
- stats_reset - reset the stats
- @s: the statistic structure
- Reset the statistics and reset the values
- */
+void stats_reset(struct stats *s) +{
- s->sum = s->sq_sum = s->n = s->w_ptr = 0;
- s->max = S32_MIN;
- s->min = S32_MAX;
+} +EXPORT_SYMBOL_GPL(stats_reset);
+/**
- stats_alloc - allocate a structure for statistics
- @len: The number of items in the array which is limited by the
unsigned short type. That allows to prevent overflow in all
the statistics code.
- Allocates memory to store the different values and initialize the
- structure.
- In order to prevent an overflow in the computation, the maximum
- allowed number of values is 65536 and each value max is 2^24 - 1.
- The variance is the sum of the square value of the difference of
- each value to the average. The variance is a u64 (square values are
- always positives), so that gives a maximum of 18446744073709551615.
- We can store 65536 values, so:
- 18446744073709551615 / 65536 = 281474976710655
- ... is the square max value we can have, hence the difference to
- the mean is max sqrt(281474976710655) = 16777215 (2^24 -1)
- Even if these values are not realistics (statistic in the kernel is
- for a few hundred values, large dispersion in the integer limits is
- very very rare so the sum won't be very high, or high integers
- series means low variance), we prevent any overflow in the code and
- we are safe.
- Returns a valid pointer a struct stats, NULL if the memory
- allocation fails.
- */
+struct stats *stats_alloc(unsigned short len) +{
- struct stats *s;
- s32 *values;
- s = kzalloc(sizeof(*s), GFP_KERNEL);
- if (s) {
values = kzalloc(sizeof(*values) * len, GFP_KERNEL);
if (!values) {
kfree(s);
return NULL;
}
s->values = values;
s->len = len;
s->min = S32_MAX;
s->max = S32_MIN;
- }
- return s;
+} +EXPORT_SYMBOL_GPL(stats_alloc);
+/**
- stats_free - free the statistics structure
- @s : the statistics structure
- Frees the memory allocated by the function stats_alloc.
- */
+void stats_free(struct stats *s) +{
- kfree(s->values);
- kfree(s);
+}
+EXPORT_SYMBOL_GPL(stats_free);
1.9.1
On 12/10/2015 06:16 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
It is usually interesting to do some statistics on a set of data, especially when the values are coming in a stream (measured time by time for instance).
Instead of having the statistics formula inside a specific subsystem, this small library provides the basic statistics functions available for all the kernel.
The library is designed to do the minimum computation when a new value is added. Only the basic value storage, array shifting and accumulation is done.
The average, variance and standard deviation is computed when requested via the corresponding functions.
The statistic library can deal up to 65536 values in the range of -2^24 and 2^24 - 1. These are large values and does not really make sense to the kernel code to use the statistics at these limits, so it is up to developer to use wisely the library.
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
Comments below.
include/linux/stats.h | 29 +++++++ lib/Makefile | 3 +- lib/stats.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 266 insertions(+), 1 deletion(-) create mode 100644 include/linux/stats.h create mode 100644 lib/stats.c
diff --git a/include/linux/stats.h b/include/linux/stats.h
new file mode 100644
index 0000000..664eb30
--- /dev/null
+++ b/include/linux/stats.h
@@ -0,0 +1,29 @@
+/*
+ * include/linux/stats.h
+ */
+#ifndef _LINUX_STATS_H
+#define _LINUX_STATS_H
+
+struct stats {
+	s64 sum;		/* sum of values */
+	u64 sq_sum;		/* sum of the square values */
+	s32 *values;		/* array of values */
+	s32 min;		/* minimal value of the entire series */
+	s32 max;		/* maximal value of the entire series */
+	unsigned int n;		/* current number of values */
+	unsigned int w_ptr;	/* current window pointer */
+	unsigned short len;	/* size of the value array */
+};
+
+extern s32 stats_max(struct stats *s);
+extern s32 stats_min(struct stats *s);
+extern s32 stats_mean(struct stats *s);
+extern u32 stats_variance(struct stats *s);
+extern u32 stats_stddev(struct stats *s);
+extern void stats_add(struct stats *s, s32 val);
+extern void stats_reset(struct stats *s);
+extern void stats_free(struct stats *s);
+extern struct stats *stats_alloc(unsigned short len);
+extern unsigned short stats_n(struct stats *s);
+
+#endif /* _LINUX_STATS_H */

diff --git a/lib/Makefile b/lib/Makefile
index 13a7c6a..18460d2 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -26,7 +26,8 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
	 bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
-	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o
+	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
+	 stats.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/stats.c b/lib/stats.c
new file mode 100644
index 0000000..f5425d1
--- /dev/null
+++ b/lib/stats.c
@@ -0,0 +1,235 @@
+/*
+ * Implementation of basic statistics functions useful to compute on
+ * a stream of data.
+ *
+ * Copyright: (C) 2015-2016 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/export.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/stats.h>
+#include <linux/types.h>
+/**
+ * stats_n - number of elements present in the statistics
+ * @s: the statistic structure
+ *
+ * Returns an unsigned short representing the number of values in the
+ * statistics series
+ */
+unsigned short stats_n(struct stats *s)
+{
+	return s->n;
+}
+
+/**
+ * stats_max - maximal value of the series
+ * @s: the statistic structure
+ *
+ * Returns a s32 representing the maximum value of the series
+ */
+s32 stats_max(struct stats *s)
+{
+	return s->max;
+}
+EXPORT_SYMBOL_GPL(stats_max);
+
+/**
+ * stats_min - minimal value of the series
+ * @s: the statistic structure
+ *
+ * Returns a s32 representing the minimal value of the series
+ */
+s32 stats_min(struct stats *s)
+{
+	return s->min;
+}
+EXPORT_SYMBOL_GPL(stats_min);
+
+/**
+ * stats_mean - compute the average
+ * @s: the statistics structure
+ *
+ * Returns an s32 corresponding to the mean value, or zero if there is
+ * no data
+ */
+s32 stats_mean(struct stats *s)
+{
+	return s->n ? s->sum / s->n : 0;
+}
+EXPORT_SYMBOL_GPL(stats_mean);
Those are all very simple functions. Why didn't you make them static inline in stats.h instead? The overhead of calling them out of line alone is likely to be more costly than the actual work of the function, and it produces bigger code.
Ok, I changed them to static inline.
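For reference, the reworked accessors might look something like this. This is a hypothetical userspace sketch, not the actual patch: stdint types stand in for the kernel's s32/s64/u64 typedefs, and the struct layout is copied from the header above.

```c
#include <stdint.h>

struct stats {
	int64_t sum;		/* sum of values */
	uint64_t sq_sum;	/* sum of the square values */
	int32_t *values;	/* array of values */
	int32_t min;		/* minimal value of the entire series */
	int32_t max;		/* maximal value of the entire series */
	unsigned int n;		/* current number of values */
	unsigned int w_ptr;	/* current window pointer */
	unsigned short len;	/* size of the value array */
};

/* Trivial accessors inlined at the call site: no out-of-line call,
 * smaller code overall. */
static inline unsigned short stats_n(struct stats *s)
{
	return s->n;
}

static inline int32_t stats_max(struct stats *s)
{
	return s->max;
}

static inline int32_t stats_min(struct stats *s)
{
	return s->min;
}

static inline int32_t stats_mean(struct stats *s)
{
	return s->n ? s->sum / s->n : 0;
}
```

Only the non-trivial functions (stats_add, stats_alloc, and friends) would then remain out of line in lib/stats.c.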
+/**
+ * stats_variance - compute the variance
+ * @s: the statistic structure
+ *
+ * Returns an u32 corresponding to the variance, or zero if there is no
+ * data
+ */
+u32 stats_variance(struct stats *s)
+{
+	s32 mean = stats_mean(s);
+
+	return s->n ? (s->sq_sum / s->n) - (mean * mean) : 0;
Here sq_sum is a u64 value. On 32-bit architectures you are not allowed to do a straight division with 64-bit dividends in the kernel; you have to use do_div(). The do_div() helper is there to make the division much less costly than the plain division, where gcc promotes both the dividend and the divisor to 64 bits. Even so, do_div() is still a very expensive operation when the divisor isn't constant.
Ah, ok. Will fix this.
Furthermore, given you allow for values with a 24-bit magnitude, it is possible for the mean to have that magnitude. With mean being a s32 variable, any mean value with a magnitude greater than 15 bits (or actually 46340 to be precise) will overflow the product above.
Thanks for spotting this.
I think the return type is wrong, it should be an u64.
But then how does a u64 behave with int_sqrt(), which is based on BITS_PER_LONG?
+}
+EXPORT_SYMBOL_GPL(stats_variance);
+
+/**
+ * stats_stddev - compute the standard deviation
+ * @s: the statistic structure
+ *
+ * Returns an u32 corresponding to the standard deviation, or zero if
+ * there is no data
+ */
+u32 stats_stddev(struct stats *s)
+{
+	return int_sqrt(stats_variance(s));
+}
+EXPORT_SYMBOL_GPL(stats_stddev);
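To make the two issues above concrete, here is a hypothetical userspace sketch of a widened variance computation: the return type is a u64 and the mean is promoted before squaring, so a 24-bit-magnitude mean can no longer overflow the product. This is plain C, not kernel code, so the 64-bit divisions are written directly; in the kernel they would have to go through do_div() on 32-bit architectures.

```c
#include <stdint.h>

/*
 * Hypothetical fix: widen the mean before squaring so the product is
 * a full 64-bit multiply, and return a u64 so large variances are
 * not truncated.
 */
static uint64_t stats_variance64(int64_t sum, uint64_t sq_sum, unsigned int n)
{
	int64_t mean;

	if (!n)
		return 0;

	mean = sum / (int64_t)n;

	/* (int64_t)mean * mean cannot wrap for |mean| < 2^24 */
	return sq_sum / n - (uint64_t)(mean * mean);
}
```

The truncation question about int_sqrt() remains, though: feeding this u64 result into a function taking an unsigned long loses the high bits on 32-bit.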
You realize that int_sqrt() is also somewhat expensive.
Yes, this is why the stats library should be used wisely.
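For context on that cost, an integer square root in the same bit-by-bit spirit as int_sqrt() looks roughly like the userspace sketch below. This is a simplified stand-in operating on a u64, not the kernel's implementation (whose argument is an unsigned long): one conditional subtraction per result bit, no multiply or divide, but still a loop of several dozen iterations.

```c
#include <stdint.h>

/* Bit-by-bit integer square root: returns floor(sqrt(x)). */
static uint64_t isqrt64(uint64_t x)
{
	uint64_t op = x, res = 0;
	uint64_t one = 1ULL << 62;	/* highest power of 4 <= 2^63 */

	/* Find the highest power of 4 not exceeding x. */
	while (one > op)
		one >>= 2;

	while (one != 0) {
		if (op >= res + one) {
			op -= res + one;
			res += 2 * one;
		}
		res >>= 1;
		one >>= 2;
	}
	return res;
}
```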
+/**
+ * stats_add - add a new value in the statistic structure
+ * @s: the statistic structure
+ * @value: the new value to be added, max 2^24 - 1
+ *
+ * Adds the value to the array; if the array is full, then the array
+ * is shifted.
+ */
+void stats_add(struct stats *s, s32 value)
+{
+	/*
+	 * In order to prevent an overflow in the statistic code, we
+	 * limit the value to 2^24 - 1; if it is greater, we just
+	 * ignore it with a WARN_ON_ONCE letting userspace know we
+	 * are dealing with out-of-range values.
+	 */
+	if (WARN_ON_ONCE(value >= ((2 << 24) - 1)))
+		return;
You don't trap large negative values?
Right, fixed now. Thanks !
+	/*
+	 * Insert the value in the array. If the array is already
+	 * full, shift the values and add the value at the end of the
+	 * series, otherwise add the value directly.
+	 */
+	if (likely(s->len == s->n)) {
+		s->sum -= s->values[s->w_ptr];
+		s->sq_sum -= s->values[s->w_ptr] * s->values[s->w_ptr];
+		s->values[s->w_ptr] = value;
+		s->w_ptr = (s->w_ptr + 1) % s->len;
+	} else {
+		s->values[s->n] = value;
+		s->n++;
+	}
+
+	/*
+	 * Keep track of the min and max.
+	 */
+	s->min = min(s->min, value);
+	s->max = max(s->max, value);
+
+	/*
+	 * In order to reduce the overhead and to prevent value
+	 * derivation due to the integer computation, we just sum the
+	 * value and do the division when the average and the variance
+	 * are requested.
+	 */
+	s->sum += value;
+	s->sq_sum += value * value;
Same issue as in stats_variance(): given value is a s32 variable, any value greater than 46340 will overflow the value * value product. For example, if value = 46341, you'll actually add -2147479015 (yes, a large negative value) to s->sq_sum -- probably not what you intended. And because s->sq_sum is unsigned, then that negative value will then be interpreted as a very large positive value later on.
Ok, I trust what you say, but 46340 is a small value and I did not see this overflow occurring at runtime in the traces. Maybe I missed it.
What would you suggest?
Shall we use a temporary u64 variable to store the result of the product?
Or change the mean type from s32 to s64?
+}
+EXPORT_SYMBOL_GPL(stats_add);
+
+/**
+ * stats_reset - reset the stats
+ * @s: the statistic structure
+ *
+ * Reset the statistics and reset the values
+ */
+void stats_reset(struct stats *s)
+{
+	s->sum = s->sq_sum = s->n = s->w_ptr = 0;
+	s->max = S32_MIN;
+	s->min = S32_MAX;
+}
+EXPORT_SYMBOL_GPL(stats_reset);
+/**
+ * stats_alloc - allocate a structure for statistics
+ * @len: the number of items in the array, which is limited by the
+ *	unsigned short type. That allows to prevent overflow in all
+ *	the statistics code.
+ *
+ * Allocates memory to store the different values and initializes the
+ * structure.
+ *
+ * In order to prevent an overflow in the computation, the maximum
+ * allowed number of values is 65536 and each value's max is 2^24 - 1.
+ *
+ * The variance is the sum of the squared differences of each value
+ * to the average. The variance is a u64 (squared values are always
+ * positive), so that gives a maximum of 18446744073709551615.
+ * We can store 65536 values, so:
+ *
+ *	18446744073709551615 / 65536 = 281474976710655
+ *
+ * ... is the maximum squared value we can have, hence the maximum
+ * difference to the mean is sqrt(281474976710655) = 16777215 (2^24 - 1).
+ *
+ * Even if these values are not realistic (statistics in the kernel
+ * are for a few hundred values, large dispersion at the integer
+ * limits is very rare so the sum won't be very high, and high integer
+ * series mean low variance), we prevent any overflow in the code and
+ * we are safe.
+ *
+ * Returns a valid pointer to a struct stats, or NULL if the memory
+ * allocation fails.
+ */
+struct stats *stats_alloc(unsigned short len)
+{
+	struct stats *s;
+	s32 *values;
+
+	s = kzalloc(sizeof(*s), GFP_KERNEL);
+	if (s) {
+		values = kzalloc(sizeof(*values) * len, GFP_KERNEL);
+		if (!values) {
+			kfree(s);
+			return NULL;
+		}
+		s->values = values;
+		s->len = len;
+		s->min = S32_MAX;
+		s->max = S32_MIN;
+	}
+
+	return s;
+}
+EXPORT_SYMBOL_GPL(stats_alloc);
+/**
+ * stats_free - free the statistics structure
+ * @s: the statistics structure
+ *
+ * Frees the memory allocated by the function stats_alloc.
+ */
+void stats_free(struct stats *s)
+{
+	kfree(s->values);
+	kfree(s);
+}
+EXPORT_SYMBOL_GPL(stats_free);
--
1.9.1
-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:16 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
+u32 stats_variance(struct stats *s)
+{
+	s32 mean = stats_mean(s);
+
+	return s->n ? (s->sq_sum / s->n) - (mean * mean) : 0;
Here sq_sum is a u64 value. On 32-bit architectures you are not allowed to do a straight division with 64-bit dividends in the kernel; you have to use do_div(). The do_div() helper is there to make the division much less costly than the plain division, where gcc promotes both the dividend and the divisor to 64 bits. Even so, do_div() is still a very expensive operation when the divisor isn't constant.
Ah, ok. Will fix this.
Note that this is the reason why in my version of the IRQ stat code, I avoided all non constant divisions by factoring out the n term. Here you have:
sq_sum / n - mean * mean
where mean is obtained by:
sum / n
So by calling stats_variance() you pay the cost of two expensive divisions. If you call stats_mean() separately from higher level code (as you do in patch #3) then you add another division to the mix.
One reason I didn't use a library abstraction in my own implementation is because I wanted to have n always be a constant for such divisions (plain or do_div() style) so that gcc could optimize them with reciprocal multiplications at compile time. And some other divisions were factored out and postponed to the end, etc.
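The reciprocal-multiplication trick being described can be illustrated in userspace. The constant below is chosen by hand for a divisor of 7 (it is illustrative only, not taken from the kernel's reciprocal_div helpers): the same multiply-and-shift transformation is what gcc emits when the divisor is known at compile time.

```c
#include <stdint.h>

/*
 * Division by the constant 7 as a multiply and a shift.
 * M = ceil(2^34 / 7); with this choice the result is exact for any
 * x < 2^31, which more than covers the value range of the library.
 */
static uint32_t div7(uint32_t x)
{
	const uint64_t M = 2454267027ULL;	/* ceil(2^34 / 7) */

	return (uint32_t)(((uint64_t)x * M) >> 34);
}
```

When n is only known at runtime, no such constant can be precomputed at compile time, which is why a non-constant divisor is so much more expensive.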
Furthermore, given you allow for values with a 24-bit magnitude, it is possible for the mean to have that magnitude. With mean being a s32 variable, any mean value with a magnitude greater than 15 bits (or actually 46340 to be precise) will overflow the product above.
Thanks for spotting this.
I think the return type is wrong, it should be an u64.
Right.
But then how does a u64 behave with int_sqrt(), which is based on BITS_PER_LONG?
The prototype takes an unsigned long argument. Therefore, on a 32-bit architecture, your u64 value will be truncated.
+	s->sum += value;
+	s->sq_sum += value * value;
Same issue as in stats_variance(): given value is a s32 variable, any value greater than 46340 will overflow the value * value product. For example, if value = 46341, you'll actually add -2147479015 (yes, a large negative value) to s->sq_sum -- probably not what you intended. And because s->sq_sum is unsigned, then that negative value will then be interpreted as a very large positive value later on.
Ok, I trust what you say, but 46340 is a small value and I did not see this overflow occurring at runtime in the traces. Maybe I missed it.
What would you suggest?
	s->sq_sum += (s64)value * value;
This way both terms will be promoted to s64. On ARM, gcc is smart enough to use the smlal instruction that performs it all, including the addition.
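The difference is easy to reproduce in a userspace sketch. The wrapped 32-bit product is emulated through unsigned arithmetic here so the example avoids signed-overflow undefined behaviour, and the narrowing cast relies on the usual two's-complement behaviour; stdint types stand in for s32/s64.

```c
#include <stdint.h>

/* What the unfixed code effectively computes: a 32-bit product that
 * wraps for |value| > 46340. */
static int32_t square_wrapped(int32_t v)
{
	return (int32_t)((uint32_t)v * (uint32_t)v);
}

/* The fixed form: promote one operand first so the whole multiply is
 * done in 64 bits. */
static int64_t square_wide(int32_t v)
{
	return (int64_t)v * v;
}
```

With value = 46341 the wrapped product is exactly the -2147479015 mentioned above, while the widened one gives the true square.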
Nicolas
On 12/10/2015 11:31 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:16 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
Note that this is the reason why in my version of the IRQ stat code, I avoided all non constant divisions by factoring out the n term. Here you have:
sq_sum / n - mean * mean
where mean is obtained by:
sum / n
So by calling stats_variance() you pay the cost of two expensive divisions. If you call stats_mean() separately from higher level code (as you do in patch #3) then you add another division to the mix.
One reason I didn't use a library abstraction in my own implementation is because I wanted to have n always be a constant for such divisions (plain or do_div() style) so that gcc could optimize them with reciprocal multiplications at compile time. And some other divisions were factored out and postponed to the end, etc.
Ok, I understand the optimization. It is a fair argument.
But then how does a u64 behave with int_sqrt(), which is based on BITS_PER_LONG?
The prototype has an unsigned long for argument. Therefore, on a 32-bit architecture, your u64 value will be truncated.
Mmh, ok. It would be simple then to remove the stddev function and let the caller deal with the variance.
On 12/10/2015 11:31 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:16 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
One reason I didn't use a library abstraction in my own implementation is because I wanted to have n always be a constant for such divisions (plain or do_div() style) so that gcc could optimize them with reciprocal multiplications at compile time. And some other divisions were factored out and postponed to the end, etc.
Ok, if n is constant and I define it as a 2^x value, bit shifting will be much more efficient than an optimized division, right?
On Fri, 11 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 11:31 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
On 12/10/2015 06:16 PM, Nicolas Pitre wrote:
On Thu, 10 Dec 2015, Daniel Lezcano wrote:
Ok, if n is constant and I define it as a 2^x value, bit shifting will be much more efficient than an optimized division, right?
Absolutely. And you don't have to shift values manually as gcc is smart enough to do it on its own.
Nicolas
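As a closing illustration (a hypothetical sketch, not kernel code): with a power-of-two window length the mean's division reduces to a shift. The one caveat worth knowing is that an arithmetic right shift of a negative sum rounds toward negative infinity, while C division rounds toward zero, so the two only agree for non-negative sums.

```c
#include <stdint.h>

#define STATS_LEN_SHIFT	6			/* window of 64 values */
#define STATS_LEN	(1u << STATS_LEN_SHIFT)

/* Mean over a full power-of-two window: a shift for non-negative
 * sums, falling back to division to keep round-toward-zero semantics
 * for negative ones. */
static int64_t mean_pow2(int64_t sum)
{
	if (sum >= 0)
		return sum >> STATS_LEN_SHIFT;
	return sum / (int64_t)STATS_LEN;
}
```

In practice gcc performs exactly this kind of strength reduction on its own for `sum / 64`, including the correction for negative dividends, so writing the plain division with a constant divisor is enough.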