The cap_idx should be stored in energy_env->cap_idx, because
group_norm_usage() uses it later.
Signed-off-by: Mark Yang <mark.yang(a)spreadtrum.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93005c9..ced4a99 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4617,7 +4617,7 @@ static int find_new_capacity(struct energy_env *eenv,
for (idx = 0; idx < sge->nr_cap_states; idx++) {
if (sge->cap_states[idx].cap >= util)
- return idx;
+ break;
}
eenv->cap_idx = idx;
--
2.5.0
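To illustrate the find_new_capacity() change in the patch above, here is a minimal user-space sketch. The struct names, the function name and the clamp to the highest state are assumptions made for this example; they are not the kernel definitions. It only shows why breaking out of the loop, rather than returning from inside it, lets the index be recorded for later users.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct cap_state {
	unsigned long cap;	/* compute capacity at this OPP */
};

struct energy_env_sketch {
	int cap_idx;		/* capacity index chosen for the group */
};

/*
 * Pick the lowest capacity state that satisfies util and record it
 * in eenv->cap_idx. Breaking out of the loop (instead of returning
 * from inside it) guarantees the assignment below always runs.
 */
static int find_new_capacity_sketch(struct energy_env_sketch *eenv,
				    const struct cap_state *states,
				    int nr_cap_states, unsigned long util)
{
	int idx;

	assert(nr_cap_states > 0);

	for (idx = 0; idx < nr_cap_states; idx++) {
		if (states[idx].cap >= util)
			break;
	}

	/* Assumption of this sketch: clamp to the highest state when none fits. */
	if (idx == nr_cap_states)
		idx = nr_cap_states - 1;

	eenv->cap_idx = idx;
	return idx;
}
```

With three states of capacity 200, 500 and 1024, a utilization of 300 selects index 1 and leaves the same value in cap_idx.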
Hi all,
as promised during our meeting last week here is a link to the
report I've presented regarding "Task Estimation Utilization":
https://docs.google.com/document/d/1f2NpnYUS0ci_sLn4i2IY6ZKrRVOqRNnktI9xnt4…
Comments can be added to the document.
Patches are still under internal review, we will post the initial
proposal as soon as we get them validated.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
This patch series is a follow-up to the EASv5 power profiling on Hikey.
From the profiling results, rt-app-31/38/44 were inconsistent; in the end
I found that this issue can be fixed by these 4 patches. After applying
these patches, we get a good improvement for these cases (mW):
Energy      BestComb  Mainline(ndm)  noEAS(ndm)  EAS(ndm)  EAS(sched)  EAS(Applied Patches)
mp3              412         604.41      551.79    528.99      530.20                491.10
rt-app-6         676         864.18      846.72    792.88      840.33                759.96
rt-app-13        968        1222.47     1210.35   1673.04     1332.13               1253.99
rt-app-19       1348        1412.08     1474.86   1612.12     1421.28               1355.49
rt-app-25       1619        1718.67     1710.73   2104.41     2028.25               1584.25
rt-app-31       1878        1968.08     1965.87   2318.11     2976.59               1903.69
rt-app-38       2283        2580.23     2540.45   2576.46     2724.32               2241.29
rt-app-44       2578        3092.66     3056.92   2913.91     2669.91               2406.45
rt-app-50       2848        3492.36     3423.26   3489.14     3429.41               3290.25
This patch series is ONLY for EXPERIMENTAL purpose.
Leo Yan (4):
sched/fair: EASv5: Fix CPU shared capacity issue
sched/fair: EASv5: snapshot CPU's utilization
sched/fair: EASv5: Add CPU's total utilization
sched/fair: EASv5: update new capacity index
kernel/sched/fair.c | 88 +++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 1 +
2 files changed, 74 insertions(+), 15 deletions(-)
--
1.9.1
Hi all,
At Connect, Steve also raised related questions: pack tasks at a
higher OPP or spread tasks at a lower OPP? So I'd like to summarize this
and combine it with recent profiling results:
- When task_A is woken up, the scheduler needs to decide whether to pack
task_A onto a busy CPU or to spread task_A to an idle CPU.
If it packs task_A onto a busy CPU, this introduces a possible power
penalty caused by a higher OPP; on the other hand, if it spreads task_A
to an idle CPU (the idle CPU's cluster may also be in an idle state),
this introduces a power penalty caused by the extra power domain.
So I think we can enhance the energy calculation algorithm when waking
up the task in function energy_aware_wake_cpu(). For example, we can
select two candidate CPUs for the woken task: one candidate CPU is in
the same schedule group as the task's original CPU, and the other
candidate CPU is in another schedule group (this schedule group
should have the best or equal power efficiency in the system). Then
we can finally determine whether we need to spread the task to a
different cluster, spread it to a different CPU in the same cluster,
or just keep it on the original CPU.
- I also observed another possible scenario. For example, if
tasks have already been packed onto several CPUs, and every
task's workload is not very high (such as rt-app-13) but the tasks
accumulate load on one CPU, that CPU will finally run at a high OPP.
If EAS then picks only one of these tasks and tries to migrate the
task to another CPU, it usually will not migrate to that CPU. The
reason is that even though the target CPU already runs at a high OPP, it
usually still has capacity to run more workload at the highest OPP; so
energy_diff() will also get a worse power result after increasing the OPP,
and the task will stay on its original CPU. [1][2]
Even picking an idle CPU from another cluster cannot resolve
this issue, because if we spread the task to another cluster, the original
cluster's and CPU's OPP will not decrease, while the new cluster and CPU
introduce extra power.
So in this case we should take a global view and define some
criteria:
* CPUs don't stay at the lowest OPP, but the system has idle CPUs;
* A CPU's lower OPP can meet the capacity requirement for the
average load of all tasks;
* A CPU's lower OPP can meet the capacity requirement for the
highest-load task in the system.
If these criteria are met, EAS can select an idle CPU from the schedule
group with the best power efficiency.
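The three criteria above could be checked along the following lines. This is only a rough sketch: struct cpu_snapshot, should_spread_to_idle_cpu() and the lower_opp_cap parameter are hypothetical names for this example, and loads and capacities are assumed to be in the same unit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of one CPU; not a kernel structure. */
struct cpu_snapshot {
	bool idle;
	int opp_idx;		/* current OPP index, 0 is the lowest */
};

/*
 * Return true when the three criteria hold:
 *  1. some busy CPU runs above the lowest OPP while another CPU is idle;
 *  2. lower_opp_cap covers the average task load;
 *  3. lower_opp_cap covers the highest task load.
 */
static bool should_spread_to_idle_cpu(const struct cpu_snapshot *cpus, int nr_cpus,
				      const unsigned long *task_load, int nr_tasks,
				      unsigned long lower_opp_cap)
{
	bool has_idle = false, has_busy_above_lowest = false;
	unsigned long sum = 0, max = 0;
	int i;

	assert(nr_cpus > 0 && nr_tasks > 0);

	for (i = 0; i < nr_cpus; i++) {
		if (cpus[i].idle)
			has_idle = true;
		else if (cpus[i].opp_idx > 0)
			has_busy_above_lowest = true;
	}

	for (i = 0; i < nr_tasks; i++) {
		sum += task_load[i];
		if (task_load[i] > max)
			max = task_load[i];
	}

	return has_idle && has_busy_above_lowest &&
	       sum / (unsigned long)nr_tasks <= lower_opp_cap &&
	       max <= lower_opp_cap;
}
```

Note that the third criterion implies the second one (the average never exceeds the maximum); both are kept here only to mirror the list.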
I think you may have discussed this topic already, so before I start
trying these ideas, I want to check whether I missed some discussion;
any suggestions are welcome.
[1] http://people.linaro.org/~leo.yan/eas_profiling/eas_tasks_in_one_cluster_hi…
[2] http://people.linaro.org/~leo.yan/eas_profiling/eas_tasks_in_one_cluster_en…
Thanks,
Leo Yan
Fix the following compilation warnings:
rt-app_utils.c: In function ‘gettid’:
rt-app_utils.c:150:9: warning: implicit declaration of function ‘syscall’ [-Wimplicit-function-declaration]
return syscall(__NR_gettid);
^
rt-app_utils.c: In function ‘ftrace_write’:
rt-app_utils.c:277:4: warning: implicit declaration of function ‘write’ [-Wimplicit-function-declaration]
write(mark_fd, tmp, n);
^
mv -f .deps/rt-app_utils.Tpo .deps/rt-app_utils.Po
gcc -DHAVE_CONFIG_H -I. -I./../libdl/ -g -O2 -MT rt-app_args.o -MD -MP -MF .deps/rt-app_args.Tpo -c -o rt-app_args.o rt-app_args.c
rt-app_args.c: In function ‘parse_command_line’:
rt-app_args.c:44:3: warning: implicit declaration of function ‘parse_config’ [-Wimplicit-function-declaration]
parse_config(argv[1], opts);
^
rt-app_args.c:47:3: warning: implicit declaration of function ‘parse_config_stdin’ [-Wimplicit-function-declaration]
parse_config_stdin(opts);
^
mv -f .deps/rt-app_args.Tpo .deps/rt-app_args.Po
gcc -DHAVE_CONFIG_H -I. -I./../libdl/ -g -O2 -MT rt-app.o -MD -MP -MF .deps/rt-app.Tpo -c -o rt-app.o rt-app.c
rt-app.c:173:15: warning: return type defaults to ‘int’ [-Wimplicit-int]
static inline loadwait(unsigned long exec)
^
rt-app.c: In function ‘ioload’:
rt-app.c:195:9: warning: implicit declaration of function ‘write’ [-Wimplicit-function-declaration]
ret = write(io_fd, iomem->ptr, size);
^
rt-app.c: In function ‘run’:
rt-app.c:340:4: warning: ‘return’ with no value, in function returning non-void
return;
^
rt-app.c: In function ‘shutdown’:
rt-app.c:381:3: warning: implicit declaration of function ‘close’ [-Wimplicit-function-declaration]
close(ft_data.trace_fd);
^
rt-app.c: In function ‘main’:
rt-app.c:848:3: warning: implicit declaration of function ‘sleep’ [-Wimplicit-function-declaration]
sleep(opts.duration);
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
src/rt-app.c | 5 +++--
src/rt-app_args.c | 1 +
src/rt-app_parse_config.h | 2 ++
src/rt-app_utils.c | 2 +-
4 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/rt-app.c b/src/rt-app.c
index fef12d8..c3e5df4 100644
--- a/src/rt-app.c
+++ b/src/rt-app.c
@@ -21,6 +21,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#define _GNU_SOURCE
#include <fcntl.h>
+#include <unistd.h>
#include "rt-app.h"
#include "rt-app_utils.h"
#include <sched.h>
@@ -170,7 +171,7 @@ int calibrate_cpu_cycles(int clock)
}
-static inline loadwait(unsigned long exec)
+static inline unsigned long loadwait(unsigned long exec)
{
unsigned long load_count;
@@ -337,7 +338,7 @@ int run(int ind, event_data_t *events,
for (i = 0; i < nbevents; i++)
{
if (!continue_running && !lock)
- return;
+ return 0;
log_debug("[%d] runs events %d type %d ", ind, i, events[i].type);
if (opts.ftrace)
diff --git a/src/rt-app_args.c b/src/rt-app_args.c
index e16415d..c4d56de 100644
--- a/src/rt-app_args.c
+++ b/src/rt-app_args.c
@@ -19,6 +19,7 @@ along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
+#include "rt-app_parse_config.h"
#include "rt-app_args.h"
void
diff --git a/src/rt-app_parse_config.h b/src/rt-app_parse_config.h
index 023cabd..9b0e5fa 100644
--- a/src/rt-app_parse_config.h
+++ b/src/rt-app_parse_config.h
@@ -45,5 +45,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
void
parse_config(const char *filename, rtapp_options_t *opts);
+void
+parse_config_stdin(rtapp_options_t *opts);
#endif // _RTAPP_PARSE_CONFIG_H
diff --git a/src/rt-app_utils.c b/src/rt-app_utils.c
index c4840db..190affc 100644
--- a/src/rt-app_utils.c
+++ b/src/rt-app_utils.c
@@ -18,7 +18,7 @@ You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
-
+#include <unistd.h>
#include "rt-app_utils.h"
unsigned long
--
1.9.1
The interrupt framework gives a lot of information and statistics about
each interrupt (time accounting, statistics, ...).
Unfortunately there is no way to measure when interrupts occur and provide
a mathematical model for their behavior which could help in predicting
their next occurrence.
This framework allows for registering a callback function that is invoked
when an interrupt occurs.
Each time an interrupt occurs, the callback will be called with the interval
corresponding to the duration between two interrupts.
This framework allows a subsystem to register a handler in order to receive
the timing information for the registered interrupt. That gives other
subsystems the ability to compute predictions for the next interrupt occurrence.
The main objective is to track and detect the periodic interrupts in order
to predict the next event on a cpu and anticipate the sleeping time when
entering idle. This fine-grained approach simplifies and rationalizes
wake up event prediction without IPI interference, thus letting the
scheduler be smarter with the wakeup IPIs regarding the idle period.
The irq timings tracking showed, in the proof-of-concept, an improvement in
the predictions. The approach is correct, but my knowledge of the irq
subsystem is limited. I am not sure this patch measuring the irq time interval
is correct or acceptable, so it is at the RFC state (minus some polishing).
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
include/linux/interrupt.h | 45 ++++++++++++++++++++++++++++++++
include/linux/irqdesc.h | 3 +++
kernel/irq/Kconfig | 4 +++
kernel/irq/handle.c | 12 +++++++++
kernel/irq/manage.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 128 insertions(+), 1 deletion(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index be7e75c..f48e8ff 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -123,6 +123,51 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+#ifdef CONFIG_IRQ_TIMINGS
+/**
+ * timing handler to be called when an interrupt happens
+ */
+typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *, void *);
+
+/**
+ * struct irqtimings - per interrupt irq timings descriptor
+ * @handler: interrupt handler timings function
+ * @data: pointer to the private data to be passed to the handler
+ * @timestamp: latest interruption occurrence
+ */
+struct irqtimings {
+ irqt_handler_t handler;
+ void *data;
+} ____cacheline_internodealigned_in_smp;
+
+/**
+ * struct irqtimings_ops - structure to be used by the subsystem to call the
+ * register and unregister ops when an irq is setup or freed.
+ * @setup: registering callback
+ * @free: unregistering callback
+ *
+ * The callbacks assumes the lock is held on the irq desc
+ */
+struct irqtimings_ops {
+ int (*setup)(unsigned int, struct irqaction *);
+ void (*free)(unsigned int, void *);
+};
+
+extern int register_irq_timings(struct irqtimings_ops *ops);
+extern int setup_irq_timings(unsigned int irq, struct irqaction *act);
+extern void free_irq_timings(unsigned int irq, void *dev_id);
+#else
+static inline int setup_irq_timings(unsigned int irq, struct irqaction *act)
+{
+ return 0;
+}
+
+static inline void free_irq_timings(unsigned int irq, void *dev_id)
+{
+ ;
+}
+#endif
+
extern int __must_check
request_threaded_irq(unsigned int irq, irq_handler_t handler,
irq_handler_t thread_fn,
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index a587a33..e0d4263 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -51,6 +51,9 @@ struct irq_desc {
#ifdef CONFIG_IRQ_PREFLOW_FASTEOI
irq_preflow_handler_t preflow_handler;
#endif
+#ifdef CONFIG_IRQ_TIMINGS
+ struct irqtimings *timings;
+#endif
struct irqaction *action; /* IRQ action list */
unsigned int status_use_accessors;
unsigned int core_internal_state__do_not_mess_with_it;
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 9a76e3b..1275fd1 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -73,6 +73,10 @@ config GENERIC_MSI_IRQ_DOMAIN
config HANDLE_DOMAIN_IRQ
bool
+config IRQ_TIMINGS
+ bool
+ default y
+
config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
depends on IRQ_DOMAIN && DEBUG_FS
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index e25a83b..ca8b0c5 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -132,6 +132,17 @@ void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action)
wake_up_process(action->thread);
}
+#ifdef CONFIG_IRQ_TIMINGS
+void handle_irqt_event(struct irqtimings *irqt, struct irqaction *action)
+{
+ if (irqt)
+ irqt->handler(action->irq, ktime_get(),
+ action->dev_id, irqt->data);
+}
+#else
+#define handle_irqt_event(a, b)
+#endif
+
irqreturn_t
handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action)
{
@@ -165,6 +176,7 @@ handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action)
/* Fall through to add to randomness */
case IRQ_HANDLED:
flags |= action->flags;
+ handle_irqt_event(desc->timings, action);
break;
default:
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index f9a59f6..21cc7bf 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1017,6 +1017,60 @@ static void irq_release_resources(struct irq_desc *desc)
c->irq_release_resources(d);
}
+#ifdef CONFIG_IRQ_TIMINGS
+/*
+ * Global variable, only used by accessor functions, currently only
+ * one user is allowed and it is up to the caller to make sure to
+ * setup the irq timings which are already setup.
+ */
+static struct irqtimings_ops *irqtimings_ops;
+
+/**
+ * register_irq_timings - register the ops when an irq is setup or freed
+ *
+ * @ops: the register/unregister ops to be called when at setup or
+ * free time
+ *
+ * Returns -EBUSY if the slot is already in use, zero on success.
+ */
+int register_irq_timings(struct irqtimings_ops *ops)
+{
+ if (irqtimings_ops)
+ return -EBUSY;
+
+ irqtimings_ops = ops;
+
+ return 0;
+}
+
+/**
+ * setup_irq_timings - call the timing register callback
+ *
+ * @desc: an irq desc structure
+ *
+ * Returns -EINVAL in case of error, zero on success.
+ */
+int setup_irq_timings(unsigned int irq, struct irqaction *act)
+{
+ if (irqtimings_ops && irqtimings_ops->setup)
+ return irqtimings_ops->setup(irq, act);
+ return 0;
+}
+
+/**
+ * free_irq_timings - call the timing unregister callback
+ *
+ * @irq: the interrupt number
+ * @dev_id: the device id
+ *
+ */
+void free_irq_timings(unsigned int irq, void *dev_id)
+{
+ if (irqtimings_ops && irqtimings_ops->free)
+ irqtimings_ops->free(irq, dev_id);
+}
+#endif /* CONFIG_IRQ_TIMINGS */
+
/*
* Internal function to register an irqaction - typically used to
* allocate special interrupts that are part of the architecture.
@@ -1037,6 +1091,9 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
if (!try_module_get(desc->owner))
return -ENODEV;
+ ret = setup_irq_timings(irq, new);
+ if (ret)
+ goto out_mput;
/*
* Check whether the interrupt nests into another interrupt
* thread.
@@ -1045,7 +1102,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
if (nested) {
if (!new->thread_fn) {
ret = -EINVAL;
- goto out_mput;
+ goto out_free_timings;
}
/*
* Replace the primary handler which was provided from
@@ -1323,6 +1380,10 @@ out_thread:
kthread_stop(t);
put_task_struct(t);
}
+
+out_free_timings:
+ free_irq_timings(irq, new->dev_id);
+
out_mput:
module_put(desc->owner);
return ret;
@@ -1408,6 +1469,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
unregister_handler_proc(irq, action);
+ free_irq_timings(irq, dev_id);
+
/* Make sure it's not being used on another CPU: */
synchronize_irq(irq);
--
1.9.1
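As an illustration of how a subscriber might use the timings callback from the patch above, here is a user-space sketch. The handler in the patch receives a timestamp (ktime_get()), so the sketch derives the interval itself. The type irqt_handler_sketch_t, struct irq_interval and the plain nanosecond counter standing in for ktime_t are assumptions for this example only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space stand-in for the handler type in the patch; ktime_t is
 * replaced by a plain nanosecond counter (an assumption).
 */
typedef void (*irqt_handler_sketch_t)(unsigned int irq, uint64_t now_ns,
				      void *dev_id, void *data);

/* Hypothetical private data a subscriber could keep per interrupt. */
struct irq_interval {
	uint64_t last_ns;	/* timestamp of the previous interrupt */
	uint64_t interval_ns;	/* duration between the last two interrupts */
	unsigned int count;	/* number of interrupts seen so far */
};

/* Record the time elapsed since the previous occurrence of the interrupt. */
static void interval_handler(unsigned int irq, uint64_t now_ns,
			     void *dev_id, void *data)
{
	struct irq_interval *iv = data;

	(void)irq;
	(void)dev_id;

	assert(iv);
	if (iv->count > 0)
		iv->interval_ns = now_ns - iv->last_ns;
	iv->last_ns = now_ns;
	iv->count++;
}
```

The first call only stores the timestamp; from the second call on, interval_ns holds the duration between the two most recent interrupts.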
It is often useful to compute statistics on a set of data,
especially when the values arrive as a stream (measured over time
for instance).
Instead of having the statistics formulas inside a specific subsystem,
this small library provides basic statistics functions available to
the whole kernel.
The library is designed to do the minimum computation when a new value
is added. Only the basic value storage, array shifting and accumulation
are done.
The average, variance and standard deviation are computed when requested
via the corresponding functions.
The statistics library can deal with up to 65536 values in the range of
-2^24 to 2^24 - 1. These are large values and it does not really make sense
for kernel code to use the statistics at these limits, so it is up to the
developer to use the library wisely.
Signed-off-by: Daniel Lezcano <daniel.lezcano(a)linaro.org>
---
include/linux/stats.h | 29 +++++++
lib/Makefile | 3 +-
lib/stats.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 266 insertions(+), 1 deletion(-)
create mode 100644 include/linux/stats.h
create mode 100644 lib/stats.c
diff --git a/include/linux/stats.h b/include/linux/stats.h
new file mode 100644
index 0000000..664eb30
--- /dev/null
+++ b/include/linux/stats.h
@@ -0,0 +1,29 @@
+/*
+ * include/linux/stats.h
+ */
+#ifndef _LINUX_STATS_H
+#define _LINUX_STATS_H
+
+struct stats {
+ s64 sum; /* sum of values */
+ u64 sq_sum; /* sum of the square values */
+ s32 *values; /* array of values */
+ s32 min; /* minimal value of the entire series */
+ s32 max; /* maximal value of the entire series */
+ unsigned int n; /* current number of values */
+ unsigned int w_ptr; /* current window pointer */
+ unsigned short len; /* size of the value array */
+};
+
+extern s32 stats_max(struct stats *s);
+extern s32 stats_min(struct stats *s);
+extern s32 stats_mean(struct stats *s);
+extern u32 stats_variance(struct stats *s);
+extern u32 stats_stddev(struct stats *s);
+extern void stats_add(struct stats *s, s32 val);
+extern void stats_reset(struct stats *s);
+extern void stats_free(struct stats *s);
+extern struct stats *stats_alloc(unsigned short len);
+extern unsigned short stats_n(struct stats *s);
+
+#endif /* _LINUX_STATS_H */
diff --git a/lib/Makefile b/lib/Makefile
index 13a7c6a..18460d2 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -26,7 +26,8 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
- percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o
+ percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
+ stats.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
obj-y += hexdump.o
diff --git a/lib/stats.c b/lib/stats.c
new file mode 100644
index 0000000..f5425d1
--- /dev/null
+++ b/lib/stats.c
@@ -0,0 +1,235 @@
+/*
+ * Implementation of basics statistics functions useful to compute on
+ * a stream of data.
+ *
+ * Copyright: (C) 2015-2016 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/export.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/stats.h>
+#include <linux/types.h>
+
+/**
+ * stats_n - number of elements present in the statistics
+ *
+ * @s: the statistic structure
+ *
+ * Returns an unsigned short representing the number of values in the
+ * statistics series
+ */
+unsigned short stats_n(struct stats *s)
+{
+ return s->n;
+}
+
+/**
+ * stats_max - maximal value of the series
+ *
+ * @s: the statistic structure
+ *
+ * Returns a s32 representing the maximum value of the series
+ */
+s32 stats_max(struct stats *s)
+{
+ return s->max;
+}
+EXPORT_SYMBOL_GPL(stats_max);
+
+/**
+ * stats_min - minimal value of the series
+ *
+ * @s: the statistic structure
+ *
+ * Returns a s32 representing the minimal value of the series
+ */
+s32 stats_min(struct stats *s)
+{
+ return s->min;
+}
+EXPORT_SYMBOL_GPL(stats_min);
+
+/**
+ * stats_mean - compute the average
+ *
+ * @s: the statistics structure
+ *
+ * Returns an s32 corresponding to the mean value, or zero if there is
+ * no data
+ */
+s32 stats_mean(struct stats *s)
+{
+ return s->n ? s->sum / s->n : 0;
+}
+EXPORT_SYMBOL_GPL(stats_mean);
+
+/**
+ * stats_variance - compute the variance
+ *
+ * @s: the statistic structure
+ *
+ * Returns an u32 corresponding to the variance, or zero if there is no
+ * data
+ */
+u32 stats_variance(struct stats *s)
+{
+ s32 mean = stats_mean(s);
+ return s->n ? (s->sq_sum / s->n) - ((s64)mean * mean) : 0;
+}
+EXPORT_SYMBOL_GPL(stats_variance);
+
+/**
+ * stats_stddev - compute the standard deviation
+ *
+ * @s: the statistic structure
+ *
+ * Returns an u32 corresponding to the standard deviation, or zero if
+ * there is no data
+ */
+u32 stats_stddev(struct stats *s)
+{
+ return int_sqrt(stats_variance(s));
+}
+EXPORT_SYMBOL_GPL(stats_stddev);
+
+/**
+ * stats_add - add a new value in the statistic structure
+ *
+ * @s: the statistic structure
+ * @value: the new value to be added, max 2^24 - 1
+ *
+ * Adds the value to the array, if the array is full, then the array
+ * is shifted.
+ *
+ */
+void stats_add(struct stats *s, s32 value)
+{
+ /*
+ * In order to prevent an overflow in the statistic code, we
+ * limit the value to 2^24 - 1, if it is greater we just
+ * ignore it with a WARN_ON_ONCE letting know to userspace we
+ * are dealing with out-of-range values.
+ */
+ if (WARN_ON_ONCE(value > ((1 << 24) - 1)))
+ return;
+
+ /*
+ * Insert the value in the array. If the array is already
+ * full, shift the values and add the value at the end of the
+ * series, otherwise add the value directly.
+ */
+ if (likely(s->len == s->n)) {
+ s->sum -= s->values[s->w_ptr];
+ s->sq_sum -= (s64)s->values[s->w_ptr] * s->values[s->w_ptr];
+ s->values[s->w_ptr] = value;
+ s->w_ptr = (s->w_ptr + 1) % s->len;
+ } else {
+ s->values[s->n] = value;
+ s->n++;
+ }
+
+ /*
+ * Keep track of the min and max.
+ */
+ s->min = min(s->min, value);
+ s->max = max(s->max, value);
+
+ /*
+ * In order to reduce the overhead and to prevent value
+ * derivation due to the integer computation, we just sum the
+ * value and do the division when the average and the variance
+ * are requested.
+ */
+ s->sum += value;
+ s->sq_sum += (s64)value * value;
+}
+EXPORT_SYMBOL_GPL(stats_add);
+
+/**
+ * stats_reset - reset the stats
+ *
+ * @s: the statistic structure
+ *
+ * Reset the statistics and reset the values
+ *
+ */
+void stats_reset(struct stats *s)
+{
+ s->sum = s->sq_sum = s->n = s->w_ptr = 0;
+ s->max = S32_MIN;
+ s->min = S32_MAX;
+}
+EXPORT_SYMBOL_GPL(stats_reset);
+
+/**
+ * stats_alloc - allocate a structure for statistics
+ *
+ * @len: The number of items in the array which is limited by the
+ * unsigned short type. That allows to prevent overflow in all
+ * the statistics code.
+ *
+ * Allocates memory to store the different values and initialize the
+ * structure.
+ *
+ * In order to prevent an overflow in the computation, the maximum
+ * allowed number of values is 65536 and each value max is 2^24 - 1.
+ *
+ * The variance is the sum of the square value of the difference of
+ * each value to the average. The variance is a u64 (square values are
+ * always positives), so that gives a maximum of 18446744073709551615.
+ * We can store 65536 values, so:
+ *
+ * 18446744073709551615 / 65536 = 281474976710655
+ *
+ * ... is the square max value we can have, hence the difference to
+ * the mean is max sqrt(281474976710655) = 16777215 (2^24 -1)
+ *
+ * Even if these values are not realistics (statistic in the kernel is
+ * for a few hundred values, large dispersion in the integer limits is
+ * very very rare so the sum won't be very high, or high integers
+ * series means low variance), we prevent any overflow in the code and
+ * we are safe.
+ *
+ * Returns a valid pointer a struct stats, NULL if the memory
+ * allocation fails.
+ */
+struct stats *stats_alloc(unsigned short len)
+{
+ struct stats *s;
+ s32 *values;
+
+ s = kzalloc(sizeof(*s), GFP_KERNEL);
+ if (s) {
+ values = kzalloc(sizeof(*values) * len, GFP_KERNEL);
+ if (!values) {
+ kfree(s);
+ return NULL;
+ }
+
+ s->values = values;
+ s->len = len;
+ s->min = S32_MAX;
+ s->max = S32_MIN;
+ }
+
+ return s;
+}
+EXPORT_SYMBOL_GPL(stats_alloc);
+
+/**
+ * stats_free - free the statistics structure
+ *
+ * @s : the statistics structure
+ *
+ * Frees the memory allocated by the function stats_alloc.
+ */
+void stats_free(struct stats *s)
+{
+ kfree(s->values);
+ kfree(s);
+}
+EXPORT_SYMBOL_GPL(stats_free);
--
1.9.1
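For readers who want to experiment with the sliding-window accumulation from the statistics patch above outside the kernel, here is a user-space sketch. It is not the library itself: the fixed WINDOW_LEN, the stats_sketch_* names and the use of stdint types are assumptions for this example. The squares are computed in 64 bits; in the worst case described in the patch comment, 65536 values of 2^24 - 1 give a sum of squares of 65536 * (2^24 - 1)^2, which stays below 2^64.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_LEN 4	/* illustrative window size, not from the patch */

struct stats_sketch {
	int64_t sum;		/* sum of the values in the window */
	uint64_t sq_sum;	/* sum of the squared values in the window */
	int32_t values[WINDOW_LEN];
	unsigned int n;		/* number of values currently stored */
	unsigned int w_ptr;	/* slot holding the oldest value once full */
};

/* Add a value, evicting the oldest one once the window is full. */
static void stats_sketch_add(struct stats_sketch *s, int32_t value)
{
	assert(s);

	if (s->n == WINDOW_LEN) {
		int64_t old = s->values[s->w_ptr];

		s->sum -= old;
		s->sq_sum -= (uint64_t)(old * old);
		s->values[s->w_ptr] = value;
		s->w_ptr = (s->w_ptr + 1) % WINDOW_LEN;
	} else {
		s->values[s->n++] = value;
	}

	s->sum += value;
	/* Widen before multiplying so the square cannot overflow 32 bits. */
	s->sq_sum += (uint64_t)((int64_t)value * value);
}

static int64_t stats_sketch_mean(const struct stats_sketch *s)
{
	return s->n ? s->sum / (int64_t)s->n : 0;
}

/* Variance as E[x^2] - E[x]^2, using integer division as the patch does. */
static uint64_t stats_sketch_variance(const struct stats_sketch *s)
{
	int64_t mean = stats_sketch_mean(s);

	return s->n ? s->sq_sum / s->n - (uint64_t)(mean * mean) : 0;
}
```

Adding 2, 4, 4, 6 gives a mean of 4 and a variance of 2; adding 8 afterwards evicts the 2, so the window becomes 8, 4, 4, 6.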
The current approach to select an idle state is based on the idle period
statistics computation.
Needless to say, this approach satisfied everyone as a solution to find the
best trade-off between performance and energy saving via the menu
governor.
However, the kernel is evolving to act proactively regarding the energy
constraints with the scheduler, and the different power management subsystems
are not collaborating with the scheduler as the conductor of the decisions;
they all act independently.
In order to integrate the cpuidle framework into the scheduler, we have to
radically change the approach by clearly identifying what is causing a wake
up and how it behaves. The cpuidle governors are based on idle period
statistics, hence without knowledge of what woke up the cpu. Among these sources
of wake ups, the IPIs are of course accounted, which results in doing statistics
on the scheduler behavior too. It makes no sense to let the scheduler take a
decision based on a prediction of its own decisions.
This series inverts the logic.
First there is a small statistics library to do basic and fast statistics
computation, put in the library directory and made available to everyone.
It is mathematically proven there is no overflow in the code (check the log
and comments).
The second patch provides a callback to be registered in the irq subsystem
and to be called with a timestamp when an interrupt is handled. Interrupts
related to timers are discarded.
The third patch uses the callback provided by the patch above to compute an
average for each interrupt on each cpu. When the interrupt intervals are within
the mean value +/- the standard deviation, the source of wake up is considered
stable and enters the 'predictable' category. The next predicted
wakeup for a specific cpu is then the minimum of the remaining times until each
interrupt's next predicted occurrence and the timer.
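The prediction rule described for the third patch can be sketched roughly as follows. This is not the code from the series: struct irq_pred, its fields, the microsecond unit and the choice to treat an overdue interrupt as imminent are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-interrupt summary; field names are illustrative. */
struct irq_pred {
	int64_t mean_us;		/* mean interval between interrupts */
	int64_t stddev_us;		/* standard deviation of the intervals */
	int64_t last_interval_us;	/* most recent interval */
	int64_t since_last_us;		/* time elapsed since the last interrupt */
};

/* "Predictable" here means the last interval lies within mean +/- stddev. */
static bool irq_is_predictable(const struct irq_pred *p)
{
	return p->last_interval_us >= p->mean_us - p->stddev_us &&
	       p->last_interval_us <= p->mean_us + p->stddev_us;
}

/* Minimum remaining time over the predictable interrupts and the next timer. */
static int64_t next_wakeup_us(const struct irq_pred *irqs, int nr_irqs,
			      int64_t next_timer_us)
{
	int64_t next = next_timer_us;
	int i;

	assert(nr_irqs >= 0);

	for (i = 0; i < nr_irqs; i++) {
		int64_t remaining;

		if (!irq_is_predictable(&irqs[i]))
			continue;

		remaining = irqs[i].mean_us - irqs[i].since_last_us;
		if (remaining < 0)
			remaining = 0;	/* assumption: overdue means imminent */
		if (remaining < next)
			next = remaining;
	}

	return next;
}
```

Interrupts outside the 'predictable' category are simply ignored, so with no stable interrupt source the timer alone decides the expected wakeup.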
These are the results with a workload emulator (mp3, video, browser, ...) on
a Dual Xeon 6 cores. Each test has been run 10 times.
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-browser.sh.menu.dat:
N min max sum mean stddev
10 56.51 68.61 631.27 63.127 3.6882
scripts/rt-app-browser.sh.irq.dat:
N min max sum mean stddev
10 72.88 79.94 774.43 77.443 2.10055
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-mp3.sh.menu.dat:
N min max sum mean stddev
10 65.4 69.53 675.51 67.551 1.42503
scripts/rt-app-mp3.sh.irq.dat:
N min max sum mean stddev
10 82.03 92.13 854.69 85.469 2.63553
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-video.sh.menu.dat:
N min max sum mean stddev
10 57.69 77.72 625.58 62.558 5.54488
scripts/rt-app-video.sh.irq.dat:
N min max sum mean stddev
10 73.19 75.2 742.33 74.233 0.752316
--------------------------
Successful predictions (%)
--------------------------
scripts/video.sh.menu.dat:
N min max sum mean stddev
10 40.7 59.08 463.02 46.302 5.25094
scripts/video.sh.irq.dat:
N min max sum mean stddev
10 29.64 84.59 425.58 42.558 16.007
The next prediction algorithm is very simple at the moment but it opens
the door to the following improvements:
- Detect patterns (eg. 1, 1, 3, 1, 1, 3, ...)
- Each device behaves differently, thus the prediction algorithm can be
per interrupt. Eg. disk ios have a burst of fast interrupts followed by
a couple of slow interrupts.
If a simplistic algorithm gives better results than the menu governor,
there is a high probability an optimized one will do much better.
* Regarding how this integrates into the scheduler
At the moment the integration is only a first step, hence it is very
small: when the scheduler tries to find a cpu, it avoids
using an idle cpu whose idle period has not yet reached the energy break even.
The API to enter idle is simplified on purpose, so that the scheduler
can take a decision between the moment it asks when the next wakeup is
expected on the cpu and the moment it enters idle.
- sched_idle_next_wakeup() => returns a s64 telling the remaining time before
a wakeup occurs
- sched_idle(duration, latency) => goes idle with the specified duration and
the latency constraint
Daniel Lezcano (9):
lib: Add a simple statistics library
irq: Add a framework to measure interrupt timings
sched: idle: IRQ based next prediction for idle period
sched-idle: Plug sched idle with the idle task
cpuidle: Add statistics and debug information with debugfs
cpuidle: Store the idle start time stamp
sched: fair: Fix wrong idle timestamp usage
sched/fair: Prevent to break the target residency
sched-idle: Add a debugfs entry to switch from cpuidle to sched-idle
Nicolas Pitre (1):
idle-sched: Add a trace event when an interrupt occurs
drivers/cpuidle/Kconfig | 12 ++
drivers/cpuidle/Makefile | 2 +
drivers/cpuidle/cpuidle.c | 16 +-
drivers/cpuidle/debugfs.c | 232 +++++++++++++++++++++++
drivers/cpuidle/debugfs.h | 19 ++
include/linux/cpuidle.h | 16 ++
include/linux/interrupt.h | 45 +++++
include/linux/irqdesc.h | 3 +
include/linux/stats.h | 29 +++
include/trace/events/irq.h | 44 +++++
kernel/irq/Kconfig | 3 +
kernel/irq/handle.c | 12 ++
kernel/irq/manage.c | 65 ++++++-
kernel/sched/Makefile | 1 +
kernel/sched/fair.c | 44 +++--
kernel/sched/idle-sched.c | 449 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/idle.c | 11 +-
kernel/sched/sched.h | 20 ++
lib/Makefile | 3 +-
lib/stats.c | 235 ++++++++++++++++++++++++
20 files changed, 1239 insertions(+), 22 deletions(-)
create mode 100644 drivers/cpuidle/debugfs.c
create mode 100644 drivers/cpuidle/debugfs.h
create mode 100644 include/linux/stats.h
create mode 100644 kernel/sched/idle-sched.c
create mode 100644 lib/stats.c
--
1.9.1
Hi,
Geekbench is a widely used benchmark on Android and iOS devices. It's
also available on Linux and Mac OS X. It provides a cross-platform
benchmark which we can use as an index of how powerful the tested CPUs are.
I was trying to run Geekbench on MTK's internal EAS-enabled Android-based
device and found that the Geekbench scores are lower than expected because of
the relatively slow response time of PELT. See the attached pdf for a bit more
description of a Geekbench trace analysed using TRAPpy.
--
// freedom
Dear All
I am going through the EAS project work and trying to port it to my
ARM-based SMP system (Linux version 3.10).
Could you please help me clarify: will EAS also be helpful in terms of
power/performance for SMP systems?
Thanks & Regards
Nitish Ambastha
To summarize the current problem with idle CPU capacity votes:
- When the last task on a CPU (say CPU X) sleeps and the CPU goes idle,
we currently drop its capacity vote to zero. We do not immediately
update the cluster frequency based on this information however.
- It depends on when other CPUs in the frequency domain have an event
which forces re-evaluation of the capacity votes and corresponding
frequency. It could occur right away, lowering the frequency only to
require raising it again immediately if CPU X is idle a very short time.
Or it could be a very long time before such an event occurs which will
leave the cluster at an unnecessarily high OPP and waste energy.
I have a draft of a change which modifies the nohz idle balance path a
bit to ensure that update_blocked_averages() is called for tickless idle
CPUs at least every X ms. This alone won't solve the above problems
though. You need to force re-evaluation of the capacity votes somewhere
to update the cluster frequency. I was originally going to call into
cpufreq_sched as idle CPU loads are decayed to update the frequency
there but folks didn't seem to like this during Thursday's call.
We could get rid of the clearing of the capacity vote when entering idle
and use a passive update when decaying idle CPU utilizations (setting
the capacity vote but not triggering a re-evaluation of cluster
frequency). That would solve the problem of risking the cluster
frequency dropping to fmin during a very short idle and having to be
immediately ramped up again. It will not solve the issue of the cluster
potentially getting stuck idle at fmax/high frequency for long periods
of time and wasting energy though.
There's been some discussion on this issue in the context of integration
of cpuidle with cpufreq and the scheduler (see attached). Rather than
force regular load decay updates via the load balancer and figure out
when to force frequency re-evaluation I'm inclined to just remove the
clearing of the capacity vote in dequeue_task_fair when going idle and
tackle this problem within cpuidle as part of an energy aware/platform
aware decision (see #2 in the attachment). A possible policy in cpuidle
might look like:
- If it's a short idle, don't bother removing capacity vote.
- If it's a long idle and the system doesn't burn extra power in idle at
elevated frequency, passively remove the capacity vote. Frequency gets
adjusted if another CPU has a freq-evaluating event, like today.
- If it's a long idle and the system burns extra power in idle, actively
remove the capacity vote, immediately adjusting frequency if needed.
A slack timer mechanism may still be desirable in cpuidle to guard
against the prediction being wrong (you think it's a short idle and
leave a high capacity vote in, but it ends up being a long idle).
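The three-way policy listed above can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum, the function name, the "short idle" threshold and the boolean platform flag are all hypothetical, and the slack timer is not modelled.

```c
#include <stdbool.h>

/* What to do with an idle CPU's capacity vote (names are illustrative). */
enum vote_action {
	VOTE_KEEP,		/* short idle: leave the vote in place */
	VOTE_REMOVE_PASSIVE,	/* drop the vote, let another CPU's event apply it */
	VOTE_REMOVE_ACTIVE,	/* drop the vote and re-evaluate frequency now */
};

/*
 * predicted_idle_us: idle duration predicted by cpuidle
 * short_idle_us:     hypothetical threshold below which an idle is "short"
 * idle_power_scales: true if the platform burns extra power in idle at
 *                    elevated frequency
 */
enum vote_action idle_vote_policy(unsigned int predicted_idle_us,
				  unsigned int short_idle_us,
				  bool idle_power_scales)
{
	if (predicted_idle_us < short_idle_us)
		return VOTE_KEEP;
	return idle_power_scales ? VOTE_REMOVE_ACTIVE : VOTE_REMOVE_PASSIVE;
}
```

A slack timer would sit next to this: armed when VOTE_KEEP is returned, and removing the vote if the CPU is still idle when it fires.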
Thanks if you've read this far! Also, I hope to migrate these
discussions to lkml+linux-pm. Perhaps after the next sched-freq RFC
posting which will surely spawn discussions there anyway and get
everyone up to speed on our current status and issues, making it a good
cutover point.
Hi Juri, Steve,
It looks like if cpuX sets its own OPP level in task_tick_fair (to
capacity_orig), another cpuY can override this to any value (at least
via enqueue_task_fair) before cpuX's request can take effect (i.e.
before the throttling timestamp is updated via the kcpufreq thread). The
request from cpuX at the next tick may be throttled or the task may go
to sleep and its load is decayed enough that the next request after
wakeup no longer crosses the threshold and hence we lose the opportunity
to go to FMAX. It seems like we need a mechanism where a cpu's current
higher request for its own capacity overrides any other cpu's lower
request?
Thanks,
Vikram
Hello,
I'm using the EASv5 3.18 tree with cpufreq_sched. With the sched
governor enabled I've noticed that after a migration or after a switch
from a non-fair task to the idle task, the source CPU goes idle and its
(possibly max) capacity request stays in place, preventing other
requests from going through until that source CPU decides to wake up and
take up some work. I know that there are some ongoing discussions about
how to actually enforce a frequency reduction when a CPU enters idle to
save power, but this seems to be a more immediate problem since the
other CPU(s)' requests are also basically ignored. How about a
reset_capacity call in pick_next_task_idle? Throttling is a concern I
suppose, but I think the check in dequeue_task_fair is doing the same
thing already, so the following would just repeat for
non_fair_class->idle_task.
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index c65dac8..555c21d 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,8 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
 	put_prev_task(rq, prev);
+	cpufreq_sched_reset_cap(cpu_of(rq));
+
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }
Thanks,
Vikram
In cpufreq_sched_set_cap we currently have this:
	/*
	 * We only change frequency if this cpu's capacity request represents a
	 * new max. If another cpu has requested a capacity greater than the
	 * previous max then we rely on that cpu to hit this code path and make
	 * the change. IOW, the cpu with the new max capacity is responsible
	 * for setting the new capacity/frequency.
	 *
	 * If this cpu is not the new maximum then bail
	 */
	if (capacity_max > capacity)
		goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap
request of 0):
1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not
because the CPU we were working with at the time (CPU0) was not the CPU
driving the new cluster maximum capacity request. As a result we run
unnecessarily at fmax for a long time.
Any reason to not set the OPP associated with the new max capacity
request immediately, regardless of what CPU is driving it?
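The four-step scenario can be replayed with a small userspace model. This is a sketch, not the cpufreq_sched code: the struct and function names are invented, and "changing frequency" is reduced to storing the capacity the cluster runs at.

```c
#define NR_CPUS_IN_CLUSTER 2

/* Per-CPU capacity votes for one frequency domain (0..1024). */
struct freq_domain {
	unsigned long vote[NR_CPUS_IN_CLUSTER];
	unsigned long cur_cap;	/* capacity the cluster currently runs at */
};

static unsigned long domain_max(const struct freq_domain *fd)
{
	unsigned long max = 0;
	int i;

	for (i = 0; i < NR_CPUS_IN_CLUSTER; i++)
		if (fd->vote[i] > max)
			max = fd->vote[i];
	return max;
}

/* Models the quoted check: only the CPU owning the new max changes frequency. */
void set_cap_bail(struct freq_domain *fd, int cpu, unsigned long capacity)
{
	fd->vote[cpu] = capacity;
	if (domain_max(fd) > capacity)
		return;		/* this cpu is not the new maximum: bail */
	fd->cur_cap = capacity;
}

/* Alternative: always apply the new domain max, whichever CPU drives it. */
void set_cap_always(struct freq_domain *fd, int cpu, unsigned long capacity)
{
	fd->vote[cpu] = capacity;
	fd->cur_cap = domain_max(fd);
}
```

Replaying steps 1 to 3 (CPU0 = 1024, CPU1 = 10, CPU0 = 0) leaves cur_cap at 1024 with set_cap_bail() and at 10 with set_cap_always().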
thanks,
Steve
Hi Dietmar, Juri,
I'm evaluating EAS RFCv5 on Qualcomm SoCs. I have a question about the
CPU invariant utilization tracking in __update_entity_load_avg.
I have Juri's arch-support patches for arm64 such that
arch_scale_cpu_capacity(cpu)
- returns the maximum capacity of the CPU from the energy model
Consider a dual-cluster system that has equally capable CPUs in both
clusters with each cluster on its own clock source, but with different
max OPP levels for each cluster. If both the clusters are at the same
OPP level, the lower-max-opp cluster would accumulate utilization_avg at
a slower rate. This doesn't seem right; at the same frequency, a task
should exhibit equal performance on either cluster since the CPUs are
otherwise equally capable. On such a system scale_cpu should be 1024 for
both clusters at least in the context of utilization tracking. What do
you think?
Also, please confirm my understanding with respect to traditional
big.LITTLE, where utilization will accumulate slower on a little CPU
due to the CPU invariance factor. I can understand the frequency scaling
factor: a task consumes more absolute CPU cycles at a higher frequency.
Now it would seem that given the same frequency scaling invariance
factor (say the little CPU's is 500MHz/1000MHz and the big CPU's is
1000MHz/2000MHz), we still want to have the little CPU accumulate
utilization slower because the amount of work done (in IPC terms
perhaps) is less on the little CPU?
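The big.LITTLE case in the last paragraph can be put into numbers with a minimal sketch of the two-factor scaling. It assumes both factors are fixed-point values on a 0..1024 scale applied one after the other. The helper names are illustrative, and the little CPU's scale_cpu value used in the example (512) is made up.

```c
#define SCHED_CAPACITY_SHIFT 10

/* Frequency invariance factor: current / maximum frequency, 0..1024. */
unsigned long freq_scale(unsigned long cur_khz, unsigned long max_khz)
{
	return (cur_khz << SCHED_CAPACITY_SHIFT) / max_khz;
}

/*
 * Scale a running-time delta by the frequency factor and then by the
 * CPU factor (both 0..1024).
 */
unsigned long scale_delta(unsigned long delta, unsigned long scale_freq,
			  unsigned long scale_cpu)
{
	delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
	return (delta * scale_cpu) >> SCHED_CAPACITY_SHIFT;
}
```

With 500MHz/1000MHz on the little CPU and 1000MHz/2000MHz on the big CPU, both frequency factors are 512. A delta of 1024 then contributes 512 on the big CPU (scale_cpu = 1024) but only 256 on a little CPU with the assumed scale_cpu of 512.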
Thanks,
Vikram
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.
Hi Daniel,
I've reviewed the draft doc you sent on cpuidle/cpufreq integration. Things have progressed a bit since the initial round of comments in the doc in April, also I thought it'd be good to open the discussion on eas-dev, so I figured I'd try and summarize the main points of the doc here and comment on them. Apologies if I mis-state anything, please correct me if necessary.
Some high level thoughts:
- I have yet to see a platform where race to idle is a win for power. Because of the exponential shape of the power/perf curves and other issues such as random wakeups interrupting deep sleep, it's always been better in my experience to run at as low an OPP as possible within the performance requirements of the workload. As a general policy at least.
- The validation tests mentioned in the doc seemed to be focused uniformly on performance but I think power measurements must be given equal consideration and mention in validating any of these changes.
My comments on each of the individual proposals:
1. Managing frequency during idle when blocked load is high.
The proposal in this section was to keep the frequency unchanged when entering idle and there is significant blocked load. Blocked load has since been included in the utilization metric which determines frequency. But the policy in sched-freq on what to do when the last task on a runqueue blocks (i.e. the CPU goes idle) is still evolving.
Currently the CPU's CFS capacity vote in the frequency domain is passively dropped when that CPU goes idle. It zeros out its capacity vote but does not trigger a recalculation of the frequency domain's new overall required capacity or set a new frequency. This means that the frequency will remain as-is until a different event occurs which forces re-evaluation of the CPU capacity votes in the frequency domain. I won't go into enumerating those events here, suffice it to say I don't think the current policy will work. It's possible for the CPU to stay at an elevated frequency for far too long which would have an unacceptable power impact.
I'm also concerned about the way blocked load is included in the utilization metric and potentially keying off that for frequency during periods of idle. It's certainly a more power-hungry policy than what is in place today, plus there's really no way to tune it since it's part of the per-entity load tracking scheme. The interactive governor had a tunable (the slack timer) which controlled how long you could sit idle at a frequency greater than fmin.
Schedtune aims to provide a per-task tunable which can boost/scale up the load value for that task calculated by PELT. So that would provide some mechanism to tweak this although it affects the task's contribution at all times rather than only when it is blocked. It also currently only is built to inflate/scale up a task's demand rather than decrease it.
2. Make the idle task in charge of changing the frequency to minimum
Proposal is to have idle main loop set frequency to minimum, do idle, and then restore frequency coming out of idle.
My thinking is that when we enter idle, somewhere there has to be a decision made as to whether it is worth it to change frequency. The input for that decision could include
- the current frequency
- the energy data of the target (power consumption at each frequency at the C-state we will be entering)
- the expected duration of the idle period
- the latency of changing between the two frequencies in question, as well as the latency of C-state entry and exit
- performance and latency requirements for the currently blocked tasks
The idle task seems to me like a reasonable place for the decision logic, which could then call some not-yet-existent API into sched-freq to ask for the frequency change. Sched class capacity votes would be retained and reinstated on idle exit. The temporary idle frequency would respect any min or max freq constraints that had been previously registered in cpufreq.
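As a rough sketch of what such decision logic could look like, the inputs listed above can be gathered into one structure and checked for break-even. Everything here is hypothetical: the field names, the explicit switch_energy_nj cost and the wakeup latency bound are assumptions for illustration, not an existing sched-freq or cpuidle API.

```c
#include <stdbool.h>

struct idle_freq_inputs {
	unsigned int idle_power_cur_mw;	/* idle power at the current frequency */
	unsigned int idle_power_min_mw;	/* idle power at the lower frequency */
	unsigned int expected_idle_us;	/* predicted idle duration */
	unsigned int freq_switch_us;	/* latency of one frequency transition */
	unsigned int cstate_latency_us;	/* C-state entry + exit latency */
	unsigned int switch_energy_nj;	/* assumed cost of switching down and back */
	unsigned int max_wakeup_latency_us; /* bound from blocked tasks / QoS */
};

bool worth_lowering_freq(const struct idle_freq_inputs *in)
{
	unsigned int overhead_us = 2 * in->freq_switch_us + in->cstate_latency_us;
	unsigned long long saved_nj;

	/* Waking up via the lower frequency must still meet the latency bound. */
	if (in->cstate_latency_us + in->freq_switch_us > in->max_wakeup_latency_us)
		return false;
	/* The idle period must be long enough to fit both transitions. */
	if (in->expected_idle_us <= overhead_us)
		return false;
	/* Nothing to gain if idle power does not drop with frequency. */
	if (in->idle_power_cur_mw <= in->idle_power_min_mw)
		return false;
	/* mW * us = nJ */
	saved_nj = (unsigned long long)(in->idle_power_cur_mw - in->idle_power_min_mw) *
		   (in->expected_idle_us - overhead_us);
	return saved_nj > in->switch_energy_nj;
}
```

The same check also covers the case where the expected sleep is shorter than the frequency transition latency, since the function then returns false before any energy comparison.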
3. The expected sleep duration is less than the frequency transition latency
Agreed this needs to be considered, a full implementation of the logic I mentioned in #2 would cover this one.
4. Align frequency with task priority
I'd agree with MikeT's comment in the doc that changing frequency according to the niceness of the running tasks would be a pretty big change in semantics that we should stay away from. Schedtune may offer a way to negatively bias the performance of a task, although currently it only can be used to inflate a task's performance demands.
5. Consider CPU frequency when evaluating wake-up latency
Agreed this should be taken into account. Again would be part of the logic I mentioned in #2 I think, deciding whether we can even change frequencies during idle at all (possibly ruled out due to QoS constraints), and then if it is possible, whether it is worth it from an energy standpoint.
6. Multiple freq governors as input to cpufreq core
I'd agree this seems like it'd have value but barring a major shift in priorities, it's going to be a while before this would get focus given the effort required just to get the basic sched-freq feature merged.
7. Increase freq of little cluster when freq of big cluster increases, to make migrations back to little faster.
This wouldn't be for everyone IMO due to the power impact. I'm not sure where exactly this policy would go, especially given the current crusade in the community against plugin governors and tunables. Perhaps in the platform-level cpufreq driver.
thanks,
Steve
It was pointed out in today's technical syncup that sched-dvfs should rely on its own static key, and that static key should not indicate a dependency on EAS.
The static key was named sched_energy_freq. I've renamed/shortened it to sched_freq. I also think we should try and use this name for the feature in general since I believe fewer people are familiar with the term/acronym DVFS. A git grep -i dvfs suggests it's almost exclusively an ARM term, at least as far as the kernel source goes, with a little usage in sound codecs and GPU.
If anyone has a better name or concerns please share...
thanks,
Steve
Hi all,
Below is the raw power data from the Hikey board; with this data I'd like to
create the power model for Hikey.
- Measurement method:
On the Hikey board we cannot measure Buck1, which is dedicated to the AP
subsystem, so I measured VDD_4V2 instead, removing R247 and mounting a
470mOhm shunt resistor. As a result, the power data includes the power of
many other LDOs.
                          +--------------+       +-------------+
  4.2v                    |              | Buck1 |             |
  ---- Shunt Resistor --->| PMIC: Hi6553 |------>| SoC: Hi6220 |
     ^                  ^ |              |       |    ACPU     |
     |                  | +--------------+       +-------------+
     |-> Energy Probe <-|
- Measured raw data:
sys_suspend: AP system suspend state
cluster_off: two clusters are powered off
cluster_on: 1 cluster is powered on, all cpus are powered off
cpu_wfi: 1 cluster is powered on, last cpu enters 'wfi' but other cpus
are powered off
voltage: voltage for every OPP
# Power columns in mW
# OPP [MHz]  sys_suspend  cluster_off  cluster_on  cpu_wfi  cpu_on  voltage [V]
        208          328          347         366      374     435         1.04
        432          328          344         374      388     499         1.04
        729          331          351         400      409     606         1.09
        960          329          353         430      443     750         1.18
       1200          331          365         486      506     988         1.33
[ASCII plot: Hikey Power Model. Y axis: Power [mW]; X axis: Frequency [MHz] at 208, 432, 729, 960, 1200. Series: A = cluster_off, B = cluster_on, C = cpu_wfi, D = cpu p-state.]
[ASCII plot: Hikey Power Efficiency. Left Y axis: Power Efficiency [mW/MHz]; right Y axis: Voltage [V]; X axis: Frequency [MHz] at 208, 432, 729, 960, 1200. Series: A = Cluster Power, B = CPU Static Power, C = CPU Dynamic Power, D = Voltage.]
- Power Model On Hikey:
Following our earlier discussion of the power model, I think below is the
preferred power data for Hikey, calculated from the raw power data:
static struct idle_state idle_states_cluster_a53[] = {
	{ .power = 0 },
	{ .power = 0 },
};
/*
 * Use (cluster_on - cluster_off) for every OPP
 */
static struct capacity_state cap_states_cluster_a53[] = {
	/* Power per cluster */
	{ .cap = 178, .power = 19, },	/* 208MHz */
	{ .cap = 369, .power = 30, },	/* 432MHz */
	{ .cap = 622, .power = 49, },	/* 729MHz */
	{ .cap = 819, .power = 77, },	/* 960MHz */
	{ .cap = 1024, .power = 121, },	/* 1.2GHz */
};
/*
 * Use (cpu_wfi - cluster_on) for every OPP, then calculate the
 * average value for wfi's power data; But we can see actually
 * the idle state of "WFI" will be impacted by voltage.
 */
static struct idle_state idle_states_core_a53[] = {
	{ .power = 12 },
	{ .power = 0 },
};
/*
 * Use (cpu_on - cluster_on) for every OPP
 */
static struct capacity_state cap_states_core_a53[] = {
	/* Power per cpu */
	{ .cap = 178, .power = 69, },	/* 208MHz */
	{ .cap = 369, .power = 125, },	/* 432MHz */
	{ .cap = 622, .power = 206, },	/* 729MHz */
	{ .cap = 819, .power = 320, },	/* 960MHz */
	{ .cap = 1024, .power = 502, },	/* 1.2GHz */
};
If you have any questions or issues with the energy model data above, please let me know;
I'd appreciate review and comments in advance.
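For anyone who wants to re-derive the .power values from the raw readings, here is a small userspace sketch. The struct and function names are made up for illustration. Note that the per-CPU values in the model match cpu_on - cluster_on for every OPP.

```c
#define NR_OPPS 5

/* Raw readings from the measurement table, one row per OPP. */
struct raw_power {
	unsigned int freq_mhz;
	unsigned int cluster_off;
	unsigned int cluster_on;
	unsigned int cpu_wfi;
	unsigned int cpu_on;
};

static const struct raw_power hikey_raw[NR_OPPS] = {
	{  208, 347, 366, 374, 435 },
	{  432, 344, 374, 388, 499 },
	{  729, 351, 400, 409, 606 },
	{  960, 353, 430, 443, 750 },
	{ 1200, 365, 486, 506, 988 },
};

/* Cluster power at an OPP: cluster_on - cluster_off. */
unsigned int cluster_power(int opp)
{
	return hikey_raw[opp].cluster_on - hikey_raw[opp].cluster_off;
}

/* Per-CPU busy power at an OPP: cpu_on - cluster_on. */
unsigned int cpu_power(int opp)
{
	return hikey_raw[opp].cpu_on - hikey_raw[opp].cluster_on;
}

/* WFI power: (cpu_wfi - cluster_on) averaged over all OPPs, truncated. */
unsigned int wfi_power(void)
{
	unsigned int sum = 0;
	int i;

	for (i = 0; i < NR_OPPS; i++)
		sum += hikey_raw[i].cpu_wfi - hikey_raw[i].cluster_on;
	return sum / NR_OPPS;
}
```

The per-OPP WFI deltas are 8, 14, 9, 13 and 20, which is the voltage dependence mentioned in the comment; their truncated average gives the single value of 12 used in idle_states_core_a53.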
- Some other questions
Q1: Jian & Dan, the voltage for 1.2GHz is quite high; could you help check
the voltage table for the OPPs, in case there is any unexpected value?
Q2: Morten, if I want to do more profiling on EAS, which branch do you
suggest I use now? I think the EASv5 patches are relatively old now,
so I want to check whether we have a better candidate.
I downloaded the git repo git://www.linux-arm.org/linux-power.git, branch
energy_model_rfc_v5.1, but it has no sched-dvfs related patches.
Thanks,
Leo Yan
I see there is a jank measurement capability in workload automation:
http://pythonhosted.org/wlauto/_modules/wlauto/instrumentation/fps.html
Has anyone used this? Limitations, prerequisites etc? I'd like to use this to measure janks during scrolls. I'll give it a shot but I've never used WA before so I thought I'd check first if anyone had comments/experiences to share.
thanks,
Steve
[+eas-dev]
On 19/10/15 09:06, Vincent Guittot wrote:
> On 8 October 2015 at 12:28, Morten Rasmussen <morten.rasmussen(a)arm.com> wrote:
>> On Thu, Oct 08, 2015 at 10:54:15AM +0200, Vincent Guittot wrote:
>>> On 8 October 2015 at 02:59, Steve Muckle <steve.muckle(a)linaro.org> wrote:
>>>> At Linaro Connect a couple weeks back there was some sentiment that
>>>> taking the max of multiple capacity requests would be the only viable
>>>> policy when cpufreq_sched is extended to support multiple sched classes.
>>>> But I'm concerned that this is not workable - if CFS is requesting 1GHz
>>>> worth of bandwidth on a 2GHz CPU, and DEADLINE is also requesting 1GHz
>>>> of bandwidth, we would run the CPU at 1GHz and starve the CFS tasks
>>>> indefinitely.
>>>>
>>>> I'd think there has to be a summing of bandwidth requests from scheduler
>>>> class clients. MikeT raised the concern that in such schemes you often
>>>> end up with a bunch of extra overhead because everyone adds their own
>>>> fudge factor (Mike please correct me if I'm misstating our concern
>>>> here). We should be able to control this in the scheduler classes though
>>>> and ensure headroom is only added after the requests are combined.
>>>>
>>>> Thoughts?
>>>
>>> I have always been in favor of using summing instead of maximum
>>> because of the example you mentioned above. IIRC, we also said that
>>> the scheduler classes should not request more than the needed capacity
>>> with regard to the schedtune knob position (it means that if schedtune is
>>> set to max power save, no margin should be taken by any scheduler class
>>> other than to filter uncertainties in cpu util computation).
>>> Regarding RT, it's a bit less straightforward as we must ensure an
>>> unknown responsiveness constraint (unlike deadline) so we could easily
>>> request the max capacity to be sure to ensure this unknown constraint.
>>
>> Agreed. I'm in favor of summing the requests, but with a minor twist.
>> As Steve points out, and I have discussed it with Juri as well, with
>> three sched classes and using the max capacity request we would always
>> request too little capacity if more than one class has tasks. Worst case
>> we would only request a third of the required capacity. Deadline would
>> take it all and leave nothing for RT and CFS.
>>
>> Summing the requests instead should be fine, but deadline might cause
>> us to reserve too much capacity if we have short deadline tasks with a
>> tight deadline. For example, a 2ms task (@max capacity) with a 4ms
>> deadline and a 10ms period. In this case deadline would have to request
>> 50% capacity (at least) to meet its deadline but it only uses 20%
>> capacity (scale-invariant). Since deadline has higher priority than RT
>> and CFS we can safely assume that they can use the remaining 30% without
>> harming the deadline task. We can take this into account if we let
>> deadline provide a utilization request (20%) and a minimum capacity
>> request (50%). We would sum the utilization request with the utilization
>> requests of RT and CFS. If sum < deadline_min_capacity, we would choose
>> deadline_min_capacity instead of the sum to determine the capacity. What
>> do you think? It might not be worth the trouble as there are plenty of
>> other scenarios where we would request too much capacity for deadline
>> tasks that can't be fixed.
>
> I have some concerns with a deadline min capacity field. If we take the
> example above, the request seems to be a bit too static, deadline
We should be able to get away from having a special "min capacity" field
for SCHED_DEADLINE, yes. Requests coming from SCHED_DEADLINE should
always concern minimum requirements (to meet deadlines). However, I'll
have to play a bit with all this before being 100% sure.
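To make the options under discussion concrete, here is a minimal sketch of the three ways of combining per-class requests that come up in the quoted text: max, plain sum, and sum with a deadline minimum-capacity floor. All names are illustrative and requests are on a 0..1024 scale; this is not cpufreq_sched code.

```c
#define CAPACITY_MAX 1024UL

struct class_requests {
	unsigned long cfs;
	unsigned long rt;
	unsigned long dl_util;	/* DEADLINE utilization, e.g. 2ms / 10ms */
	unsigned long dl_min;	/* DEADLINE minimum capacity, e.g. 2ms / 4ms */
};

static unsigned long clamp_cap(unsigned long cap)
{
	return cap > CAPACITY_MAX ? CAPACITY_MAX : cap;
}

/* Policy 1: take the largest single request. */
unsigned long combine_max(const struct class_requests *r)
{
	unsigned long cap = r->cfs;

	if (r->rt > cap)
		cap = r->rt;
	if (r->dl_util > cap)
		cap = r->dl_util;
	return clamp_cap(cap);
}

/* Policy 2: sum the requests of all classes. */
unsigned long combine_sum(const struct class_requests *r)
{
	return clamp_cap(r->cfs + r->rt + r->dl_util);
}

/* Policy 3: sum, but never go below the deadline minimum capacity. */
unsigned long combine_sum_dl_floor(const struct class_requests *r)
{
	unsigned long sum = r->cfs + r->rt + r->dl_util;

	return clamp_cap(sum < r->dl_min ? r->dl_min : sum);
}
```

With CFS and DEADLINE each asking for half of a 2GHz CPU (512 each), combine_max() yields 512 and combine_sum() yields 1024. With the 2ms/4ms/10ms task (dl_util = 204, dl_min = 512) and a CFS request of 100, the floor variant yields 512; raising the CFS request to 400 yields 604.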
> scheduler should request 50% only when the task is running. The 50%
> makes sense only if the task starts to run at the beginning of the period
> in order to run for the complete 4ms time slot of the deadline. But the
> request might even have to be increased to 100% if, for some reason
> like another deadline task, an irq or disabled preemption, the task
> starts to run in the last 2ms of the deadline time slot. Once the task
> has finished its running period, the request should go back to 0.
Capacity requests of SCHED_DEADLINE are supposed to be more stable, I
think it's built in how the thing works. There is a particular instant
of time, relative to each period, called "0-lag" point after which, if
the task is not running, we are sure (by construction) that we can
safely release a capacity request relative to a certain task. It is
about theory behind SCHED_DEADLINE implementation, but we should be
able to use this information to ask for the right capacity. As said
above, I'll need more time to think this through and experiment with
it.
> I agree that reality is a bit more complex because we don't have
> "immediate" change of the freq/capacity so we must take into account
> the time needed to change the capacity of the CPU but we should try to
> make the requirement as close as possible to the reality.
>
> Using a min value just means that we are not able to evaluate the
> current capacity requirement of the deadline class and that we will
> just steal the capacity requested by other classes, which is not a good
> solution IMHO
>
> Juri, what will be the granularity of the computation of the bandwidth
> of the patches you are going to send ?
>
I'm not sure I get what you mean by granularity here. The patches will
add a 0..100% bandwidth number, that we'll have to normalize to 0..1024,
for SCHED_DEADLINE tasks currently active. Were you expecting something
else?
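The normalization mentioned above could look roughly like the sketch below. This is an assumption about the arithmetic only: the helper names are hypothetical and a plain per-task percentage is a simplification, not the representation used by the patches.

```c
#define CAPACITY_SCALE 1024UL

/* Map a 0..100% bandwidth number onto the 0..1024 capacity scale. */
unsigned long bw_percent_to_capacity(unsigned int percent)
{
	if (percent > 100)
		percent = 100;
	return percent * CAPACITY_SCALE / 100;
}

/* Capacity for a set of currently active tasks, given per-task bandwidth in %. */
unsigned long dl_active_capacity(const unsigned int *bw_percent, int nr_tasks)
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < nr_tasks; i++)
		total += bw_percent[i];
	return bw_percent_to_capacity(total);
}
```

A single task with 2ms of runtime every 10ms (20%) maps to 204; two active tasks at 20% and 30% map to 512.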
Thanks,
- Juri
>>
>> As Vincent points out, RT is a bit tricky. AFAIK, it doesn't have any
>> utilization tracking at all. I think we have to fix that somehow.
>> Regarding responsiveness, RT doesn't provide any guarantees by design
>> (in the same way deadline does) so we shouldn't be violating any
>> policies by slowing RT tasks down. The users might not be happy though
>> so we could favor performance for RT tasks to avoid breaking legacy
>> software and ask users that care about energy to migrate to deadline
>> where we actually know the performance constraints of the tasks.
>
(changing subject, +eas-dev)
On 10/19/2015 12:34 AM, Vincent Guittot wrote:
>> FWIW as I tested with the browser-short rt-app workload I noticed the
>> rt-app calibration causes the CPU frequency to oscillate between fmin
>> and fmax.
>
> It's normal behavior. During the calibration step, we try to force the
> governor to use the max OPP to evaluate the minimum ns per loop. We
> have 2 sequences: the 1st one just loops on the calibration loop until
> the ns per loop value reaches a stable value. The second one alternates
> run and sleep phases to prevent triggering the thermal mitigation which
> can be triggered during the 1st sequence.
Understood that rt-app would want to hit fmax for calibration, but the
oscillation directly between fmin and fmax, and the timing of the
transitions, seem concerning.
I've copied a trace of this to http://smuckle.net/calibrate.txt . As an
example at ~41.808:
- rt-app starts executing and executes continuously at fmin for about
85ms. That's a long time IMO to underserve a continuously running
workload before changing frequency.
- The frequency then goes directly to fmax and rt-app runs for 3ms more.
We should be going to an intermediate frequency first. This has come up
on lkml and I think everyone wants that change but I'm including it here
for completeness.
- The system then sits idle at fmax for almost 200ms since nothing is
decaying the usage on the idle CPU. This also came up on lkml though
it's probably worth mentioning that it's so easy to reproduce with rt-app.
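To put the ~200ms idle period at fmax in perspective, here is a minimal sketch of how far a PELT signal would fall over that time if it were being decayed. It assumes the usual PELT parameters (1024us periods, decay factor y with y^32 = 0.5) and uses floating point for brevity; the function name is illustrative.

```c
/* PELT decay factor per 1024us period, chosen so that y^32 = 0.5. */
#define PELT_Y 0.97857206

/* Utilization left after 'idle_periods' fully idle 1024us periods. */
double pelt_decay(double util, unsigned int idle_periods)
{
	while (idle_periods--)
		util *= PELT_Y;
	return util;
}
```

200ms is about 195 periods, after which a utilization of 1024 would have decayed to roughly 15, so the signal itself would stop justifying fmax long before the 200ms are over, provided something triggers the update.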
Just curious, is rt-app doing a fixed amount of work in these bursts of
execution? Or is it watching cpufreq nodes so that it knows when the CPU
frequency has hit fmax, so it can then do calibration work?
(adding eas-dev)
On 10/09/2015 01:41 AM, Patrick Bellasi wrote:
>>> The users might not be happy though
>>> so we could favor performance for RT tasks to avoid breaking legacy
>>> software and ask users that care about energy to migrate to deadline
>>> where we actually know the performance constraints of the tasks.
>>
>> Given that at the moment RT tasks are treated no differently than CFS
>> tasks w.r.t. cpu frequency I'd expect that we could get away without any
>> sort of perf bias for RT bandwidth, which I think would be cost
>> prohibitive for power.
>
> Are you specifically considering instantaneous power?
> Because from a power standpoint I cannot see any difference for
> example w.r.t having a batch CFS task.
>
> From an energy standpoint instead, don't you think that a
> "race-to-idle" policy could be better at least for RT-BATCH tasks?
Sorry I should have said energy rather than power...
From my experience race to idle has never panned out as an
energy-efficient strategy, presumably due to the nonlinear increase in
power cost as performance increases. Because of this I think a policy of
increasing the OPP when RT tasks are runnable will cause a net increase
in energy consumption, which need not be incurred since RT tasks do not
receive this preferential OPP treatment today.
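The effect of the nonlinear power curve on a fixed amount of work can be illustrated with simple arithmetic, E = P * cycles / f. The sketch below is only an illustration: the function name is made up, and the example numbers are the per-CPU busy power values from the Hikey model posted on this list (206mW at 729MHz, 502mW at 1200MHz). It ignores cluster power and whatever the CPU burns while idle afterwards.

```c
/*
 * Energy in microjoules needed to execute 'mcycles' million cycles at a
 * fixed OPP. mcycles / freq_mhz is the run time in seconds, mW * s = mJ,
 * and the factor 1000 converts mJ to uJ.
 */
unsigned long energy_uj(unsigned long power_mw, unsigned long freq_mhz,
			unsigned long mcycles)
{
	return power_mw * 1000UL * mcycles / freq_mhz;
}
```

For 1200 million cycles this gives 502000uJ at 1200MHz versus about 339000uJ at 729MHz under these assumptions, i.e. finishing sooner at the higher OPP costs more energy for the same work unless the idle-side savings make up the difference.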
[+eas-dev]
On Mon, Oct 12, 2015 at 02:16:42AM -0700, Michael Turquette wrote:
> Quoting Patrick Bellasi (2015-10-12 01:51:29)
> > On Fri, Oct 09, 2015 at 11:58:22AM +0100, Michael Turquette wrote:
> > > Steve,
> > >
> > > On Thu, Oct 8, 2015 at 1:59 AM, Steve Muckle <steve.muckle(a)linaro.org> wrote:
> > > > At Linaro Connect a couple weeks back there was some sentiment that
> > > > taking the max of multiple capacity requests would be the only viable
> > > > policy when cpufreq_sched is extended to support multiple sched classes.
> > > > But I'm concerned that this is not workable - if CFS is requesting 1GHz
> > > > worth of bandwidth on a 2GHz CPU, and DEADLINE is also requesting 1GHz
> > > > of bandwidth, we would run the CPU at 1GHz and starve the CFS tasks
> > > > indefinitely.
> > >
> > > For the scheduler I think that summing makes sense. For peripheral
> > > devices it places a much higher burden on the system integrator to
> > > figure out how much compute is needed to achieve a specific use case.
> >
> > Are we still thinking about exposing an interface to device drivers?
> >
> > With sched-DVFS we are able to select the CPU's OPP based on real
> > (expected) task demand. Thus, I expect constraints from device drivers
> > being useful only in these use cases:
> >
> > a) the tasks activated by the driver are not "big enough" to
> > generate an OPP switch, but still we want to race-to-idle their
> > execution
> > e.g. a light-weight control thread for an external accelerator,
> > which could benefit from running on a higher OPP to reduce
> > overall processing latencies
> > b) we do not know which tasks require a boost in performance
> > e.g. something like the "boost pulse" exposed by the Interactive
> > governor, where the input subsystem is considered to trigger
> > latency sensitive operations and thus the system deserves to be
> > globally boosted
> >
> > Are there other classes of use-cases?
> >
> > If these are the only use-cases we are thinking about, I'm wondering
> > if we could not try to cover all them via SchedTune and the boost
> > value it exposes.
> >
> > The use-case a) is already covered by the first implementation of
> > SchedTune. If we know which task should be boosted, technically we
> > could expose the SchedTune interface to drivers to allow them to boosts
> > specific tasks.
>
> It sounds like it could work for me.
>
> My concerns are bandwidth-related use cases. I've observed high speed
> MMC controllers, WiFi chips, GPUs and other devices whose CPU tasks were
> "small", but their performance was adversely affected when the CPU runs
> at a slower rate. This is true even on a relatively quiet system.

I see your point. I wonder whether most of these use-cases would be
better implemented using DEADLINE instead of FIFO/RR.
For these cases the problem will be solved once DL is properly
integrated with sched-DVFS. For the remaining use-cases where we still
want (or are "limited") to use FIFO/RR, I lean towards the idea that a
race-to-idle strategy could just work.

> Tasks running on the CPU can be viewed as latencies from the perspective
> of a peripheral/IO device. TI and many other vendors have implemented
> out-of-tree solutions to hold a CPU at a minimum OPP/frequency when
> these drivers are running (usually with terrible hacks, but in some
> cases nicely wrapped up in runtime_pm_{get,put} callbacks).

I know these scenarios very well; in the past I experimented a lot
with x86 machines running OpenCL workloads offloaded to GPGPUs.
Quite frequently the bottleneck was the control thread running on the
CPU, to the point that by co-scheduling two apps on the same GPU you
could get better performance (for both apps) than by running one app
at a time.
This calls for some kind of coordination on frequency selection between
the workloads running on the CPU and those on the accelerator.
But if you consider the specific GPGPU use-case, many times the source
of knowledge about the required bandwidth is not in kernel-space but
in user-space. The OpenCL run-time, as well as many other
run-times, could provide valuable input to the scheduler about these
dependencies.

> Does this fit with your model of how schedtune is supposed to work? I

I think it's worth a try... provided that, in case of
promising results, we are eventually happy to replace the CPUFreq
specific API exposed to drivers with a more generic interface exposed to
both kernel- and user-space.
If instead we end up with two different APIs to achieve the same goal,
this will just be confusing.

> have not looked at that stuff at all... are start_the_work() and
> stop_the_work() critical-section functions exposed to drivers?

Mmm... I don't get that question.
Which functions/critical sections are you referring to?

> Regards,
> Mike
Cheers Patrick
> >
> > Regarding the second use-case b), this is a feature we try to address
> > using the global boost value. Right now the only consumer of that
> > global boost value is the FAIR scheduling class. However, it should be
> > quite easy to extend it with the integration of SchedDVFS into other
> > scheduling classes.
> >
> > > However, I might be the only one that is concerned with this use case right
> > > now.
> > >
> > > >
> > > > I'd think there has to be a summing of bandwidth requests from scheduler
> > > > class clients. MikeT raised the concern that in such schemes you often
> > > > end up with a bunch of extra overhead because everyone adds their own
> > > > fudge factor (Mike please correct me if I'm misstating our concern
> > > > here).
> > >
> > > To be clear, I raised that point because I've actually seen that in
> > > the past when TI implemented out-of-tree cpu frequency constraint
> > > systems. I'm not being hypothetical :-)
> >
> > That's the point, SchedTune aims at becoming a sort of (hopefully)
> > official solution to set up "frequency constraints".
> >
> > > > We should be able to control this in the scheduler classes though
> > > > and ensure headroom is only added after the requests are combined.
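
The overhead from stacked fudge factors can be illustrated with a small sketch. It assumes a hypothetical 25% margin (`HEADROOM_PCT`) applied both by each client and by the combining layer; the helper names are made up for illustration and do not correspond to existing kernel code.

```c
/* Hypothetical headroom, in percent, added on top of a raw request. */
#define HEADROOM_PCT 25U

static unsigned int add_headroom(unsigned int mhz)
{
	return mhz + mhz * HEADROOM_PCT / 100U;
}

/* Each client pads its own request, then the combined value is padded again. */
static unsigned int combine_padded(unsigned int a_mhz, unsigned int b_mhz)
{
	return add_headroom(add_headroom(a_mhz) + add_headroom(b_mhz));
}

/* Clients report raw demand; headroom is added once, after summing. */
static unsigned int combine_raw(unsigned int a_mhz, unsigned int b_mhz)
{
	return add_headroom(a_mhz + b_mhz);
}
```

For two raw requests of 400 MHz each, the padded path ends up asking for 1250 MHz, whereas adding the margin once after combining asks for 1000 MHz.
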
> > >
> > > Sounds promising. Everyone else in this thread supports aggregation
> > > (or summing) over maximum value, and I definitely won't argue the
> > > point on the list unless it presents a real problem in testing.
> > >
> > > Regards,
> > > Mike
> > >
> > > >
> > > > Thoughts?
> > >
> > >
> > >
> > > --
> > > Michael Turquette
> > > CEO
> > > BayLibre - At the Heart of Embedded Linux
> > > http://baylibre.com/
> >
> > Cheers Patrick
> >
> > --
> > #include <best/regards.h>
> >
> > Patrick Bellasi
> >
>
--
#include <best/regards.h>
Patrick Bellasi