On Tue, 2012-10-23 at 18:30 +0100, Pawel Moll wrote:
=== Option 1: Trace event ===
This seems to be the "cheapest" option. Simply defining a trace event that can be generated by a hwmon (or any other) driver makes the interesting data immediately available to any ftrace/perf user. Of course it doesn't really help with the cpufreq case, but it seems a good place to start.
The question is how to define it... I've come up with two prototypes:
= Generic hwmon trace event =
This one allows any driver to generate a trace event whenever any "hwmon attribute" (measured value) gets updated. The rate at which the updates happen can be controlled by already existing "update_interval" attribute.
8<-------------------------------------------
TRACE_EVENT(hwmon_attr_update,

	TP_PROTO(struct device *dev, struct attribute *attr,
			long long input),

	TP_ARGS(dev, attr, input),

	TP_STRUCT__entry(
		__string(	dev,	dev_name(dev))
		__string(	attr,	attr->name)
		__field(	long long,	input)
	),

	TP_fast_assign(
		__assign_str(dev, dev_name(dev));
		__assign_str(attr, attr->name);
		__entry->input = input;
	),

	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr),
			__entry->input)
);
8<-------------------------------------------
It generates an ftrace message like this:
<...> 212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
One issue with this is that some external knowledge is required to relate a number to a processor core. Or maybe it's not an issue at all, because it should be left to the user(space)?
If the external knowledge can be characterized in a userspace tool with the given data here, I see no issues with this.
= CPU power/energy/temperature trace event =
This one is designed to emphasize the relation between the measured value (whether it is energy, temperature or any other physical phenomenon, really) and CPUs, so it is quite specific (too specific?).
8<-------------------------------------------
TRACE_EVENT(cpus_environment,

	TP_PROTO(const struct cpumask *cpus, long long value,
			char unit),

	TP_ARGS(cpus, value, unit),

	TP_STRUCT__entry(
		__array(	unsigned char,	cpus,
						sizeof(struct cpumask))
		__field(	long long,	value)
		__field(	char,		unit)
	),

	TP_fast_assign(
		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
Copying the entire cpumask seems like overkill. Especially when you have 4096 CPU machines.
		__entry->value = value;
		__entry->unit = unit;
	),

	TP_printk("cpus %s %lld[%c]",
			__print_cpumask((struct cpumask *)__entry->cpus),
			__entry->value, __entry->unit)
);
8<-------------------------------------------
And the equivalent ftrace message is:
<...> 127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
It's a cpumask, not just a single cpu id, because the sensor may measure the value per set of CPUs, e.g. the temperature of the whole silicon die (so all the cores) or the energy consumed by a subset of cores (this is my particular use case - two meters monitor a cluster of two processors and a cluster of three processors, all working as an SMP system).
Of course the cpus __array could actually be a special __cpumask field type (I've just hacked the __print_cpumask so far). And I've just realised that the unit field should actually be a string, to allow unit prefixes to be specified (the above should obviously be "34361[mC]", not "[C]"). Also - excuse the "cpus_environment" name - this was the best I was able to come up with at the time and I'm eager to accept any alternative suggestions :-)
Perhaps a field that holds just a subset of cpus would be better. That way we don't waste the ring buffer with lots of zeros. I'm guessing that it will only be a group of cpus, and not a scattered list? Of course, I've seen boxes where the cpu numbers alternated between cores. That is, cpu 0 was on core 1, cpu 1 was on core 2, and then it would repeat: cpu 8 was on core 1, cpu 9 was on core 2, etc.
But still, this could be compressed somehow.
I'll let others comment on the rest.
-- Steve