On Wed, Oct 24, 2012 at 05:37:27PM +0100, Pawel Moll wrote:
On Tue, 2012-10-23 at 23:02 +0100, Guenter Roeck wrote:
Traditionally such data should be exposed to the user via hwmon sysfs interface, and that's exactly what I did for "my" platform - I have a /sys/class/hwmon/hwmon*/device/energy*_input and this was good enough to draw pretty graphs in userspace. Everyone was happy...
The only driver supporting "energy" output so far is ibmaem, and the reported energy is supposed to be cumulative, as in energy = power * time. Do you mean power, possibly?
So the vexpress would be the second one, then :-) as the energy "monitor" on the latest tiles actually reports a 64-bit value of microjoules consumed (or produced) since power-up.
Some of the older boards were able to report instant power, but this metric is less useful in our case.
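For anyone who does need power from such a counter: since energy = power * time, average power over an interval follows from two readings of the cumulative value. A minimal Python sketch, assuming a monotonically increasing 64-bit microjoule counter as described above. The function name and the wraparound handling are illustrative assumptions, not something the vexpress driver is known to provide:

```python
COUNTER_BITS = 64  # assumed counter width, per the description above


def average_power_watts(e0_uj, t0_s, e1_uj, t1_s, bits=COUNTER_BITS):
    """Average power in watts between two cumulative energy samples.

    e0_uj, e1_uj: cumulative energy readings in microjoules.
    t0_s, t1_s: timestamps of the two readings in seconds.
    """
    dt = t1_s - t0_s
    if dt <= 0:
        raise ValueError("second sample must be taken after the first")
    # Modular subtraction tolerates a single wrap of the counter.
    delta_uj = (e1_uj - e0_uj) % (1 << bits)
    return delta_uj / 1e6 / dt
```

Sampling at a fixed interval and feeding consecutive pairs through this function would give a power-over-time graph from an energy-only sensor.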
Now I am getting new requests to do more with this data. In particular I'm asked how to add such information to ftrace/perf output. The second most frequent request is about providing it to an "energy aware" cpufreq governor.
Anything energy related would have to be along the lines of "do something after a certain amount of work has been performed", which at least on the surface does not make much sense to me, unless you mean something along the lines of a process scheduler which schedules a process not based on time slices but based on energy consumed, i.e. if you want to define a time slice not in milliseconds but in Joules.
Actually there is some research being done in this direction, but it's way too early to draw any conclusions...
If so, I would argue that a similar behavior could be achieved by varying the duration of time slices with the current CPU speed, or simply by using cycle count instead of time as time slice parameter. Not that I am sure if such an approach would really be of interest for anyone.
Or do you really mean power, not energy, such as in "reduce CPU speed if its power consumption is above X Watt" ?
Uh. To be completely honest I must answer: I'm not sure how the "energy aware" cpufreq governor is supposed to work. I have simply been asked to provide the data in some standard way, if possible.
I am not sure how this would be expected to work. hwmon is, by its very nature, a passive subsystem: It doesn't do anything unless data is explicitly requested from it. It does not update an attribute unless that attribute is read. That does not seem to fit well with the idea of tracing - which assumes that some activity is happening, ultimately, all by itself, presumably periodically. The idea to have a user space application read hwmon data only for it to trigger trace events does not seem to be very compelling to me.
What I had in mind was similar to what the adt7470 driver does. The driver would automatically access the device every now and then to update its internal state and generate the trace event on the way. This auto-refresh "feature" is particularly appealing to me, as on some of "my" platforms it can take up to 500 microseconds to actually get the data. So doing this in the background (and providing users with the last known value in the meantime) seems attractive.
A bad example doesn't mean it should be used elsewhere.
adt7470 needs up to two seconds for a temperature measurement cycle, and it cannot perform automatic cycles all by itself. In this context, executing temperature measurement cycles in the background makes a lot of sense, especially since one does not want to wait for two seconds when reading a sysfs attribute.
But that only means that the chip is most likely not a good choice when selecting a temperature sensor, not that the code necessary to get it working should be used as an example for other drivers.
Guenter
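The background-refresh pattern debated above (poll a slow device periodically, serve the last known value to readers, optionally notify on each update) can be sketched in user space as follows. This is only an illustration in Python, not kernel code and not how adt7470 is implemented. `CachedSensor`, `read_fn`, and `on_update` are hypothetical names, with `on_update` standing in for "generate a trace event":

```python
import threading


class CachedSensor:
    """Refresh a slow reading in the background; serve the last known value."""

    def __init__(self, read_fn, interval_s):
        self._read_fn = read_fn        # slow device access (hypothetical callback)
        self._interval_s = interval_s  # pause between refreshes, in seconds
        self._lock = threading.Lock()
        self._value = None             # None until the first refresh completes
        self._stop = threading.Event()
        self._thread = None
        self.on_update = None          # optional hook, e.g. to emit a trace event

    def refresh_once(self):
        value = self._read_fn()        # may block; done outside the lock
        with self._lock:
            self._value = value
        if self.on_update is not None:
            self.on_update(value)

    def last_value(self):
        # Returns immediately, never touching the device.
        with self._lock:
            return self._value

    def _run(self):
        while not self._stop.is_set():
            self.refresh_once()
            self._stop.wait(self._interval_s)

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

The trade-off is the one raised in the thread: readers never wait for the device, but the device is polled whether or not anyone is interested in the data.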
An exception is if a monitoring device supports interrupts, and if its driver actually implements those interrupts. This is, however, not the case for most of the current drivers (if any), mostly because interrupt support for hardware monitoring devices is very platform dependent and thus difficult to implement.
Interestingly enough, the newest version of our platform control micro (doing the energy monitoring as well) can generate an interrupt when a transaction is finished, so I was planning to periodically update all sorts of values. And again, generating a trace event at that point would be trivial.
Of course a particular driver could register its own perf PMU on its own. It's certainly an option, just very suboptimal in my opinion. Or maybe not? Maybe the task is so specialized that it makes sense?
We had a couple of attempts to provide an in-kernel API. Unfortunately, the result was, at least so far, more complexity on the driver side. So the difficulty is really to define an API which is really simple, and does not just complicate driver development for a (presumably) rare use case.
Yes, I appreciate this. That's why this option is actually my least favourite. Anyway, what I was thinking about was just a thin shim that *can* be used by a driver to register some particular value with the core (so it can be enumerated and accessed by in-kernel clients) and the core could (or not) create a sysfs attribute for this value on behalf of the driver. Seems lightweight enough, unless previous experience suggests otherwise?
Cheers!
Paweł
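The "thin shim" idea above (a driver optionally registers a value with a core, clients enumerate and read it, and the core may or may not publish an attribute for it) can be modelled roughly like this. It is a Python sketch of the shape of such an interface only. `ValueRegistry` and all of its methods are hypothetical, and no such hwmon API is implied to exist:

```python
class ValueRegistry:
    """Hypothetical core: drivers register named values, clients look them up."""

    def __init__(self):
        self._values = {}  # name -> (read_fn, unit, expose)

    def register(self, name, read_fn, unit, expose=True):
        # Called by a driver that chooses to publish a value.
        if name in self._values:
            raise KeyError(f"{name!r} already registered")
        self._values[name] = (read_fn, unit, expose)

    def unregister(self, name):
        del self._values[name]

    def names(self):
        # Enumeration for clients such as a tracer or a governor.
        return sorted(self._values)

    def read(self, name):
        read_fn, unit, _ = self._values[name]
        return read_fn(), unit

    def exposed(self):
        """Names the core would publish as attributes on the driver's behalf."""
        return sorted(n for n, (_, _, expose) in self._values.items() if expose)
```

A driver would opt in with a single call, for example `reg.register("energy1_input", read_energy_uj, "uJ")`, where `read_energy_uj` is a placeholder for the driver's own read routine. Drivers that do not need the feature would carry no extra code.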