Dear All,
I am going through the EAS project work and trying to port it to my
ARM-based SMP system (Linux 3.10).
Could you please help me clarify: will EAS be helpful in terms of
power/performance for SMP systems as well?
Thanks & Regards,
Nitish Ambastha
Hi all,
Below are raw power data measured on the Hikey board; from these data
I'd like to create the power model for Hikey.
- Measurement method:
On the Hikey board we cannot measure buck1, which is dedicated to the
AP subsystem; so instead we measured VDD_4V2 by removing R247 and
mounting a 470mOhm shunt resistor. As a result, the power data
includes many other LDOs' power as well.
+--------------+ +-------------+
4.2v | | Buck1 | |
---- Shunt Resistor --->| PMIC: Hi6553 |------>| SoC: Hi6220 |
^ ^ | | | ACPU |
| | +--------------+ +-------------+
|-> Energy Probe <-|
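As a sanity check on the measurement setup above, the rail power seen by the energy probe follows directly from the shunt voltage drop. A minimal sketch (assuming an ideal 470mOhm shunt and the nominal 4.2V rail; the function name is mine):

```python
def rail_power_mw(v_shunt_mv, r_shunt_mohm=470.0, v_rail_v=4.2):
    """P = V_rail * I, with the shunt current I = V_shunt / R_shunt."""
    current_ma = v_shunt_mv / (r_shunt_mohm / 1000.0)  # mV / Ohm -> mA
    return v_rail_v * current_ma                       # V * mA -> mW

# e.g. a 50 mV drop across the shunt:
#   I = 50 / 0.47 ~= 106.4 mA, so P ~= 4.2 * 106.4 ~= 446.8 mW
```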
- Measured raw data:
sys_suspend: AP system suspend state
cluster_off: both clusters are powered off
cluster_on: 1 cluster is powered on, all cpus are powered off
cpu_wfi: 1 cluster is powered on, the last cpu is in 'wfi', the other
cpus are powered off
cpu_on: 1 cluster is powered on, 1 cpu is running at the given OPP
voltage: voltage for every OPP
(OPP in MHz, power columns in mW, voltage in V)
# OPP sys_suspend cluster_off cluster_on cpu_wfi cpu_on voltage
208 328 347 366 374 435 1.04
432 328 344 374 388 499 1.04
729 331 351 400 409 606 1.09
960 329 353 430 443 750 1.18
1200 331 365 486 506 988 1.33
[ASCII chart "Hikey Power Model": Power [mW] (0-500) vs Frequency [MHz]
(208, 432, 729, 960, 1200); series: cluster_off, cluster_on, cpu_wfi,
cpu p-state. The cluster_off and cpu_wfi curves stay nearly flat along
the bottom, while cluster_on and especially the cpu p-state power climb
steeply with frequency; the underlying numbers are the per-OPP figures
derived from the table above.]
[ASCII chart "Hikey Power Efficiency": Power Efficiency [mW/MHz] (left
axis, 0-0.45) and Voltage [V] (right axis, 0-3) vs Frequency [MHz]
(208-1200); series: Cluster Power, CPU Static Power, CPU Dynamic power,
Voltage. The CPU dynamic-power efficiency curve is roughly flat through
the middle OPPs and degrades at the highest OPP, where the voltage jumps
to 1.33V.]
- Power Model on Hikey:
Based on our earlier discussion of the power model, I think the data
below is the preferred power model for Hikey, calculated from the raw
power data:
static struct idle_state idle_states_cluster_a53[] = {
{ .power = 0 },
{ .power = 0 },
};
/*
* Use (cluster_on - cluster_off) for every OPP
*/
static struct capacity_state cap_states_cluster_a53[] = {
/* Power per cluster */
{ .cap = 178, .power = 19, },
{ .cap = 369, .power = 30, },
{ .cap = 622, .power = 49, },
{ .cap = 819, .power = 77, },
{ .cap = 1024, .power = 121, },
};
/*
 * Use (cpu_wfi - cluster_on) for every OPP, then take the average
 * as the wfi power figure. Note that in practice the power of the
 * "WFI" idle state is affected by voltage.
 */
static struct idle_state idle_states_core_a53[] = {
{ .power = 12 },
{ .power = 0 },
};
/*
 * Use (cpu_on - cluster_on) for every OPP
 */
static struct capacity_state cap_states_core_a53[] = {
/* Power per cpu */
{ .cap = 178, .power = 69, }, /* 208MHz */
{ .cap = 369, .power = 125, }, /* 432MHz */
{ .cap = 622, .power = 206, }, /* 729MHz */
{ .cap = 819, .power = 320, }, /* 960MHz */
{ .cap = 1024, .power = 502, }, /* 1.2GHz */
};
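The power numbers in the structs above can be reproduced from the raw measurements; a small sketch (the capacity values are taken as given from the tables, since their exact derivation from frequency is not spelled out here):

```python
# (MHz, sys_suspend, cluster_off, cluster_on, cpu_wfi, cpu_on) in mW
RAW = [
    (208,  328, 347, 366, 374, 435),
    (432,  328, 344, 374, 388, 499),
    (729,  331, 351, 400, 409, 606),
    (960,  329, 353, 430, 443, 750),
    (1200, 331, 365, 486, 506, 988),
]

# Per-cluster power: (cluster_on - cluster_off) for every OPP.
cluster_power = [on - off for _, _, off, on, _, _ in RAW]

# Per-cpu power: the struct values match (cpu_on - cluster_on).
cpu_power = [cpu - on for _, _, _, on, _, cpu in RAW]

# Core WFI idle power: truncated average of (cpu_wfi - cluster_on).
wfi_power = sum(wfi - on for _, _, _, on, wfi, _ in RAW) // len(RAW)
```

Running this yields cluster_power = [19, 30, 49, 77, 121], cpu_power = [69, 125, 206, 320, 502], and wfi_power = 12, matching the structs.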
If you have any questions or issues with the energy model data above,
please let me know; thanks in advance for reviews and comments.
- Some other questions
Q1: Jian & Dan, the voltage for 1.2GHz is quite high; could you help
check the voltage table for the OPPs for any unexpected values?
Q2: Morten, if I want to do more profiling on EAS, which branch do you
suggest I refer to now? The EASv5 patches are relatively old, so I
want to check whether there is a better candidate.
I downloaded the git repo git://www.linux-arm.org/linux-power.git, branch
energy_model_rfc_v5.1, but it contains no sched-dvfs related patches.
Thanks,
Leo Yan
[+eas-dev]
On 19/10/15 09:06, Vincent Guittot wrote:
> On 8 October 2015 at 12:28, Morten Rasmussen <morten.rasmussen(a)arm.com> wrote:
>> On Thu, Oct 08, 2015 at 10:54:15AM +0200, Vincent Guittot wrote:
>>> On 8 October 2015 at 02:59, Steve Muckle <steve.muckle(a)linaro.org> wrote:
>>>> At Linaro Connect a couple weeks back there was some sentiment that
>>>> taking the max of multiple capacity requests would be the only viable
>>>> policy when cpufreq_sched is extended to support multiple sched classes.
>>>> But I'm concerned that this is not workable - if CFS is requesting 1GHz
>>>> worth of bandwidth on a 2GHz CPU, and DEADLINE is also requesting 1GHz
>>>> of bandwidth, we would run the CPU at 1GHz and starve the CFS tasks
>>>> indefinitely.
>>>>
>>>> I'd think there has to be a summing of bandwidth requests from scheduler
>>>> class clients. MikeT raised the concern that in such schemes you often
>>>> end up with a bunch of extra overhead because everyone adds their own
>>>> fudge factor (Mike please correct me if I'm misstating our concern
>>>> here). We should be able to control this in the scheduler classes though
>>>> and ensure headroom is only added after the requests are combined.
>>>>
>>>> Thoughts?
>>>
>>> I have always been in favor of summing instead of taking the maximum
>>> because of the example you mentioned above. IIRC, we also said that
>>> the scheduler classes should not request more capacity than needed
>>> with regard to the schedtune knob position; it means that if schedtune
>>> is set to max power save, no margin should be taken by any scheduler
>>> class (other than to filter uncertainties in the cpu util computation).
>>> Regarding RT, it's a bit less straightforward as we must ensure an
>>> unknown responsiveness constraint (unlike deadline), so we could easily
>>> request the max capacity to be sure to meet this unknown constraint.
>>
>> Agreed. I'm in favor with summing the requests, but with a minor twist.
>> As Steve points out, and I have discussed it with Juri as well, with
>> three sched classes and using the max capacity request we would always
>> request too little capacity if more than one class has tasks. Worst case
>> we would only request a third of the required capacity. Deadline would
>> take it all and leave nothing for RT and CFS.
>>
>> Summing the requests instead should be fine, but deadline might cause
>> us to reserve too much capacity if we have short deadline tasks with a
>> tight deadline. For example, a 2ms task (@max capacity) with a 4ms
>> deadline and a 10ms period. In this case deadline would have to request
>> 50% capacity (at least) to meet its deadline but it only uses 20%
>> capacity (scale-invariant). Since deadline has higher priority than RT
>> and CFS we can safely assume that they can use the remaining 30% without
>> harming the deadline task. We can take this into account if we let
>> deadline provide a utilization request (20%) and a minimum capacity
>> request (50%). We would sum the utilization request with the utilization
>> requests of RT and CFS. If sum < deadline_min_capacity, we would choose
>> deadline_min_capacity instead of the sum to determine the capacity. What
>> do you think? It might not be worth the trouble as there are plenty of
>> other scenarios where we would request too much capacity for deadline
>> tasks that can't be fixed.
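As an editor's sketch of the combination rule Morten describes above (all capacities normalized to 0..1024; the function name is mine, not from any patchset):

```python
def pick_capacity(dl_util, dl_min_cap, other_utils, max_cap=1024):
    """Sum the utilization requests of all classes, but never go below
    the minimum capacity DEADLINE needs to meet its deadlines."""
    total = dl_util + sum(other_utils)
    return min(max(total, dl_min_cap), max_cap)

# Morten's example: the 2ms/4ms-deadline/10ms-period task uses 20% (205)
# but needs 50% (512). If RT+CFS together request 30% (307), the sum is
# already 512, so nothing extra is reserved for the deadline minimum.
```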
>
> I have some concern with a deadline min capacity field. If we take the
> example above, the request seems to be a bit too static; the deadline
We should be able to get away without having a special "min capacity"
field for SCHED_DEADLINE, yes. Requests coming from SCHED_DEADLINE
should always concern minimum requirements (to meet deadlines). However,
I'll have to play a bit with all this before being 100% sure.
> scheduler should request 50% only when the task is running. The 50%
> only makes sense if the task starts to run at the beginning of the
> period and runs for the complete 4ms time slot of the deadline. But
> the request might even have to be increased to 100% if, for some reason
> like another deadline task, an irq, or disabled preemption, the task
> starts to run in the last 2ms of the deadline time slot. Once the task
> has finished its running period, the request should go back to 0.
Capacity requests of SCHED_DEADLINE are supposed to be more stable; I
think that is built into how the thing works. There is a particular
instant of time, relative to each period, called the "0-lag" point,
after which, if the task is not running, we are sure (by construction)
that we can safely release the capacity request relative to that task.
This comes from the theory behind the SCHED_DEADLINE implementation, but
we should be able to use this information to ask for the right capacity.
As said above, I'll need more time to think this through and experiment
with it.
> I agree that reality is a bit more complex because we don't have
> "immediate" change of the freq/capacity so we must take into account
> the time needed to change the capacity of the CPU but we should try to
> make the requirement as close as possible to the reality.
>
> Using a min value just means that we are not able to evaluate the
> current capacity requirement of the deadline class and that we will
> just steal the capacity requested by the other classes, which is not
> a good solution IMHO
>
> Juri, what will be the granularity of the computation of the bandwidth
> of the patches you are going to send ?
>
I'm not sure I get what you mean by granularity here. The patches will
add a 0..100% bandwidth number, that we'll have to normalize to 0..1024,
for SCHED_DEADLINE tasks currently active. Were you expecting something
else?
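The normalization mentioned above is a one-liner; a sketch (assuming simple integer scaling, and a function name of my own choosing):

```python
def dl_bw_to_capacity(bw_pct):
    """Map a 0..100% DEADLINE bandwidth figure to the 0..1024
    capacity scale used by the scheduler."""
    return bw_pct * 1024 // 100
```

e.g. dl_bw_to_capacity(50) == 512 and dl_bw_to_capacity(100) == 1024.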
Thanks,
- Juri
>>
>> As Vincent points out, RT is a bit tricky. AFAIK, it doesn't have any
>> utilization tracking at all. I think we have to fix that somehow.
>> Regarding responsiveness, RT doesn't provide any guarantees by design
>> (in the same way deadline does) so we shouldn't be violating any
>> policies by slowing RT tasks down. The users might not be happy though
>> so we could favor performance for RT tasks to avoid breaking legacy
>> software and ask users that care about energy to migrate to deadline
>> where we actually know the performance constraints of the tasks.
>
(changing subject, +eas-dev)
On 10/19/2015 12:34 AM, Vincent Guittot wrote:
>> FWIW as I tested with the browser-short rt-app workload I noticed the
>> > rt-app calibration causes the CPU frequency to oscillate between fmin
>> > and fmax.
>
> It's normal behavior. During the calibration step, we try to force the
> governor to use the max OPP to evaluate the minimum ns per loop. We
> have 2 sequences: the first one just loops on the calibration loop
> until the ns-per-loop value reaches a stable value. The second one
> alternates run and sleep phases to prevent triggering the thermal
> mitigation, which can be triggered during the first sequence.
Understood that rt-app would want to hit fmax for calibration, but the
oscillation directly between fmin and fmax, and the timing of the
transitions, seem concerning.
I've copied a trace of this to http://smuckle.net/calibrate.txt . As an
example at ~41.808:
- rt-app starts executing and executes continuously at fmin for about
85ms. That's a long time IMO to underserve a continuously running
workload before changing frequency.
- The frequency then goes directly to fmax and rt-app runs for 3ms more.
We should be going to an intermediate frequency first. This has come up
on lkml and I think everyone wants that change but I'm including it here
for completeness.
- The system then sits idle at fmax for almost 200ms since nothing is
decaying the usage on the idle CPU. This also came up on lkml though
it's probably worth mentioning that it's so easy to reproduce with rt-app.
Just curious, is rt-app doing a fixed amount of work in these bursts of
execution? Or is it watching cpufreq nodes so that it knows when the CPU
frequency has hit fmax, so it can then do calibration work?
(adding eas-dev)
On 10/09/2015 01:41 AM, Patrick Bellasi wrote:
>>> The users might not be happy though
>>> > > so we could favor performance for RT tasks to avoid breaking legacy
>>> > > software and ask users that care about energy to migrate to deadline
>>> > > where we actually know the performance constraints of the tasks.
>> >
>> > Given that at the moment RT tasks are treated no differently than CFS
>> > tasks w.r.t. cpu frequency I'd expect that we could get away without any
>> > sort of perf bias for RT bandwidth, which I think would be cost
>> > prohibitive for power.
>
> Are you specifically considering instantaneous power?
> Because from a power standpoint I cannot see any difference for
> example w.r.t having a batch CFS task.
>
> From an energy standpoint instead, don't you think that a
> "race-to-idle" policy could be better, at least for RT-BATCH tasks?
Sorry I should have said energy rather than power...
From my experience race to idle has never panned out as an
energy-efficient strategy, presumably due to the nonlinear increase in
power cost as performance increases. Because of this I think a policy of
increasing the OPP when RT tasks are runnable will cause a net increase
in energy consumption, which need not be incurred since RT tasks do not
receive this preferential OPP treatment today.
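To illustrate the nonlinearity with numbers (borrowing, purely as an editor's example, the Hikey per-CPU dynamic power figures posted earlier in this archive): energy per unit of work grows with frequency because power rises super-linearly with voltage and frequency, so racing to idle at fmax costs more energy than running slower, static/idle power aside.

```python
# MHz -> per-CPU dynamic power in mW (from the Hikey measurements)
OPP_POWER = {432: 125, 1200: 502}

def energy_mj(work_mcycles, mhz):
    """Energy = power * time; time = cycles / frequency.
    Static and idle power are ignored in this sketch."""
    return OPP_POWER[mhz] * (work_mcycles / mhz)

# The same 432 Mcycles of work:
#   at  432 MHz: 125 mW * 1.00 s  = 125.0 mJ
#   at 1200 MHz: 502 mW * 0.36 s ~= 180.7 mJ, then idle
```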
[+eas-dev]
On Mon, Oct 12, 2015 at 02:16:42AM -0700, Michael Turquette wrote:
> Quoting Patrick Bellasi (2015-10-12 01:51:29)
> > On Fri, Oct 09, 2015 at 11:58:22AM +0100, Michael Turquette wrote:
> > > Steve,
> > >
> > > On Thu, Oct 8, 2015 at 1:59 AM, Steve Muckle <steve.muckle(a)linaro.org> wrote:
> > > > At Linaro Connect a couple weeks back there was some sentiment that
> > > > taking the max of multiple capacity requests would be the only viable
> > > > policy when cpufreq_sched is extended to support multiple sched classes.
> > > > But I'm concerned that this is not workable - if CFS is requesting 1GHz
> > > > worth of bandwidth on a 2GHz CPU, and DEADLINE is also requesting 1GHz
> > > > of bandwidth, we would run the CPU at 1GHz and starve the CFS tasks
> > > > indefinitely.
> > >
> > > For the scheduler I think that summing makes sense. For peripheral
> > > devices it places a much higher burden on the system integrator to
> > > figure out how much compute is needed to achieve a specific use case.
> >
> > Are we still thinking about exposing an interface to device drivers?
> >
> > With sched-DVFS we are able to select the CPU's OPP based on real
> > (expected) task demand. Thus, I expect constraints from device drivers
> > being useful only in these use cases:
> >
> > a) the tasks activated by the driver are not "big enough" to
> > generate an OPP switch, but still we want to race-to-idle their
> > execution
> > e.g. a light-weight control thread for an external accelerator,
> > which could benefit from running on a higher OPP to reduce
> > overall processing latencies
> > b) we do not know which tasks require a boost in performances
> > e.g. something like the "boost pulse" exposed by the Interactive
> > governor, where the input subsystem is considered to trigger
> > latency sensitive operations and thus the system deserves to be
> > globally boosted
> >
> > Are there other classes of use-cases?
> >
> > If these are the only use-cases we are thinking about, I'm wondering
> > if we could not try to cover all them via SchedTune and the boost
> > value it exposes.
> >
> > The use-case a) is already covered by the first implementation of
> > SchedTune. If we know which task should be boosted, technically we
> > could expose the SchedTune interface to drivers to allow them to boosts
> > specific tasks.
>
> It sounds like could work for me.
>
> My concerns are bandwidth-related use cases. I've observed high speed
> MMC controllers, WiFi chips, GPUs and other devices whose CPU tasks were
> "small", but their performance was adversely affected when the CPU runs
> at a slower rate. This is true even on a relatively quiet system.
I see your point; I'm wondering whether most of these use-cases would
not be better implemented using DEADLINE instead of FIFO/RR.
For those cases the problem will be solved once DL is properly
integrated with sched-DVFS. For the remaining use-cases where we still
want (or are "limited") to use FIFO/RR, I'm more of the opinion that a
race-to-idle strategy could just work.
> Tasks running on the CPU can be viewed as latencies from the perspective
> of a peripheral/IO device. TI and many other vendors have implemented
> out-of-tree solutions to hold a CPU at a minimum OPP/frequency when
> these drivers are running (usually with terrible hacks, but in some
> cases nicely wrapped up in runtime_pm_{get,put} callbacks).
I know these scenarios very well; in the past I experimented a lot
with x86 machines running OpenCL workloads offloaded to GPGPUs.
Quite frequently the bottleneck was the control thread running on the
CPU, to the point that by co-scheduling two apps on the same GPU you
could get better performance (for both apps) than by running one app
at a time.
This calls for some kind of coordination on frequency selection between
the workloads running on the CPU and on the accelerator.
But if you consider the specific GPGPU use-case, often the knowledge
about the required bandwidth lives not in kernel-space but in
user-space. The OpenCL run-time, as well as many other run-times, could
provide valuable input to the scheduler about these dependencies.
> Does this fit with your model of how schedtune is supposed to work? I
I think it's worth a try... provided that, if the results are
promising, we are eventually willing to replace the CPUFreq-specific
API exposed to drivers with a more generic interface exposed to both
kernel- and user-space.
If instead we end up with two different APIs to achieve the same goal,
that will just be confusing.
> have not looked at that stuff at all... are start_the_work() and
> stop_the_work() critical-section functions exposed to drivers?
Hmm... I don't quite get that question.
Which functions/critical-sections are you referring to?
> Regards,
> Mike
Cheers Patrick
> >
> > Regarding the second use-case b), this is a feature we try to address
> > using the global boost value. Right now the only consumer of that
> > global boost value is the FAIR scheduling class. However, it should be
> > quite easy to extend it with the integration of SchedDVFS into other
> > scheduling classes.
> >
> > > However, I might be the only one concerned with this use case
> > > right now.
> > >
> > > >
> > > > I'd think there has to be a summing of bandwidth requests from scheduler
> > > > class clients. MikeT raised the concern that in such schemes you often
> > > > end up with a bunch of extra overhead because everyone adds their own
> > > > fudge factor (Mike please correct me if I'm misstating our concern
> > > > here).
> > >
> > > To be clear, I raised that point because I've actually seen that in
> > > the past when TI implemented out-of-tree cpu frequency constraint
> > > systems. I'm not being hypothetical :-)
> >
> > That's the point, SchedTune aims at becoming a sort of (hopefully)
> > official solution to setup "frequency constraints".
> >
> > > > We should be able to control this in the scheduler classes though
> > > > and ensure headroom is only added after the requests are combined.
> > >
> > > Sounds promising. Everyone else in this thread supports aggregation
> > > (or summing) over maximum value, and I definitely won't argue the
> > > point on the list unless it presents a real problem in testing.
> > >
> > > Regards,
> > > Mike
> > >
> > > >
> > > > Thoughts?
> > >
> > >
> > >
> > > --
> > > Michael Turquette
> > > CEO
> > > BayLibre - At the Heart of Embedded Linux
> > > http://baylibre.com/
> >
> > Cheers Patrick
> >
> > --
> > #include <best/regards.h>
> >
> > Patrick Bellasi
> >
>
--
#include <best/regards.h>
Patrick Bellasi
On 10/12/2015 06:55 AM, Vincent Guittot wrote:
>> For the RT story, are you thinking to use rq->rt_avg in some way?
>
> Yes, that was my goal, but the deadline class is also accounted for in
> rq->rt_avg, which is then used in scale_rt_capacity. I was planning to
> remove the deadline class from rq->rt_avg and to use the per-cpu
> deadline bandwidth in scale_rt_capacity instead.
If the deadline contribution to rq->rt_avg is left in place, can RT and
deadline both be accounted for via that one mechanism for the purposes
of sched-dvfs?
I'm not sure how the accuracy/behavior of the deadline accounting to
rq->rt_avg differs from the forthcoming patchset Juri mentioned, but if
it's not as good, perhaps the mechanisms could be combined so that
rt_avg (and by extension the per-CPU capacity adjustments in CFS)
benefit as well.
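For reference, a much-simplified sketch of the idea behind scale_rt_capacity (an editor's illustration, not the actual kernel code): the capacity left for CFS is the CPU's original capacity scaled by the fraction of the averaging window not consumed by the RT (and, currently, DL) activity tracked in rq->rt_avg.

```python
def cfs_capacity(orig_capacity, rt_avg, period):
    """Scale capacity by the fraction of 'period' not eaten by rt_avg."""
    free = max(period - rt_avg, 0)
    return orig_capacity * free // period

# e.g. with rt_avg at 25% of the window, a 1024-capacity CPU is left
# with 768 for CFS.
```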