On Sun, Sep 08, 2013 at 05:16:16PM +0100, Nicolas Pitre wrote:
On Sun, 8 Sep 2013, Catalin Marinas wrote:
On 7 Sep 2013, at 21:31, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Sat, 7 Sep 2013, Catalin Marinas wrote:
You still want the power decision (policy) to happen in the non-secure OS but with the actual hardware access in firmware.
That's where things get murky. The policy comes as a result of last man determination, etc. In other words, the policy is not only about "I want to save power now". It is also "what kind of power saving I can afford now". And that's basically what MCPM does. With an abstract interface such as PSCI, that policy decision is moved into firmware.
Wrong. PSCI ops get an affinity parameter specifying whether it's a CPU or a cluster power down/suspend. Of course, you can always ask for cluster if you are only interested in power saving and let PSCI choose what is safe. There isn't anything in PSCI that would take the CPU vs cluster decision away from the non-secure OS.
And the MCPM framework is not the place for such CPU vs cluster policy either. This needs to be decided higher up in the cpuidle subsystem, in abstract terms like target residency and the time taken to recover from the various low power states. You may go for cluster down directly if 'last man', but you may as well go for CPU down first even if 'last man'. This is a decision to be taken by the cpuidle governor and *not* by MCPM. PSCI already allows this via the affinity parameter.
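To make the affinity parameter concrete, here is a sketch (not kernel code) of composing a PSCI CPU_SUSPEND power_state argument. The field layout follows the original PSCI 0.2 format; the exact encoding depends on the spec version and platform, and the state IDs here are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PSCI 0.2 original power_state format: bits[25:24] carry the affinity
 * level (0 = CPU, 1 = cluster, 2 = system), bit[16] the state type
 * (1 = powerdown, 0 = standby), bits[15:0] a platform-specific state
 * ID.  This is how the non-secure OS tells PSCI which affinity level
 * it is asking to power down -- the decision stays with the OS. */
static uint32_t psci_power_state(uint32_t aff_level, bool powerdown,
                                 uint16_t state_id)
{
    return ((aff_level & 0x3) << 24) |
           ((uint32_t)powerdown << 16) |
           state_id;
}
```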
I think this shows a misunderstanding of the role of MCPM on your part.
My understanding is mostly based on what's currently in mainline and the to-be-merged TC2 code. What I may not be aware of is future plans for MCPM, such as the future use of the residency parameter (which I don't think should be handled in MCPM, see more below).
Indeed the cpuidle layer is responsible for deciding what level of power saving should be applied. But that is done on a per CPU basis. It *has* to be done on a per CPU basis because it is too difficult to track what's going on on the other CPUs in every subsystem interested in some form of power management.
I agree.
What MCPM does is to receive this power saving request from cpuidle on individual CPUs including their target residency, etc. It also receives similar requests for CPU hotplug and so on. And then MCPM _arbitrates_ the combination of those requests according to
I also agree that (in the absence of anything else) MCPM needs to arbitrate the combination of such requests.
1) the strictest restrictions in terms of wake-up latency of _all_
   CPUs in the same power domain, and
Wouldn't the strictest restrictions just translate to min(C-state(CPUs-in-cluster)), min(C-state(clusters)), etc.? IOW, simple if/then/or/and rules, because deeper C-states have higher target residencies and wake-up latencies?
2) the state of the other CPUs, which might be in the process of coming
   back from an interrupt or any other event, and
3) the particularities of the hardware platform where this is happening.
It's fine for MCPM to handle this in the absence of any other synchronisation agent (which could be the firmware).
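The min() rule questioned above can be sketched as follows (a hypothetical illustration, not actual MCPM code): since a deeper C-state implies a longer wake-up latency, the strictest restriction across a power domain is simply the minimum requested state.

```c
#include <assert.h>

/* Per-CPU requested C-state index; a deeper state (larger index) has a
 * higher target residency and wake-up latency.  The strictest wake-up
 * constraint across the power domain is therefore the minimum, i.e.
 * the domain may only go as deep as its shallowest request. */
static int domain_allowed_state(const int *cpu_state, int ncpus)
{
    int min = cpu_state[0];

    for (int i = 1; i < ncpus; i++)
        if (cpu_state[i] < min)
            min = cpu_state[i];
    return min;
}
```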
So the concept of "policy" has to be split in two parts: what is _desired_ by the upper layer such as cpuidle as determined by the governor and its view of the system load and utilisation patterns vs implied costs, and the second part which is the _possible_ power saving mode according to the sum of all the constraints presented to MCPM by various requestors.
And that's where I think MCPM (or PSCI) should only be concerned with C-state concepts (and correct arbitration). Pushing actions based on the expected residency down to the MCPM back-end is a bad design decision IMHO.
Taking the TC2 code as an example (it may be extended, I don't know the plans here), it seems that the cpuidle driver is only concerned with the C1 state (CPU rather than cluster suspend). IIUC, cpuidle is not aware of deeper sleep states. The MCPM back-end would get the expected residency information and make another decision for deeper sleep states. Where does it get the residency information from? Isn't this the estimation done by the cpuidle governor? At this point you pretty much move part of the cpuidle governor functionality (and its concepts, like target residency) down to the MCPM back-end level. Such a split will have bad consequences in the longer term: code duplication between back-ends, and cpuidle decision points that are harder to understand and maintain.
And because the action of shutting down a CPU or a cluster may take some time (think of cache flushing), those constraints may also change _during_ the operation, and proper measures should be taken to re-evaluate the power management decision dynamically. And that can be achieved only by having simultaneous visibility into both the higher level requirements and the lower level changing hardware states.
I understand the races and how MCPM avoids them. But why not keep the concepts clear: (1) residency and best C-state recommendation in cpuidle (policy), (2) actual C-state hardware setting in MCPM (mechanism).
Point (1) is a cpuidle driver defining C states (for a single CPU, it doesn't need to be concerned with cluster state, just abstract states):
C1: CPU suspend mode
C2: cluster suspend mode
C3: system suspend mode
etc.
Each of these states has a corresponding target_residency and exit_latency. The cpuidle governor makes the best recommendation for each CPU individually. If, for example, it expects a long sleep for a CPU, it can ask for (or recommend) a C2/C3 state directly.
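A minimal sketch of such an abstract state table and the governor-style selection; the state names follow the list above, while the residency and latency numbers are made up for illustration (not TC2 values):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstract per-CPU C-state table. */
struct cstate {
    const char *name;
    unsigned int target_residency_us; /* minimum sleep worth entering */
    unsigned int exit_latency_us;     /* worst-case wake-up cost */
};

static const struct cstate states[] = {
    { "C1", 100,   50 },   /* CPU suspend */
    { "C2", 2000,  500 },  /* cluster suspend */
    { "C3", 20000, 5000 }, /* system suspend */
};

/* Governor-style pick: the deepest state whose target residency fits
 * the predicted idle time and whose exit latency respects the QoS
 * limit; falls back to the shallowest state otherwise. */
static int pick_state(unsigned int predicted_us,
                      unsigned int latency_limit_us)
{
    int best = 0;

    for (size_t i = 0; i < sizeof(states) / sizeof(states[0]); i++)
        if (states[i].target_residency_us <= predicted_us &&
            states[i].exit_latency_us <= latency_limit_us)
            best = (int)i;
    return best;
}
```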
Point (2) above is about MCPM (or PSCI) having an overall view of the cluster/system that allows it to select the best safe recommended C-state. Simplified pseudo-code:
if (all CPUs in cluster (have a recommended C2 state || are in power-down) &&
    no CPU in cluster is coming up)
        Enable cluster suspend
You can continue the logic for other C-states and add more logic about CPUs coming up to avoid races. But this still amounts to taking the strictest of all the states (where normally Cx is stricter than Cy for x < y) in a race-free manner.
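The pseudo-code above, transcribed into a runnable sketch (the structure and field names are hypothetical, not MCPM's):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU view used by the arbitration sketch. */
struct cpu_view {
    int recommended_state; /* 1 = CPU suspend, 2 = cluster suspend, ... */
    bool powered_down;
    bool coming_up;        /* woken by an interrupt, racing back up */
};

/* Direct transcription of the pseudo-code: cluster suspend is allowed
 * only if every CPU either recommends C2 (or deeper) or is already
 * powered down, and no CPU is in the process of coming up. */
static bool cluster_suspend_allowed(const struct cpu_view *cpu, int ncpus)
{
    for (int i = 0; i < ncpus; i++) {
        if (cpu[i].coming_up)
            return false;
        if (!(cpu[i].recommended_state >= 2 || cpu[i].powered_down))
            return false;
    }
    return true;
}
```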
What I don't get is why you want to make decisions based on expected residency in the MCPM (framework or back-end). Isn't the C-state and the strictness ordering enough?
So I reiterate my assertion that something is wrong in the overall secure OS architecture if it has to be that intimate with power management, to the point of locking it up into firmware in order to remain secure.
One of the ARM security architecture features is secure vs non-secure cache separation. Once the non-secure OS actions can affect the secure caches, the security model is broken. In such case the only way the secure OS can be secure is by not relying on its caches. That's a pretty simple model.