On Sun, 8 Sep 2013, Catalin Marinas wrote:
On 7 Sep 2013, at 21:31, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Sat, 7 Sep 2013, Catalin Marinas wrote:
You still want the power decision (policy) to happen in the non-secure OS but with the actual hardware access in firmware.
That's where things get murky. The policy comes as a result of last man determination, etc. In other words, the policy is not only about "I want to save power now". It is also "what kind of power saving I can afford now". And that's basically what MCPM does. With an abstract interface such as PSCI, that policy decision is moved into firmware.
Wrong. PSCI ops get an affinity parameter indicating whether it's a CPU or cluster power down/suspend. Of course, you can always ask for cluster if only interested in power saving and PSCI can choose what is safe. There isn't anything in PSCI that would take the CPU vs cluster decision away from the non-secure OS.
And the MCPM framework is not the place for such CPU vs cluster policy either. This needs to be decided higher up in the cpuidle subsystem and in abstract terms like target residency and the time taken to recover from various low power states. You may go for cluster down directly if 'last man' but may as well go for CPU down first even if 'last man'. This is a decision to be taken by the cpuidle governor and *not* by MCPM. PSCI already allows this via the affinity parameter.
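As a rough illustration of the kind of per-CPU decision a cpuidle governor makes from target residency and recovery time, here is a minimal Python sketch. It is a toy model, not kernel code: the state names, the numbers, and the `select_idle_state` helper are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    """One idle state as seen by a governor (hypothetical values)."""
    name: str
    target_residency_us: int  # minimum idle time for the state to pay off
    exit_latency_us: int      # time needed to recover from the state


# Hypothetical table, ordered from shallowest to deepest.
IDLE_STATES = [
    IdleState("wfi", 1, 1),
    IdleState("cpu-down", 500, 100),
    IdleState("cluster-down", 5000, 1000),
]


def select_idle_state(predicted_idle_us, latency_limit_us, states=IDLE_STATES):
    """Pick the deepest state that fits both the predicted idle time
    and the wake-up latency this CPU can tolerate."""
    chosen = states[0]  # the shallowest state is always the fallback
    for state in states:
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state
    return chosen


# A long predicted idle period with a tight latency limit stops at CPU down.
print(select_idle_state(10000, 200).name)
```

Note that this choice only reflects one CPU's view of its own load and latency tolerance.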
I think this shows a misunderstanding of the role of MCPM on your part.
Indeed the cpuidle layer is responsible for deciding what level of power saving should be applied. But that is done on a per CPU basis. It *has* to be done on a per CPU basis because it is too difficult to track what's going on on the other CPUs in every subsystem interested in some form of power management.
What MCPM does is to receive this power saving request from cpuidle on individual CPUs including their target residency, etc. It also receives similar requests for CPU hotplug and so on. And then MCPM _arbitrates_ the combination of those requests according to 1) the strictest restrictions in terms of wake-up latency of _all_ CPUs in the same power domain, and 2) the state of the other CPUs which might be in the process of coming back from an interrupt or any other event, and 3) the particularities of the hardware platform where this is happening. Not only that, but the determination of the best power saving mode to engage must be done in a race free manner that satisfies all the criteria on all CPUs. And the "race free" part must not be underestimated, because the hardware might be in varying states of coherency here, hence the MCPM ad hoc state machine outside of the regular kernel exclusion mechanisms, which took us so long to get right.
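To make the arbitration above a bit more concrete, here is a minimal Python sketch of combining per-CPU requests into one decision for the power domain. It only models criteria 1) and 2); it says nothing about the race-free state machine or platform particularities. All names (`PowerState`, `CpuRequest`, `arbitrate_cluster`) and latency figures are hypothetical and are not taken from the MCPM code.

```python
from dataclasses import dataclass
from enum import IntEnum


class PowerState(IntEnum):
    """Hypothetical power states, ordered from shallowest to deepest."""
    RUNNING = 0
    CPU_DOWN = 1
    CLUSTER_DOWN = 2


# Hypothetical wake-up latencies in microseconds; real values are platform specific.
WAKEUP_LATENCY_US = {
    PowerState.RUNNING: 0,
    PowerState.CPU_DOWN: 100,
    PowerState.CLUSTER_DOWN: 1000,
}


@dataclass
class CpuRequest:
    """What one CPU asked for (via cpuidle, hotplug, and so on)."""
    requested: PowerState
    max_wakeup_latency_us: int


def arbitrate_cluster(requests):
    """Return the deepest state the power domain as a whole may enter."""
    if not requests:
        raise ValueError("need at least one CPU request")
    # A CPU that is running (or on its way back up) vetoes any power-down.
    if any(r.requested == PowerState.RUNNING for r in requests):
        return PowerState.RUNNING
    # The strictest latency tolerance among all CPUs bounds the result.
    strictest = min(r.max_wakeup_latency_us for r in requests)
    # The domain can go no deeper than the shallowest per-CPU request.
    deepest_allowed = min(r.requested for r in requests)
    candidates = [
        s for s in PowerState
        if s <= deepest_allowed and WAKEUP_LATENCY_US[s] <= strictest
    ]
    return max(candidates)
```

With these made-up numbers, two CPUs both asking for cluster down end up with CPU down only if one of them tolerates less than 1000 us of wake-up latency.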
So the concept of "policy" has to be split in two parts: what is _desired_ by the upper layer such as cpuidle as determined by the governor and its view of the system load and utilisation patterns vs implied costs, and the second part which is the _possible_ power saving mode according to the sum of all the constraints presented to MCPM by various requestors. And because the action of shutting down a CPU or a cluster may take some time (think of cache flushing) then those constraints may also change _during_ the operation and proper measures should be taken to re-evaluate the power management decision dynamically. And that can be achieved only by having simultaneous visibility into both the higher level requirements and the lower level changing hardware states.
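The point about constraints changing _during_ a shutdown can be sketched as follows. This is a simplified, single-threaded Python model under stated assumptions: the `power_down_cluster` helper, the step callables, and the `still_allowed` check are invented for illustration, and a real implementation has to deal with concurrency and non-coherent hardware, which this sketch ignores.

```python
def power_down_cluster(still_allowed, teardown_steps):
    """Attempt a cluster power-down, re-evaluating constraints as it goes.

    still_allowed: callable returning True while cluster-down is still possible.
    teardown_steps: callables, each one slice of slow work (e.g. cache flushing).
    Returns "cluster_down" on success or "aborted" if constraints changed.
    """
    for step in teardown_steps:
        # Re-check before each slow step: another CPU may have woken up.
        if not still_allowed():
            return "aborted"
        step()
    # Final check right before the point of no return.
    if not still_allowed():
        return "aborted"
    return "cluster_down"


# Usage: a CPU becomes inbound while the second step is in progress.
state = {"inbound_cpu": False}
done = []


def flush_l1():
    done.append("flush_l1")


def flush_l2():
    done.append("flush_l2")
    state["inbound_cpu"] = True  # simulated wake-up event during the flush


def disable_coherency():
    done.append("disable_coherency")


result = power_down_cluster(
    lambda: not state["inbound_cpu"],
    [flush_l1, flush_l2, disable_coherency],
)
print(result, done)
```

The check needs both the upper-layer request and the current hardware state, which is why it cannot live entirely on one side of an opaque interface in this model.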
The coupled C-states support in cpuidle is a good example of where this separation was not done properly. It was used initially to handle the CPU vs cluster power down on TC2, and that turned out to be impossible to work with outside of the cpuidle context, such as with IKS or CPU hotplug. Some people are even thinking of getting rid of the coupled C-state layer entirely in favor of MCPM, even on pre-b.L systems, since it represents a better separation of responsibilities and a cleaner design overall.
Of course MCPM is not "done" yet. There are many things that still can be improved to be more efficient. But those improvements need research and experiments. And those might be either generic or completely different from one hardware platform to another. Yet, what MCPM provides is a proper separation of power management responsibilities so the higher level and the lower level can be developed and improved separately.
And above all, it needs a way for *easy* updates of the corresponding code over time when improvements are developed.
I understand your uneasiness with more complex firmware but I now wonder whether you completely missed the point of PSCI. I'll restate - it does *not* take away the power management policy from the non-secure, high-level OS. It does what it is *asked* to do and in a safe, secure manner.
I also wonder if on your end you missed the point of MCPM. I hope that you understand now that power management policy is far more elaborate and intricate than what the cpuidle layer should be concerned about.
So I reiterate my assertion that something is wrong in the overall secure OS architecture if it has to be that intimate with power management to the point of locking it up into firmware in order to remain secure.
Nicolas