On Mon, Sep 09, 2013 at 02:02:47PM +0100, Catalin Marinas wrote:
On Sun, Sep 08, 2013 at 05:16:16PM +0100, Nicolas Pitre wrote:
[...]
So the concept of "policy" has to be split in two parts: what is _desired_ by the upper layer such as cpuidle as determined by the governor and its view of the system load and utilisation patterns vs implied costs, and the second part which is the _possible_ power saving mode according to the sum of all the constraints presented to MCPM by various requestors.
And that's where I think MCPM (or PSCI) should only be concerned with C-state concepts (and correct arbitration). Pushing actions based on the expected residency down to the MCPM back-end is a bad design decision IMHO.
Taking the TC2 code as an example (it may be extended, I don't know the plans here), it seems that the cpuidle driver is only concerned with the C1 state (CPU rather than cluster suspend). IIUC, cpuidle is not aware of deeper sleep states. The MCPM back-end would get expected residency information and make another decision for deeper sleep states. Where does it get the residency information from? Isn't this the estimation done by the cpuidle governor? At this point you pretty much move part of the cpuidle governor functionality (and its concepts like target residency) down to the MCPM back-end level. Such a split will have bad consequences longer term: code duplicated between back-ends, and cpuidle decision points that are harder to understand and maintain.
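The split between the _desired_ state and the _possible_ state can be sketched as plain arbitration. This is only an illustration under stated assumptions: the three-level state ladder and the names `CState` and `arbitrate` are hypothetical and are not MCPM, PSCI or cpuidle interfaces.

```python
from enum import IntEnum


class CState(IntEnum):
    # Hypothetical state ladder, ordered from shallowest to deepest.
    WFI = 0
    CPU_OFF = 1
    CLUSTER_OFF = 2


def arbitrate(desired, constraints):
    """Mechanism only: grant the deepest state that is no deeper than
    the governor's request and no deeper than any requestor's limit."""
    return min([desired, *constraints])
```

The governor (policy) supplies `desired`; the arbitration step never looks at expected residency, it only caps the request, e.g. `arbitrate(CState.CLUSTER_OFF, [CState.CPU_OFF])` grants `CState.CPU_OFF`.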
IMHO the subject of this thread should not be related to power management policy decisions and where they should live. The goal of MCPM and PSCI was not to define policy for power management but to provide mechanism, and I agree with Catalin on this: we have to keep them separate. Then if the MCPM or PSCI implementation wants to demote a C-state request because the code is about to flush the L2 cache with a wake-up interrupt pending, that's perfectly fine by me; that is what Intel HW does BTW. But those are just optimizations; policy is implemented in the kernel, regardless of MCPM or PSCI. Let's always keep in mind that those policy decisions might and will be wrong sometimes, e.g.:
1) last man in a cluster (in MCPM or PSCI kingdom) polls pending IRQs
2) No IRQs pending - last man starts flushing L2
3) a packet shows its head at the NI and triggers an IRQ
-> policy decision goes for a toss (ie L2 is flushed for nothing)
Since the kernel has no crystal ball, policy decisions might sometimes be wrong and that's FINE; this will happen even if MCPM and PSCI trim those policy decisions (demoting a C-state is fine, and it is done in the Intel world in HW all the time).
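The last-man sequence and the demotion case can be modelled roughly as follows. This is a toy model, not back-end code: `last_man_enter` and its flags are hypothetical names, and the assumption that a late IRQ leaves the CPU in a CPU-level state is mine, made only to show where the wasted flush comes from.

```python
from enum import IntEnum


class CState(IntEnum):
    # Hypothetical state ladder, ordered from shallowest to deepest.
    WFI = 0
    CPU_OFF = 1
    CLUSTER_OFF = 2


def last_man_enter(requested, irq_pending_at_check, irq_during_flush=False):
    """Return (granted_state, l2_flushed, flush_wasted) for the last man."""
    if requested != CState.CLUSTER_OFF:
        # Not a cluster request: nothing to arbitrate, L2 is left alone.
        return requested, False, False
    if irq_pending_at_check:
        # Step 1 found a wake-up IRQ: demote before touching L2.
        return CState.CPU_OFF, False, False
    # Step 2: no IRQ pending, so the L2 flush starts.
    if irq_during_flush:
        # Step 3: the IRQ shows up too late, L2 was flushed for nothing.
        return CState.CPU_OFF, True, True
    return CState.CLUSTER_OFF, True, False
```

The demotion in the second branch is the optimization; the third branch is the case no amount of trimming can avoid.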
And yes, the menu governor was written for Intel platforms, where the cluster is a non-existent concept in C-state terms; reading C-states (e.g. in TC2) is misleading on ARM, since we are forced to fill C-states with target residency values that cater for cluster states even if such a state *depends* on the state of other CPUs. The menu governor makes decisions on a per-CPU basis and this is not optimal for ARM, it is as simple as that.
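A small sketch of why a per-CPU choice is misleading for cluster states. Everything here is an assumption for illustration: the residency numbers are made up, and `pick_state` is a much simplified stand-in for a residency-based selection, not the menu governor's actual algorithm.

```python
# Hypothetical state depths and target residencies (microseconds).
WFI, CPU_OFF, CLUSTER_OFF = 0, 1, 2
TARGET_RESIDENCY_US = {WFI: 0, CPU_OFF: 500, CLUSTER_OFF: 3000}


def pick_state(predicted_idle_us):
    """Per-CPU choice: deepest state whose target residency fits the
    predicted idle time. The other CPUs are not consulted."""
    chosen = WFI
    for state in (WFI, CPU_OFF, CLUSTER_OFF):
        if predicted_idle_us >= TARGET_RESIDENCY_US[state]:
            chosen = state
    return chosen


def effective_states(requests):
    """What each CPU really gets: the cluster state is reached only
    when every CPU in the cluster has requested it."""
    cluster_ok = all(r == CLUSTER_OFF for r in requests)
    return [r if (r != CLUSTER_OFF or cluster_ok) else CPU_OFF
            for r in requests]
```

With predictions of 5000us and 100us for two CPUs in one cluster, the first CPU asks for `CLUSTER_OFF` but `effective_states` leaves it at `CPU_OFF`, because the second CPU only asked for `WFI`.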
All this long-winded explanation to say that the MCPM vs PSCI debate has nothing to do with power management policies, which are better kept in the generic kernel layers and improved for ARM as a whole.
IMHO the debate must be and is around the coordination interface, which is by far the most important feature of MCPM, and I think you summarized the concepts very well, with their respective pros and cons. If I am allowed to give my opinion, please do not split the coordination across layers, that would be a total disaster - CPUs must be coordinated at the level where synchronization is required (e.g. if disabling/enabling CCI has to be done in secure world, the coordination scheme must live in secure code).
Lorenzo