On Tue, May 21, 2013 at 10:08:29PM +0100, Sebastian Capella wrote:
Thanks Liviu!
Some comments below..
Quoting Liviu Dudau (2013-05-21 10:15:42)
... Which side of the interface are you actually thinking of?
Both, I'm really just trying to understand the problem.
I don't think there is any C-state other than simple idle (which translates into a WFI for the core) that *doesn't* take into account power domain latencies and code path lengths to reach that state.
I'm speaking more about additional C-states beyond the lowest independent compute-domain C-state, where we may add states that reduce power further at a higher latency cost. These may change power states for the rest of the SoC or for external power chips/supplies. Those states would effectively enter the lowest PSCI C-state, but then take additional steps in the CPUidle hardware-specific driver.
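[A minimal sketch of that arrangement, purely for illustration; every name below is invented and is not a real PSCI or CPUidle interface. The vendor-specific deep state does its extra SoC/PMIC work in the platform driver and then falls through to the lowest firmware-managed CPU state.]

#include <stdio.h>

/* Stubs standing in for SoC-specific work and for the firmware call. */
static void soc_prepare_external_supplies(void) { printf("notify external PMIC\n"); }
static void soc_quiesce_noncpu_domains(void)    { printf("quiesce non-CPU power domains\n"); }
static int firmware_enter_lowest_cpu_state(void)
{
	printf("enter lowest firmware-managed CPU state\n");
	return 0;
}

/* A vendor C-state deeper than the lowest PSCI-managed one: the extra
 * SoC/PMIC steps live in the platform idle driver, then the core is handed
 * to the firmware exactly as for the lowest PSCI C-state.  On wake-up the
 * driver would undo the extra steps in reverse order. */
static int soc_enter_deep_idle(void)
{
	soc_prepare_external_supplies();
	soc_quiesce_noncpu_domains();
	return firmware_enter_lowest_cpu_state();
}

int main(void)
{
	return soc_enter_deep_idle();
}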
Quoting from the PSCI spec:
"ARM systems generally include a power controller which provides the necessary mechanisms to control processor power. It normally provides interfaces to allow a number of power management functions. These often include support for transitioning processors, clusters or a superset, into low power states, where the processors are either fully switched off, or in quiescent states where they are not executing code. ARM strongly recommends that control of these states, via this power controller, is vested in the secure world. Otherwise, the OSPM could enter a low power mode without informing the Trusted OS. Even if such an arrangement could be made robust, it is unlikely to perform as well. In particular, for states where the core is fully power gated, a longer boot sequence would take place upon wake up as full initialization would be required by the secure world. This would be required as the secure components would effectively be booting from scratch every time. On a system where this power control is vested in the Secure world, these components would have an opportunity to save their state before powering off, allowing a faster resumption on power up. In addition, the secure world might need to manage peripherals as part of a power transition."
If you don't have such a power controller in your system then yes, you will have to drive the hardware from the CPUidle hw driver. But I don't see the need of a separate C-state for that.
I would say that the C-states I have listed further down should cover most of the cases, maybe with the addition of a SYSTEM_SUSPEND state if I understood your concerns correctly.
Going on a tangent a bit:
To me, the C-states are like layers in an onion: each deeper C-state includes the C-states that come earlier in the list. Therefore, you describe a C-state in terms of the minimum total time to spend in that state, and that includes the worst-case transition times (the cost of reaching the state and of coming out of it). A completely made-up example:
CPU_ON          < 2ms
CPU_IDLE        > 2ms
CPU_OFF         > 10ms
CLUSTER_OFF     > 500ms
SYSTEM_SUSPEND  > 5min
SYSTEM_OFF      > 1h
If you do that then the CPUidle driver decision becomes as simple as finding the right state that would not lead to a missed event, and you don't really have to understand the costs of the host OS (if there is any). It should match the expectations of a real time system as well, if the table is correctly fine-tuned (and if one understands that a real time system is about constant-time response, not immediate response).
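[A minimal sketch of that selection rule, using the made-up thresholds from the example above; none of this is a real governor or API. The decision reduces to picking the deepest state whose minimum residency still fits before the next expected event.]

#include <stdio.h>
#include <stdint.h>

struct c_state {
	const char *name;
	uint64_t min_residency_us;	/* worst-case entry + exit + break-even */
};

/* Ordered shallow -> deep, as in the "onion" description above. */
static const struct c_state states[] = {
	{ "CPU_ON",         0 },		/* running, no transition cost */
	{ "CPU_IDLE",       2000 },		/* > 2ms   */
	{ "CPU_OFF",        10000 },		/* > 10ms  */
	{ "CLUSTER_OFF",    500000 },		/* > 500ms */
	{ "SYSTEM_SUSPEND", 300000000ULL },	/* > 5min  */
	{ "SYSTEM_OFF",     3600000000ULL },	/* > 1h    */
};

/* Pick the deepest state whose minimum residency fits before the next event. */
static const struct c_state *select_state(uint64_t us_to_next_event)
{
	const struct c_state *best = &states[0];
	int i;

	for (i = 1; i < (int)(sizeof(states) / sizeof(states[0])); i++) {
		if (us_to_next_event >= states[i].min_residency_us)
			best = &states[i];
	}
	return best;
}

int main(void)
{
	printf("next event in 3ms -> %s\n", select_state(3000)->name);    /* CPU_IDLE    */
	printf("next event in 1s  -> %s\n", select_state(1000000)->name); /* CLUSTER_OFF */
	return 0;
}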
I don't know how to draw the line between the host OS costs and the guest OS costs when using target latencies. On one hand I think that the host OS should add its own costs into what gets passed to the guest and the guest will see a slower than baremetal system in terms of state transitions;
I was thinking of something like this as well. Is there a way to query the state transition cost information through PSCI? Would there be a way to have the layers of hosts/monitors/etc. contribute the cost of their paths to the query results?
Possibly. The PSCI spec doesn't specify any API for querying the C-state costs because the way to do so is still up in the air. We know that the server world would like to carry on using ACPI to describe those states, while device-tree-based systems would probably invent a different way or learn how to integrate with ACPI.
... on the other hand I would like to see the guest OS shielded from this type of information, as there are too many variables behind it (is the host OS also running under some monitor code? are all transitions to the same state happening in constant time, or do they depend on the number of cores involved, their state, etc, etc)
I agree, but don't see how. In our systems, we do very much care about the costs, and have ~real time constraints to manage. I think we need a good understanding of costs for the hw states.
And are those costs constant? Does the time it takes to do a cluster shutdown depend on how many CPUs you have online? Does having the DMA engine on add to the quiescence time? While I don't doubt that you understand the minimum time constraints that the hardware imposes, it is the combination of all the elements in the system that are under software control that gives the final answer, and in most cases that answer is "it depends".
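[For illustration only, with hypothetical names and no real API: one reason such a cost is not constant is that a CPU asking for a cluster state only pays the full cluster cost if it happens to be the last CPU in the cluster to go idle; otherwise it can only reach the core-level state, with a different cost.]

#include <stdio.h>
#include <stdbool.h>

struct cluster {
	int cpus_online;	/* CPUs not hot-unplugged               */
	int cpus_idle;		/* CPUs currently in (or entering) idle */
};

/* Effective state this CPU can reach if it requests CLUSTER_OFF now. */
static const char *effective_state(const struct cluster *cl)
{
	/* The requesting CPU counts itself as idle. */
	bool last_man = (cl->cpus_idle + 1 == cl->cpus_online);

	return last_man ? "CLUSTER_OFF" : "CPU_OFF";
}

int main(void)
{
	struct cluster cl = { .cpus_online = 4, .cpus_idle = 1 };

	/* Two other CPUs still running: only core power-down is reachable. */
	printf("4 online, 1 idle -> %s\n", effective_state(&cl));

	/* Everyone else already idle: this CPU takes the whole cluster down. */
	cl.cpus_idle = 3;
	printf("4 online, 3 idle -> %s\n", effective_state(&cl));
	return 0;
}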
If one uses a simple set of C-states (CPU_ON, CPU_IDLE, CPU_OFF, CLUSTER_OFF, SYSTEM_OFF) then the guest could make requests independent of the host OS latencies _after the relevant translations between time-to-next-event and intended target C-state have been performed_.
I think that if we don't know the real cost of entering a state, we will basically end up choosing the wrong states on many occasions.
True. But that "real" cost is usually an estimate of the worst case, or an average time, right?
CPUidle is already binning the allowable costs into a specific state. If we decide that CPUidle does not know the real cost of the states then the binning will sometimes be wrong, and CPUidle would not be selecting the correct states. I think this could have bad side effects for real time systems.
CPUidle does know the costs. The "reality" of those costs depends on the system you are running (virtualised or not, trusted OS trapping your calls or not). If the costs do not reflect the actual transition time then yes, CPUidle will make the wrong decision and the system won't work as intended. I'm not advocating doing that.
Also, I don't understand your remark regarding real time systems. If the CPUidle costs are wrong the decision will be wrong regardless of the type of system you use. Or are you concerned that being too conservative and lying to the OS about the actual cost for the system to transition to the new state at that moment will introduce unnecessary delays and forgo the real time functionality?
For my purposes and as things are today, I'd likely factor the (probably pre-known and measured) host OS/monitor costs into the CPUidle DT entries and have CPUidle run the show. At the lower layers, it won't matter what is passed through as long as the correct state is chosen.
Understood. I'm advocating the same thing with the only added caveat that the state you choose is not a physical system state in all cases, but a state that makes sense for the OS running at that level. As such, the numbers that will be used by CPUidle will be in the "ballpark" region rather than absolute numbers.
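[A minimal sketch of that approach, with entirely hypothetical numbers and names: the pre-measured host/monitor overhead is simply added to the baremetal figure for each state before the table is handed to CPUidle, so the OS sees slower-than-baremetal but still correctly ordered states.]

#include <stdio.h>
#include <stdint.h>

struct idle_cost {
	const char *name;
	uint64_t hw_latency_us;		/* baremetal worst-case entry + exit */
	uint64_t monitor_overhead_us;	/* measured host/monitor path cost   */
};

static const struct idle_cost table[] = {
	{ "CPU_IDLE",    50,   5 },
	{ "CPU_OFF",     800,  150 },
	{ "CLUSTER_OFF", 4000, 1200 },
};

/* Value that would actually go into the state description used by CPUidle. */
static uint64_t published_latency(const struct idle_cost *c)
{
	return c->hw_latency_us + c->monitor_overhead_us;
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		printf("%-12s %6llu us\n", table[i].name,
		       (unsigned long long)published_latency(&table[i]));
	return 0;
}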
Any running OS should only be concerned with getting the time to the next event right (be it real time constrained or not) and finding out which C-state will guarantee availability at that time. If one doesn't know when the next event will come then being conservative should be good enough. There is no way you will have a ~real time system if you transition to cluster off and the real cost of coming out is measured in milliseconds, regardless of how you came to that decision.
Best regards, Liviu
Thanks,
Sebastian