Also cc'ing the linaro-kernel list.
Hi,
This patch forwards target residency information from the arm_big_little CPUidle driver to MCPM.
If multiple powerdown states are used, the vendor-specific code will need a way to distinguish which C-state was intended.
I do not have TC2 hardware to verify this. Would someone be able to help verify this change on TC2?
Thanks!
Sebastian
Sebastian Capella (1):
  cpuidle: arm_big_little: route target residency to mcpm

 drivers/cpuidle/arm_big_little.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
Pass residency information to the mcpm_cpu_suspend. The information is taken from the target_residency of the intended C-state.
When a platform uses multiple powerdown cstates, the residency information indicates which powerdown state is targeted. Multiple powerdown cstate information can be maintained in the device tree and the vendor specific handling will then have enough information to determine what power state to enter without needing additional changes to the big_little framework.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
---
 drivers/cpuidle/arm_big_little.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/cpuidle/arm_big_little.c b/drivers/cpuidle/arm_big_little.c
index a430800..8332b05 100644
--- a/drivers/cpuidle/arm_big_little.c
+++ b/drivers/cpuidle/arm_big_little.c
@@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned long arg)
 	unsigned int cpu = mpidr & 0xf;

 	mcpm_set_entry_vector(cpu, cluster, cpu_resume);
-	mcpm_cpu_suspend(0);  /* 0 should be replaced with better value here */
+	mcpm_cpu_suspend(arg);
 	return 1;
 }

@@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
 {
 	struct timespec ts_preidle, ts_postidle, ts_idle;
 	int ret;
+	struct cpuidle_state *state = &drv->states[idx];

 	/* Used to keep track of the total time in idle */
 	getnstimeofday(&ts_preidle);
@@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,

 	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);

-	ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
+	ret = cpu_suspend((unsigned long) state->target_residency,
+			  bl_powerdown_finisher);
 	if (ret)
 		BUG();
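For illustration only (not part of the patch): a vendor's MCPM backend suspend handler could then use the forwarded value along these lines, assuming the value reaches the backend as the u64 expected residency given to mcpm_cpu_suspend(). The 5000us cutoff and the example_* helpers below are made up.

#include <linux/types.h>

static void example_cpu_powerdown(void)
{
	/* hypothetical: gate only this CPU, cluster stays up */
}

static void example_cluster_powerdown(void)
{
	/* hypothetical: flush shared caches, power the whole cluster down */
}

static void example_pm_suspend(u64 expected_residency)
{
	/* map the residency hint onto a platform power state */
	if (expected_residency < 5000)
		example_cpu_powerdown();
	else
		example_cluster_powerdown();
}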
Hi Sebastian,
On Mon, May 13, 2013 at 07:53:42PM +0100, Sebastian Capella wrote:
Pass residency information to the mcpm_cpu_suspend. The information is taken from the target_residency of the intended C-state.
When a platform uses multiple powerdown cstates, the residency information indicates which powerdown state is targeted. Multiple powerdown cstate information can be maintained in the device tree and the vendor specific handling will then have enough information to determine what power state to enter without needing additional changes to the big_little framework.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>

 drivers/cpuidle/arm_big_little.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/arm_big_little.c b/drivers/cpuidle/arm_big_little.c
index a430800..8332b05 100644
--- a/drivers/cpuidle/arm_big_little.c
+++ b/drivers/cpuidle/arm_big_little.c
I could not find a branch that contains this file. Which git tree and branch are you using?
@@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned long arg)
 	unsigned int cpu = mpidr & 0xf;

 	mcpm_set_entry_vector(cpu, cluster, cpu_resume);
-	mcpm_cpu_suspend(0);  /* 0 should be replaced with better value here */
+	mcpm_cpu_suspend(arg);
 	return 1;
 }

@@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
 {
 	struct timespec ts_preidle, ts_postidle, ts_idle;
 	int ret;
+	struct cpuidle_state *state = &drv->states[idx];

 	/* Used to keep track of the total time in idle */
 	getnstimeofday(&ts_preidle);
@@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,

 	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);

-	ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
+	ret = cpu_suspend((unsigned long) state->target_residency,
+			  bl_powerdown_finisher);
I don't think you should pass the target residency here but the intended C-state. Think about what will happen when you run this in a guest kernel: is the target_residency the same if the guest has been migrated from a big core that might have a faster execution of the down/up path to a little core that is slower? The intended C-state should stay the same, regardless of the actual time it takes to get there and out, which affects the actual time spent inside the state.
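For concreteness, the call-site change being suggested would look something like this (untested sketch only; the MCPM layer would then have to interpret a state index rather than a residency):

	/* untested sketch: pass the chosen C-state index instead of its residency */
	ret = cpu_suspend((unsigned long) idx, bl_powerdown_finisher);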
Best regards, Liviu
On 05/15/2013 05:24 PM, Liviu Dudau wrote:
Hi Sebastian,
On Mon, May 13, 2013 at 07:53:42PM +0100, Sebastian Capella wrote:
Pass residency information to the mcpm_cpu_suspend. The information is taken from the target_residency of the intended C-state.
When a platform uses multiple powerdown cstates, the residency information indicates which powerdown state is targeted. Multiple powerdown cstate information can be maintained in the device tree and the vendor specific handling will then have enough information to determine what power state to enter without needing additional changes to the big_little framework.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>

 drivers/cpuidle/arm_big_little.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/arm_big_little.c b/drivers/cpuidle/arm_big_little.c
index a430800..8332b05 100644
--- a/drivers/cpuidle/arm_big_little.c
+++ b/drivers/cpuidle/arm_big_little.c
I could not find a branch that contains this file. Which git tree and branch are you using?
I believe it should apply to:
https://git.linaro.org/gitweb?p=landing-teams/working/arm/kernel.git%3Ba=blo...
@@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned long arg)
 	unsigned int cpu = mpidr & 0xf;

 	mcpm_set_entry_vector(cpu, cluster, cpu_resume);
-	mcpm_cpu_suspend(0);  /* 0 should be replaced with better value here */
+	mcpm_cpu_suspend(arg);
 	return 1;
 }

@@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
 {
 	struct timespec ts_preidle, ts_postidle, ts_idle;
 	int ret;
+	struct cpuidle_state *state = &drv->states[idx];

 	/* Used to keep track of the total time in idle */
 	getnstimeofday(&ts_preidle);
@@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,

 	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);

-	ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
+	ret = cpu_suspend((unsigned long) state->target_residency,
+			  bl_powerdown_finisher);
I don't think you should pass the target residency here but the intended C-state. Think about what will happen when you run this in a guest kernel: is the target_residency the same if the guest has been migrated from a big core that might have a faster execution of the down/up path to a little core that is slower? The intended C-state should stay the same, regardless of the actual time it takes to get there and out, which affects the actual time spent inside the state.
Best regards, Liviu
Thanks Daniel!
Liviu,
I have been using on the linux-linaro branch in the linux-linaro-tracking repository here:
https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git%3Ba=shortlo...
Sorry for missing that.
Thanks!
Sebastian
On 15 May 2013 08:47, Daniel Lezcano daniel.lezcano@linaro.org wrote:
On 05/15/2013 05:24 PM, Liviu Dudau wrote:
Hi Sebastian,
On Mon, May 13, 2013 at 07:53:42PM +0100, Sebastian Capella wrote:
Pass residency information to the mcpm_cpu_suspend. The information is taken from the target_residency of the intended C-state.
When a platform uses multiple powerdown cstates, the residency information indicates which powerdown state is targeted. Multiple powerdown cstate information can be maintained in the device tree and the vendor specific handling will then have enough information to determine what power state to enter without needing additional changes to the big_little framework.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>

 drivers/cpuidle/arm_big_little.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/arm_big_little.c b/drivers/cpuidle/arm_big_little.c
index a430800..8332b05 100644
--- a/drivers/cpuidle/arm_big_little.c
+++ b/drivers/cpuidle/arm_big_little.c
I could not find a branch that contains this file. Which git tree and branch are you using?
I believe it should apply to:
https://git.linaro.org/gitweb?p=landing-teams/working/arm/kernel.git%3Ba=blo...
@@ -89,7 +89,7 @@ static int notrace bl_powerdown_finisher(unsigned long arg)
 	unsigned int cpu = mpidr & 0xf;

 	mcpm_set_entry_vector(cpu, cluster, cpu_resume);
-	mcpm_cpu_suspend(0);  /* 0 should be replaced with better value here */
+	mcpm_cpu_suspend(arg);
 	return 1;
 }

@@ -107,6 +107,7 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,
 {
 	struct timespec ts_preidle, ts_postidle, ts_idle;
 	int ret;
+	struct cpuidle_state *state = &drv->states[idx];

 	/* Used to keep track of the total time in idle */
 	getnstimeofday(&ts_preidle);
@@ -117,7 +118,8 @@ static int bl_enter_powerdown(struct cpuidle_device *dev,

 	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);

-	ret = cpu_suspend((unsigned long) dev, bl_powerdown_finisher);
+	ret = cpu_suspend((unsigned long) state->target_residency,
+			  bl_powerdown_finisher);
I don't think you should pass the target residency here but the intended C-state. Think about what will happen when you run this in a guest kernel: is the target_residency the same if the guest has been migrated from a big core that might have a faster execution of the down/up path to a little core that is slower? The intended C-state should stay the same, regardless of the actual time it takes to get there and out, which affects the actual time spent inside the state.
Best regards, Liviu
On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
Thanks Daniel!
Liviu,
I have been using on the linux-linaro branch in the linux-linaro-tracking repository here:
https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git%3Ba=shortlo...
Generally, that's the Linaro kernel tree people should use and what is built daily and released monthly.
It's just it hasn't moved to 3.10 yet (will do in the next day or so) but the topic branches which feed into it (that Liviu pointed out) have already made that move.
Hi Liviu,
Regarding your comments about using the C-state instead of the residency, we based this on the existing mcpm_cpu_suspend() call, which currently takes a residency (with 0 meaning lowest power).
We also use these calls (including mcpm_cpu_suspend) in the hot plug/suspend path; however, that path does not know about C-states. I suspect others may want to do the same. Do you know how suspend is done on TC2?
Regarding guest kernels, I don't think I understand the implications. If we migrate between cores (having different parameters) in the middle of a C-state transition, can we have correct behavior? Wouldn't it be worse to migrate to a lower C-state than we had intended?
Thanks,
Sebastian
On 15 May 2013 10:07, Jon Medhurst (Tixy) tixy@linaro.org wrote:
On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
Thanks Daniel!
Liviu,
I have been using on the linux-linaro branch in the linux-linaro-tracking repository here:
https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git%3Ba=shortlo...
Generally, that's the Linaro kernel tree people should use and what is built daily and released monthly.
It's just it hasn't moved to 3.10 yet (will do in the next day or so) but the topic branches which feed into it (that Liviu pointed out) have already made that move.
-- Tixy
On Wed, May 15, 2013 at 07:05:10PM +0100, Sebastian Capella wrote:
Hi Liviu,
Regarding your comments about using the C-state instead of the residency, we based this on the existing mcpm_cpu_suspend() call, which currently takes a residency (with 0 meaning lowest power).
We also use these calls (including mcpm_cpu_suspend) in the hot plug/suspend path; however, that path does not know about C-states. I suspect others may want to do the same. Do you know how suspend is done on TC2?
Regarding guest kernels, I don't think I understand the implications. If we migrate between cores (having different parameters) in the middle of a C-state transition, can we have correct behavior? Wouldn't it be worse to migrate to a lower C-state than we had intended?
Thanks,
Sebastian
On 15 May 2013 10:07, Jon Medhurst (Tixy) <tixy@linaro.orgmailto:tixy@linaro.org> wrote: On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
Thanks Daniel!
Liviu,
I have been using on the linux-linaro branch in the linux-linaro-tracking repository here:
https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git%3Ba=shortlo...
Generally, that's the Linaro kernel tree people should use and what is built daily and released monthly.
It's just it hasn't moved to 3.10 yet (will do in the next day or so) but the topic branches which feed into it (that Liviu pointed out) have already made that move.
-- Tixy
Hi Sebastian,
From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.
Regarding the migration of the guest kernels, it should be transparent (to a certain extent) whether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding of what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Let's take an example: one core in the cluster says that it can go as deep as cluster shutdown, but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shut down, has to do a lot of cache maintenance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then subtract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintenance operations before asking the hypervisor for a target residency, and the host still has to do the math again to validate that the guest request is still valid.
If we choose to use the target C-state, the request validation is simplified to a comparison between each core's target C-state and the lowest common C-state per cluster, all done in the host.
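As a purely illustrative sketch of that comparison (assuming C-states are encoded as integers, with a larger value meaning a deeper state; the names are invented):

enum example_cstate { CSTATE_RUN, CSTATE_IDLE, CSTATE_CPU_OFF, CSTATE_CLUSTER_OFF };

/* The cluster may only enter the shallowest state requested by any core. */
static enum example_cstate example_cluster_target(const enum example_cstate *req,
						  int ncpus)
{
	enum example_cstate target = CSTATE_CLUSTER_OFF;
	int i;

	for (i = 0; i < ncpus; i++)
		if (req[i] < target)
			target = req[i];

	return target;
}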
Of course, by describing C-states in terms of target residency times both schemes can be considered equivalent. But that target residency time is not constant for all code paths and for all conditions and that makes the decision process more complicated.
Hope that provides some clarification.
Best regards, Liviu
On Thu, 16 May 2013, Liviu Dudau wrote:
From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.
That's all good.
My worry is about the definition of all the different C-states on all the different platforms. I think it is simpler to have the kernel tell the firmware what it anticipates in terms of load/quiescence periods (e.g. the next interrupt is likely to happen in x millisecs), and let the firmware and/or low-level machine specific backend translate that into the appropriate C-state on its own. After all, the firmware is supposed to know what is the best C-state to apply given a target latency and the current state of the surrounding CPUs, which may also differ depending on the cluster type, etc.
Regarding the migration of the guest kernels, it should be transparent (to a certain extent) whether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding of what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Let's take an example: one core in the cluster says that it can go as deep as cluster shutdown, but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shut down, has to do a lot of cache maintenance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then subtract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintenance operations before asking the hypervisor for a target residency, and the host still has to do the math again to validate that the guest request is still valid.
I don't follow your reasoning. Why would the guest have to care about what the host can do at all and in what amount of time? What the guest should tell the host is this: "I don't anticipate any need for the CPU during the next 500 ms so take this as a hint to perform the most efficient power saving given the constraints you alone have the knowledge of." The host should know how long it takes to flush its cache, whether or not that cache is in use by other guests, etc. But the guest should not care.
And in this case the math performed by the guest and the host are completely different.
If we choose to use the target C-state, the request validation is simplified to a comparison between each core's target C-state and the lowest common C-state per cluster, all done in the host.
Of course, by describing C-states in terms of target residency times both schemes can be considered equivalent. But that target residency time is not constant for all code paths and for all conditions and that makes the decision process more complicated.
For who?
If the guest is responsible for choosing a C-state itself and pass it on to the host, it has to process through a set of available C-states and select the proper one according to the target residency time it must compute anyway since this is all the scheduler can tell you. And since those C-states are likely to have different latency profiles on different clusters, the guest will have to query the type of host it is running on or the available C-states each time it wants to select one, etc. So I don't think passing the target residency directly to the host is more complicated when you look at the big picture.
Nicolas
On Thu, May 16, 2013 at 07:21:55PM +0100, Nicolas Pitre wrote:
On Thu, 16 May 2013, Liviu Dudau wrote:
From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.
That's all good.
My worry is about the definition of all the different C-states on all the different platforms. I think it is simpler to have the kernel tell the firmware what it anticipates in terms of load/quiescence periods (e.g. the next interrupt is likely to happen in x millisecs), and let the firmware and/or low-level machine specific backend translate that into the appropriate C-state on its own. After all, the firmware is supposed to know what is the best C-state to apply given a target latency and the current state of the surrounding CPUs, which may also differ depending on the cluster type, etc.
Regarding the migration of the guest kernels, it should be transparent (to a certain extent) whether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding of what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Let's take an example: one core in the cluster says that it can go as deep as cluster shutdown, but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shut down, has to do a lot of cache maintenance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then subtract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintenance operations before asking the hypervisor for a target residency, and the host still has to do the math again to validate that the guest request is still valid.
I don't follow your reasoning. Why would the guest have to care about what the host can do at all and in what amount of time? What the guest should tell the host is this: "I don't anticipate any need for the CPU during the next 500 ms so take this as a hint to perform the most efficient power saving given the constraints you alone have the knowledge of." The host should know how long it takes to flush its cache, whether or not that cache is in use by other guests, etc. But the guest should not care.
Exactly my position. What I was doing was to show in an example what the use of target residency actually means (i.e. the guest sends a hint, but the host cannot use that hint straight away, as it first needs to calculate the common longest amount of time that all the cores in the cluster can sleep, do the maintenance operations and then recalculate the remaining time again in order to validate that it can still go to that C-state). I was not implying that the guest has to know the cost of the host maintenance operations (other than that it is probably built into the target residency values that the guest uses).
In my (possibly simplistic) view of the world, by specifying the target C-state as a state and not as a time interval the guest frees the host to do various operations that it doesn't care about (or it cannot know about). If as a guest I tell the hypervisor that I can survive cluster shutdown then the infrastructure can then migrate me to a machine in a different corner of the world where I'm going to be warm booted from the snapshot taken at cluster shutdown.
Remember that time can be virtualised as well, so although the scheduler tells me that I need to wakeup in 500ms, that doesn't mean wall clock time. The host can lie about the current time when the guest wakes up.
And in this case the math performed by the guest and the host are completely different.
If we choose to use the target C-state, the request validation is simplified to a comparison between each core's target C-state and the lowest common C-state per cluster, all done in the host.
Of course, by describing C-states in terms of target residency times both schemes can be considered equivalent. But that target residency time is not constant for all code paths and for all conditions and that makes the decision process more complicated.
For who?
Sorry, I meant to say: "But the maintenance cost is not constant for all code paths .... "
If the guest is responsible for choosing a C-state itself and pass it on to the host, it has to process through a set of available C-states and select the proper one according to the target residency time it must compute anyway since this is all the scheduler can tell you. And since those C-states are likely to have different latency profiles on different clusters, the guest will have to query the type of host it is running on or the available C-states each time it wants to select one, etc. So I don't think passing the target residency directly to the host is more complicated when you look at the big picture.
I agree. But I was trying to keep the host code small and dumb and let the guest code do all the calculations and the translation between target residency and C-states.
One other piece of information that target residency time does not convey is the restrictions regarding migration. If as a guest I can sleep for 500ms because the tape device that I'm reading data from is slow and the reading hardware has a buffer big enough that it is not going to wake me up for a while, that doesn't mean that the whole cluster can be shut down and migrated. Yet the same 500ms target residency will be used to signal that a guest is idle and does nothing, so it can be shut down together with all the idle cores in the cluster. So how is the host going to know?
Best regards, Liviu
Nicolas
Hi Nico,
On 16 May 2013 19:21, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Thu, 16 May 2013, Liviu Dudau wrote:
From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.
That's all good.
My worry is about the definition of all the different C-state on all the different platforms. I think it is simpler to have the kernel tell the firmware what it anticipates in terms of load/quiescence periods (e.g. the next interrupt is likely to happen in x millisecs), and let the firmware and/or low-level machine specific backend translate that into the appropriate C-state on its own. After all, the firmware is supposed to know what is the best C-state to apply given a target latency and the current state of the surrounding CPUs, which may also differ depending on the cluster type, etc.
While I'm for abstracting platform details behind firmware like PSCI (especially when SMCs are required), I would rather keep the firmware simple (i.e. not too much cleverness). I think the cpuidle framework in Linux is a better place for deciding which C-state it can target and we only need a way to describe the states/latencies in the DT.
Regarding the migration of the guest kernels, it should be transparent (to a certain extent) whether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding of what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Let's take an example: one core in the cluster says that it can go as deep as cluster shutdown, but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shut down, has to do a lot of cache maintenance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then subtract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintenance operations before asking the hypervisor for a target residency, and the host still has to do the math again to validate that the guest request is still valid.
I don't follow your reasoning. Why would the guest have to care about what the host can do at all and in what amount of time? What the guest should tell the host is this: "I don't anticipate any need for the CPU during the next 500 ms so take this as a hint to perform the most efficient power saving given the constraints you alone have the knowledge of." The host should know how long it takes to flush its cache, whether or not that cache is in use by other guests, etc. But the guest should not care.
I agree, the guest shouldn't know about the host C-states. The guest most likely will be provided with virtual C-states/latencies and the host could make a decision based the C-state of the guests.
-- Catalin
On Wed, May 15, 2013 at 07:05:10PM +0100, Sebastian Capella wrote:
Hi Liviu,
Regarding your comments about using the C-state instead of the residency, we based this on the existing mcpm_cpu_suspend() call, which currently takes a residency (with 0 meaning lowest power).
We also use these calls (including mcpm_cpu_suspend) in the hot plug/suspend path; however, that path does not know about C-states. I suspect others may want to do the same. Do you know how suspend is done on TC2?
Regarding guest kernels, I don't think I understand the implications. If we migrate between cores (having different parameters) in the middle of a C-state transition, can we have correct behavior? Wouldn't it be worse to migrate to a lower C-state than we had intended?
Thanks,
Sebastian
On 15 May 2013 10:07, Jon Medhurst (Tixy) <tixy@linaro.orgmailto:tixy@linaro.org> wrote: On Wed, 2013-05-15 at 09:49 -0700, Sebastian Capella wrote:
Thanks Daniel!
Liviu,
I have been using on the linux-linaro branch in the linux-linaro-tracking repository here:
https://git.linaro.org/gitweb?p=kernel/linux-linaro-tracking.git%3Ba=shortlo...
Generally, that's the Linaro kernel tree people should use and what is built daily and released monthly.
It's just it hasn't moved to 3.10 yet (will do in the next day or so) but the topic branches which feed into it (that Liviu pointed out) have already made that move.
-- Tixy
Hi Sebastian,
From previous discussions between Achin, Charles and Nico I am aware that Nico has decided for the moment that target residency should be useful enough to be used by MCPM. That is because Nico is a big proponent of doing everything in the kernel and keeping the firmware dumb and (mostly) out of the way. However, the view that we have here at ARM (but I will only speak in my name here) is that in order to have alignment with AArch64 kernel and the way it is using PSCI interface, we should be moving the kernel on AArch32 and armv7a to run in non-secure mode. At that time, the kernel will make PSCI calls to do CPU_ON, CPU_SUSPEND, etc. and the aim is to provide to the firmware the deepest C-state that the core can support going to without being woken up to do any additional state management. It is then the latitude of the firmware to put the core in that state or to tally the sum of all requests in a cluster and decide to put the cores and the cluster in the lowest common C-state.
Regarding the migration of the guest kernels, it should be transparent (to a certain extent) whether on resume it is running on the same core or it has been migrated. The host OS should have a better understanding of what can be achieved and what invariants it can still hold, but it should not be limited to do that in a specific amount of time. Let's take an example: one core in the cluster says that it can go as deep as cluster shutdown, but it does so in your use of the API by saying that it would like to sleep for at least amount X of time. The host however has to tally all the cores in the cluster in order to decide if the cluster can be shut down, has to do a lot of cache maintenance and state saving, turning off clocks and devices etc, and in doing so is going to consume some compute cycles; it will then subtract the time spent making a decision and doing the cleanup and then figure out if there is still time left for each of the cores to go to sleep for the specified amount of time. All this implies that the guest has to have an understanding of the time the host is spending in doing maintenance operations before asking the hypervisor for a target residency, and the host still has to do the math again to validate that the guest request is still valid.
If we choose to use the target C-state, the request validation is simplified to a comparison between each core's target C-state and the lowest common C-state per cluster, all done in the host.
Of course, by describing C-states in terms of target residency times both schemes can be considered equivalent. But that target residency time is not constant for all code paths and for all conditions and that makes the decision process more complicated.
Hope that provides some clarification.
Best regards, Liviu
Hi Nico, Liviu, Catalin,
Do you expect there to also be cases where the PSCI interface may not be aware of all of the platform states?
E.g. if you have an SoC, not all of the C-states and latencies are directly related to the ARM core. Maybe you can have additional states and latencies accounting for the cost of enabling external power supplies, restoring state for non-retained peripherals/hw, etc.
Currently, this type of thing can be specified in cpuidle with additional C-states and handled in vendor-specific SW, with the additional C-states being selected when the target residency/latency requirements are least restrictive.
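For illustration only: an extra, deeper vendor state appended to a cpuidle state table, following the layout of the existing arm_big_little driver. The numbers and the example_enter_soc_off() callback are invented.

/* hypothetical vendor-specific entry function, not shown */
static int example_enter_soc_off(struct cpuidle_device *dev,
				 struct cpuidle_driver *drv, int idx);

static struct cpuidle_driver example_idle_driver = {
	.name = "example_idle",
	.owner = THIS_MODULE,
	.states[0] = ARM_CPUIDLE_WFI_STATE,
	.states[1] = {
		.enter			= bl_enter_powerdown,
		.exit_latency		= 300,
		.target_residency	= 1000,
		.flags			= CPUIDLE_FLAG_TIME_VALID |
					  CPUIDLE_FLAG_TIMER_STOP,
		.name			= "C1",
		.desc			= "ARM power down",
	},
	.states[2] = {
		.enter			= example_enter_soc_off,
		.exit_latency		= 5000,		/* us, made up */
		.target_residency	= 50000,	/* us, made up */
		.flags			= CPUIDLE_FLAG_TIME_VALID |
					  CPUIDLE_FLAG_TIMER_STOP,
		.name			= "C2",
		.desc			= "SoC and external supplies off",
	},
	.state_count = 3,
};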
How would these states be handled considering also host os costs?
Thanks,
Sebastian
On Tue, May 21, 2013 at 01:39:38AM +0100, Sebastian Capella wrote:
Hi Nico, Liviu, Catalin,
Hi Sebastian,
Do you expect there to also be cases where the PSCI interface may not be aware of all of the platform states?
Not sure why the PSCI interface would have a say here. It's only an interface between two pieces of code, it should not have state awareness. Which side of the interface are you actually thinking of?
E.g. if you have an SoC, not all of the C-states and latencies are directly related to the ARM core. Maybe you can have additional states and latencies accounting for the cost of enabling external power supplies, restoring state for non-retained peripherals/hw, etc.
I don't think there is any C-state other than simple idle (which translates into a WFI for the core) that *doesn't* take into account power domain latencies and code path lengths to reach that state. I don't see the usefulness of describing the latency of going into the CPU_OFF state without including all the steps to reach the state and come back out of it.
Are you thinking of C-states that do not belong to the compute domain but might still be part of the SoC? (things like an System MMU, or a DMA engine, etc)
Currently, this type of thing can be specified in cpuidle with additional C-states and handled in vendor-specific SW, with the additional C-states being selected when the target residency/latency requirements are least restrictive.
How would these states be handled considering also host os costs?
I don't know how to draw the line between the host OS costs and the guest OS costs when using target latencies. On one hand I think that the host OS should add its own costs into what gets passed to the guest and the guest will see a slower than baremetal system in terms of state transitions; on the other hand I would like to see the guest OS shielded from this type of information as there are too many variables behind it (is the host OS also under some monitor code? are all transitions to the same state happening in constant time or are they dependent of number of cores involved, their state, etc, etc)
If one uses a simple set of C-states (CPU_ON, CPU_IDLE, CPU_OFF, CLUSTER_OFF, SYSTEM_OFF) then the guest could make requests independent of the host OS latencies _after the relevant translations between time-to-next-event and intended target C-state have been performed_ .
Hope this helps, Liviu
Thanks,
Sebastian
Thanks Liviu!
Some comments below..
Quoting Liviu Dudau (2013-05-21 10:15:42)
... Which side of the interface are you actually thinking of?
Both, I'm really just trying to understand the problem.
I don't think there is any C-state other than simple idle (which translates into a WFI for the core) that *doesn't* take into account power domain latencies and code path lengths to reach that state.
I'm speaking more about additional c-states after the lowest independent compute domain cstate, where we may add additional cstates which reduce the power further at a higher latency cost. These may be changing power states for the rest of the SOC or external power chips/supplies. Those states would effectively enter the lowest PSCI C-state, but then have additional steps in the CPUIdle hw specific driver.
I don't know how to draw the line between the host OS costs and the guest OS costs when using target latencies. On one hand I think that the host OS should add its own costs into what gets passed to the guest and the guest will see a slower than baremetal system in terms of state transitions;
I was thinking maybe this also.. Is there a way to query the state transition cost information through PSCI? Would there be a way to have the layers of hosts/monitors/etc contribute the cost of their paths into the query results?
... on the other hand I would like to see the guest OS shielded from this type of information as there are too many variables behind it (is the host OS also under some monitor code? are all transitions to the same state happening in constant time or are they dependent of number of cores involved, their state, etc, etc)
I agree, but don't see how. In our systems, we do very much care about the costs, and have ~real time constraints to manage. I think we need a good understanding of costs for the hw states.
If one uses a simple set of C-states (CPU_ON, CPU_IDLE, CPU_OFF, CLUSTER_OFF, SYSTEM_OFF) then the guest could make requests independent of the host OS latencies _after the relevant translations between time-to-next-event and intended target C-state have been performed_.
I think that if we don't know the real cost of entering a state, we basically will end up choosing the wrong states on many occasions.
CPUIdle is already binning the allowable costs into a specific state. If we decide that CPUIdle does not know the real cost of the states then the binning will be wrong sometimes, and cpuidle would not be selecting the correct states. I think this could have bad side effects for real time systems.
For my purposes and as things are today, I'd likely factor in the (probably pre-known & measured) host os/monitor costs into the cpuidle DT entries and have cpuidle run the show. At the lower layers, it won't matter what is passed through as long as the correct state is chosen.
Thanks,
Sebastian
On Tue, May 21, 2013 at 10:08:29PM +0100, Sebastian Capella wrote:
Thanks Liviu!
Some comments below..
Quoting Liviu Dudau (2013-05-21 10:15:42)
... Which side of the interface are you actually thinking of?
Both, I'm really just trying to understand the problem.
I don't think there is any C-state other than simple idle (which translates into a WFI for the core) that *doesn't* take into account power domain latencies and code path lengths to reach that state.
I'm speaking more about additional c-states after the lowest independent compute domain cstate, where we may add additional cstates which reduce the power further at a higher latency cost. These may be changing power states for the rest of the SOC or external power chips/supplies. Those states would effectively enter the lowest PSCI C-state, but then have additional steps in the CPUIdle hw specific driver.
Quoting from the PSCI spec:
"ARM systems generally include a power controller which provides the necessary mechanisms to control processor power. It normally provides interfaces to allow a number of power management functions. These often include support for transitioning processors, clusters or a superset, into low power states, where the processors are either fully switched off, or in quiescent states where they are not executing code. ARM strongly recommends that control of these states, via this power controller, is vested in the secure world. Otherwise, the OSPM could enter a low power mode without informing the Trusted OS. Even if such an arrangement could be made robust, it is unlikely to perform as well. In particular, for states where the core is fully power gated, a longer boot sequence would take place upon wake up as full initialization would be required by the secure world. This would be required as the secure components would effectively be booting from scratch every time. On a system where this power control is vested in the Secure world, these components would have an opportunity to save their state before powering off, allowing a faster resumption on power up. In addition, the secure world might need to manage peripherals as part of a power transition."
If you don't have such a power controller in your system then yes, you will have to drive the hardware from the CPUidle hw driver. But I don't see the need of a separate C-state for that.
I would say that the list of C-states that I have listed further down should cover most of the cases, maybe with the addition of an SYSTEM_SUSPEND state if I understood your concerns correctly.
Going on a tangent a bit:
To me, the C-states are like layers in an onion. Each deeper C-state includes the previous C-states that came in the list earlier. Therefore, you describe the C-state in terms of minimum total time to spend in that state and it includes the worst transition times (cost of reaching that state and to come out of it). Completely made up example:
CPU_ON          < 2ms
CPU_IDLE        > 2ms
CPU_OFF         > 10ms
CLUSTER_OFF     > 500ms
SYSTEM_SUSPEND  > 5min
SYSTEM_OFF      > 1h
If you do that then the CPUidle driver decision becomes as simple as finding the right state that would not lead to a missed event and you don't really have to understand the costs of the host OS (if there is any). It should match the expectations of a real time system as well, if the table is correctly fine tuned (and if one understands that a real time system is about constant time response, not immediate response).
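A minimal sketch of that selection, reusing the made-up table above (thresholds converted to microseconds; names and values are illustrative only):

#include <linux/kernel.h>	/* ARRAY_SIZE */
#include <linux/types.h>

static const struct {
	const char *name;
	u64 min_residency_us;
} example_states[] = {
	{ "CPU_ON",		0 },
	{ "CPU_IDLE",		2000 },
	{ "CPU_OFF",		10000 },
	{ "CLUSTER_OFF",	500000 },
	{ "SYSTEM_SUSPEND",	300000000ULL },		/* 5 min */
	{ "SYSTEM_OFF",		3600000000ULL },	/* 1 h */
};

/* Pick the deepest state whose minimum residency still fits the idle time. */
static int example_select_state(u64 predicted_idle_us)
{
	int i, best = 0;

	for (i = 0; i < ARRAY_SIZE(example_states); i++)
		if (example_states[i].min_residency_us <= predicted_idle_us)
			best = i;	/* deeper states come later in the table */

	return best;
}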
I don't know how to draw the line between the host OS costs and the guest OS costs when using target latencies. On one hand I think that the host OS should add its own costs into what gets passed to the guest and the guest will see a slower than baremetal system in terms of state transitions;
I was thinking maybe this also.. Is there a way to query the state transition cost information through PSCI? Would there be a way to have the layers of hosts/monitors/etc contribute the cost of their paths into the query results?
Possibly. PSCI spec doesn't specify any API for querying the C-state costs because the way to do so is still in the air. We know that the server world would like to carry on using ACPI for describing those states, device tree-based systems would probably invent a different way or learn how to integrate with ACPI.
... on the other hand I would like to see the guest OS shielded from this type of information as there are too many variables behind it (is the host OS also under some monitor code? are all transitions to the same state happening in constant time or are they dependent of number of cores involved, their state, etc, etc)
I agree, but don't see how. In our systems, we do very much care about the costs, and have ~real time constraints to manage. I think we need a good understanding of costs for the hw states.
And are those costs constant? Do you depend on how many CPUs you have online to determine how long it will take to do a cluster shutdown? Does having the DMA engine on add to the quiescence time? While I don't doubt that you understand what are the minimum time constraints that the hardware imposes, it's the combination of all the elements in the system that is under software control that gives the final answer and in most cases it is "depends".
If one uses a simple set of C-states (CPU_ON, CPU_IDLE, CPU_OFF, CLUSTER_OFF, SYSTEM_OFF) then the guest could make requests independent of the host OS latencies _after the relevant translations between time-to-next-event and intended target C-state have been performed_.
I think that if we don't know the real cost of entering a state, we basically will end up choosing the wrong states on many occasions.
True. But that "real" cost is usually an estimate of the worst case, or an average time, right?
CPUIdle is already binning the allowable costs into a specific state. If we decide that CPUIdle does not know the real cost of the states then the binning will be wrong sometimes, and cpuidle would not be selecting the correct states. I think this could have bad side effects for real time systems.
CPUidle does know the costs. The "reality" of those costs depends on the system you are running (virtualised or not, trusted OS trapping your calls or not). If the costs do not reflect the actual transition time then yes, CPUidle will make the wrong decision and the system won't work as intended. I'm not advocating doing that.
Also, I don't understand your remark regarding real time systems. If the CPUidle costs are wrong the decision will be wrong regardless of the type of system you use. Or are you concerned that being too conservative and lying to the OS about the actual cost for the system to transition to the new state at that moment will introduce unnecessary delays and forgo the real-time functionality?
For my purposes and as things are today, I'd likely factor in the (probably pre-known & measured) host os/monitor costs into the cpuidle DT entries and have cpuidle run the show. At the lower layers, it won't matter what is passed through as long as the correct state is chosen.
Understood. I'm advocating the same thing with the only added caveat that the state you choose is not a physical system state in all cases, but a state that makes sense for the OS running at that level. As such, the numbers that will be used by CPUidle will be in the "ballpark" region rather than absolute numbers.
Any running OS should only be concerned with getting the time to the next event right (be it real time constrained or not) and finding out which C-state will guarantee availability at that time. If one doesn't know when the next event will come then being conservative should be good enough. There is no way you will have a ~real time system if you transition to cluster off and the real cost of coming out is measured in milliseconds, regardless of how you came to that decision.
Best regards, Liviu
Thanks,
Sebastian
On Tue, May 21, 2013 at 02:08:29PM -0700, Sebastian Capella wrote:
Thanks Liviu!
Some comments below..
Quoting Liviu Dudau (2013-05-21 10:15:42)
... Which side of the interface are you actually thinking of?
Both, I'm really just trying to understand the problem.
I don't think there is any C-state other than simple idle (which translates into a WFI for the core) that *doesn't* take into account power domain latencies and code path lengths to reach that state.
I'm speaking more about additional c-states after the lowest independent compute domain cstate, where we may add additional cstates which reduce the power further at a higher latency cost. These may be changing power states for the rest of the SOC or external power chips/supplies. Those states would effectively enter the lowest PSCI C-state, but then have additional steps in the CPUIdle hw specific driver.
I don't know how to draw the line between the host OS costs and the guest OS costs when using target latencies. On one hand I think that the host OS should add its own costs into what gets passed to the guest and the guest will see a slower than baremetal system in terms of state transitions;
I was thinking maybe this also.. Is there a way to query the state transition cost information through PSCI? Would there be a way to have the layers of hosts/monitors/etc contribute the cost of their paths into the query results?
Currently not. This partly depends on whether the target residency is supposed to be a hint about the rough order of magnitude of the expected idle period, or whether it's supposed to be a strict contract.
In effect, I think it's a hint which steers the choice of powerdown state, rather than something with a strong real-time guarantee attached to it. In that case shaving the firmware latency off this value before using it may not be worth it. If the specified target residency is small enough that this makes a significant difference, this suggests a very short period of actual powerdown, which may not outweigh its own overheads in terms of power-saving.
That's just my view -- others may disagree
Cheers ---Dave
Quoting Dave Martin (2013-05-22 11:22:36)
Currently not. This partly depends on whether the target residency is supposed to be a hint about the rough order of magnitude of the expected idle period, or whether it's supposed to be a strict contract.
In effect, I think it's a hint which steers the choice of powerdown state, rather than something with a strong real-time guarantee attached to it. In that case shaving the firmware latency off this value before using it may not be worth it. If the specified target residency is small enough that this makes a significant difference, this suggests a very short period of actual powerdown, which may not outweigh its own overheads in terms of power-saving.
Thanks Dave, Liviu,
Sorry, you've caught me mixing terms and concepts.
I agree, target residency to me also is more an estimate of the cost vs. benefit for a state.
The cstates also define a latency parameter that is used for limiting selection of certain states by the governor. This is affected by QoS constraints, which we use a lot in embedded. This is the one needed for real-time use that is tricky with the host OS's additional latency.
Both latency and target residency would need some adjustment for embedded mobile if we have additional overhead, as it becomes very important to squeeze this as much as possible. For latency, microseconds count, as we cannot allow a C-state which will fail to meet our QoS constraints.
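For reference, this is the usual way a driver expresses such a constraint so the governor will not pick a state whose exit latency is too high; a rough sketch only, and the 100us figure is made up:

#include <linux/pm_qos.h>

static struct pm_qos_request example_qos_req;

static void example_start_latency_critical_work(void)
{
	/* governor must not pick a C-state with exit latency above 100 us */
	pm_qos_add_request(&example_qos_req, PM_QOS_CPU_DMA_LATENCY, 100);
}

static void example_end_latency_critical_work(void)
{
	pm_qos_remove_request(&example_qos_req);
}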
Thanks,
Sebastian
Hi Guys,
Sorry, I think the conversation went pretty far from my patch.
Concerning my original patch, do you have any more ideas or concerns?
I'm not sure I have a clear idea what, if anything, needs to be changed.
I was able to verify it on the TC2 platform without issue.
Thanks again for all of your time.
Sebastian
On 22 May 2013 11:51, Sebastian Capella sebastian.capella@linaro.orgwrote:
Quoting Dave Martin (2013-05-22 11:22:36)
Currently not. This partly depends on whether the target residency is supposed to be a hint about the rough order of magnitude of the expected idle period, or whether it's supposed to be a strict contract.
In effect, I think it's a hint which steers the choice of powerdown state, rather than something with a strong real-time guarantee attached to it. In that case shaving the firmware latency off this value before using it may not be worth it. If the specified target residency is small enough that this makes a significant difference, this suggests a very short period of actual powerdown, which may not outweigh its own overheads in terms of power-saving.
Thanks Dave, Liviu,
Sorry, you've caught me mixing terms and concepts.
I agree, target residency to me also is more an estimate of the cost vs. benefit for a state.
The cstates also define a latency parameter that is used for limiting selection of certain states by the governor. This is affected by QoS constraints, which we use a lot in embedded. This is the one needed for real-time use that is tricky with the host OS's additional latency.
Both latency and target residency would need some adjustment for embedded mobile if we have additional overhead, as it becomes very important to squeeze this as much as possible. For latency, microseconds count, as we cannot allow a C-state which will fail to meet our QoS constraints.
Thanks,
Sebastian
Hi,
I haven't heard back from anyone regarding my last request. If there are no objections, I'll go ahead and publish this patch to LKML and LAKML.
Thanks,
Sebastian
On 29 May 2013 07:37, Sebastian Capella sebastian.capella@linaro.orgwrote:
Hi Guys,
Sorry, I think the conversation went pretty far from my patch.
Concerning my original patch, do you have any more ideas or concerns?
I'm not sure I have a clear idea what, if anything, needs to be changed.
I was able to verify it on the TC2 platform without issue.
Thanks again for all of your time.
Sebastian
On 22 May 2013 11:51, Sebastian Capella sebastian.capella@linaro.orgwrote:
Quoting Dave Martin (2013-05-22 11:22:36)
Currently not. This partly depends on whether the target residency is supposed to be a hint about the rough order of magnitude of the expected idle period, or whether it's supposed to be a strict contract.
In effect, I think it's a hint which steers the choice of powerdown state, rather than something with a strong real-time guarantee attached to it. In that case shaving the firmware latency off this value before using it may not be worth it. If the specified target residency is small enough that this makes a significant difference, this suggests a very short period of actual powerdown, which may not outweigh its own overheads in terms of power-saving.
Thanks Dave, Liviu,
Sorry, you've caught me mixing terms and concepts.
I agree, target residency to me also is more an estimate of the cost vs. benefit for a state.
The cstates also define a latency parameter that is used for limiting selection of certain states by the governor. This is affected by QoS constraints, which we use a lot in embedded. This is the one needed for real-time use that is tricky with the host OS's additional latency.
Both latency and target residency would need some adjustment for embedded mobile if we have additional overhead, as it becomes very important to squeeze this as much as possible. For latency, microseconds count, as we cannot allow a C-state which will fail to meet our QoS constraints.
Thanks,
Sebastian
On Mon, Jun 10, 2013 at 06:30:25PM +0100, Sebastian Capella wrote:
Hi,
I haven't heard back from anyone regarding my last request. If there are no objections, I'll go ahead and publish this patch to LKML and LAKML.
No objections from me, Sebastian. When you have time and the inclination to do so, I would like to see a more detailed explanation of what your QoS constraints are on embedded mobile, for my personal enlightenment.
Best regards, Liviu
Thanks,
Sebastian
On 29 May 2013 07:37, Sebastian Capella <sebastian.capella@linaro.orgmailto:sebastian.capella@linaro.org> wrote: Hi Guys,
Sorry, I think the conversation went pretty far from my patch.
Concerning my original patch, do you have any more ideas or concerns?
I'm not sure I have a clear idea what, if anything, needs to be changed.
I was able to verify it on the TC2 platform without issue.
Thanks again for all of your time.
Sebastian
On 22 May 2013 11:51, Sebastian Capella <sebastian.capella@linaro.orgmailto:sebastian.capella@linaro.org> wrote: Quoting Dave Martin (2013-05-22 11:22:36)
Currently not. This partly depends on whether the target residency is supposed to be a hint about the rough order of magnitude of the expected idle period, or whether it's supposed to be a strict contract.
In effect, I think it's a hint which steers the choice of powerdown state, rather than something with a strong real-time guarantee attached to it. In that case shaving the firmware latency off this value before using it may not be worth it. If the specified target residency is small enough that this makes a significant difference, this suggests a very short period of actual powerdown, which may not outweigh its own overheads in terms of power-saving.
Thanks Dave, Liviu,
Sorry, you've caught me mixing terms and concepts.
I agree, target residency to me also is more an estimate of the cost vs. benefit for a state.
The cstates also define a latency parameter that is used for limiting selection of certain states by the governor. This is affected by QoS constraints, which we use a lot in embedded. This is the one needed for real-time use that is tricky with the host OS's additional latency.
Both latency and target residency would need some adjustment for embedded mobile if we have additional overhead, as it becomes very important to squeeze this as much as possible. For latency, microseconds count, as we cannot allow a C-state which will fail to meet our QoS constraints.
Thanks,
Sebastian