Hi Nico & all,
After studying the IKS code, we believe the code is general and clean and can almost fully meet our own SoC's requirements; we also have some questions we would like to confirm with you:
1. When the outbound core wakes up the inbound core, the outbound core's thread will sleep until the inbound core uses MCPM's early poke to send an IPI.
a) It looks like this method is there because the TC2 board has long latencies for powering the cluster and cores on/off; is that right? How about using a polling method instead? On our own SoC the wake-up interval takes _only_ about 10 ~ 20us.
b) The inbound core will send an IPI to the outbound core for synchronization, but at this point the inbound core's GIC CPU interface is disabled; so even with its CPU interface disabled, can a core still send SGIs to other cores?
c) The MCPM patch set merged into mainline has no functions related to the early poke; will the early poke related functions be submitted to mainline later?
2. Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
I browsed the git log and learned that switching was initially synchronized using the kernel's workqueue and was later changed to use a dedicated FIFO kernel thread; do you think it is better to go ahead and add a synchronous method for switching?
3. After the switcher is enabled, hotplug is disabled.
Actually, the current code could support hotplug with IKS: with IKS, each logical core maps to its corresponding physical core ID and GIC interface ID, so the system can track which physical core has been hot-unplugged and later hotplug that same physical core back in. So could you give more hints on why IKS needs to disable hotplug?
Hi there,
On Wed, May 08, 2013 at 11:17:49AM +0800, Leo Yan wrote:
Hi Nico & all,
After studying the IKS code, we believe the code is general and clean and can almost fully meet our own SoC's requirements; we also have some questions we would like to confirm with you:
- When the outbound core wakes up the inbound core, the outbound core's thread will sleep until the inbound core uses MCPM's early poke to send an IPI.
a) It looks like this method is there because the TC2 board has long latencies for powering the cluster and cores on/off; is that right? How about using a polling method instead? On our own SoC the wake-up interval takes _only_ about 10 ~ 20us.
This is correct, TC2 has much longer latencies, especially if a whole cluster needs to be powered up.
b) The inbound core will send an IPI to the outbound core for synchronization, but at this point the inbound core's GIC CPU interface is disabled; so even with its CPU interface disabled, can a core still send SGIs to other cores?
SGIs are triggered by writing to the GIC Distributor. I believe that doesn't require the triggering CPU's CPU interface to be enabled.
The destination CPU's CPU interface needs to be enabled in order for the interrupt to be received, though.
Nico can confirm whether this is correct.
I believe this does change for GICv3 though -- it may be necessary to fire up the CPU interface before an SGI can be sent in that case. This isn't an issue for v7 based platforms, but may need addressing in the future.
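To make the distinction concrete, here is a minimal sketch (GICv2 only; dist_base is assumed to be the ioremapped Distributor base, and the helper name is made up) showing that raising an SGI is purely a Distributor write, so only the receiving CPU's interface state matters:

#include <linux/io.h>

#define GICD_SGIR	0xf00	/* GICv2 Software Generated Interrupt Register */

static void __iomem *dist_base;	/* assumed: ioremapped GIC Distributor base */

/* Raise SGI <sgi_id> (0-15) on the CPUs named in target_mask (bits [7:0]). */
static void sketch_raise_sgi(unsigned int target_mask, unsigned int sgi_id)
{
	/*
	 * TargetListFilter = 0b00 (use CPUTargetList), CPUTargetList in
	 * bits [23:16], SGIINTID in bits [3:0].  Nothing here touches the
	 * sender's CPU interface; delivery depends on the receiver's.
	 */
	writel_relaxed(((target_mask & 0xff) << 16) | (sgi_id & 0xf),
		       dist_base + GICD_SGIR);
}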
c) The MCPM patch set merged into mainline has no functions related to the early poke; will the early poke related functions be submitted to mainline later?
I'll let Nico comment on that one.
From my side I don't see a strong technical reason why not.
- Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
I did write some patches to solve that, by providing a way for the caller to be notified of completion, while keeping the underlying mechanism asynchronous.
Nico: Can you remember whether you had any concerns about that functionality? See "ARM: bL_switcher: Add switch completion callback", posted to you on Mar 15.
I hadn't pushed for merging those at the time because I was hoping to do more testing on them, but I was diverted by other activities.
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
I browsed the git log and learned that switching was initially synchronized using the kernel's workqueue and was later changed to use a dedicated FIFO kernel thread; do you think it is better to go ahead and add a synchronous method for switching?
Migrating from one cluster to the other has intrinsic power and time costs, caused by the time and effort needed to power CPUs up and down and migrate software and cache state across. This is more than just the time to power up a CPU.
This means that there is a limit on how rapidly it is worth switching between clusters before it leads to a net loss in terms of power and/or performance. Over shorter timescales, fine-grained CPU idling may provide better overall results.
My general expectation is that at reasonable switching speeds, the extra overhead of the asynchronous CPU power-on compared with a synchronous approach may not have a big impact on overall system behaviour.
However, it would certainly be interesting to measure these effects on a faster platform. TC2 is the only hardware we had direct access to for our development work, and on that hardware the asynchronous power-up is a definite advantage.
- After the switcher is enabled, hotplug is disabled.
Actually, the current code could support hotplug with IKS: with IKS, each logical core maps to its corresponding physical core ID and GIC interface ID, so the system can track which physical core has been hot-unplugged and later hotplug that same physical core back in. So could you give more hints on why IKS needs to disable hotplug?
Hotplug is possible to implement, but adds some complexity. The combination of IKS and CPUidle should allow all CPUs to be powered down automatically when not needed, so enabling hotplug may not be a huge extra benefit, but that doesn't mean that functionality could not be added in the future.
Cheers --Dave
On Wed, 8 May 2013, Dave Martin wrote:
On Wed, May 08, 2013 at 11:17:49AM +0800, Leo Yan wrote:
- Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
I did write some patches to solve that, by providing a way for the caller to be notified of completion, while keeping the underlying mechanism asynchronous.
There are two things to distinguish here.
First, it is true that bL_switch_request() is currently posting switch requests and not waiting for them to complete.
Nico: Can you remember whether you had any concerns about that functionality? See "ARM: bL_switcher: Add switch completion callback", posted to you on Mar 15.
I had concerns which I sent on March 13, and as far as I can see they were all addressed in the fixup patch you posted on March 15.
However ...
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
That's the second point. When bL_switch_to() returns, the switch _is_ complete. However the outbound CPU may still be alive doing some processing of its own. It may purposely stay alive for a while to allow the inbound CPU to snoop its cache. Or it could be used to pre-zero free memory pages or whatever. It may be doing extra cleanup work because it was elected as the last man in its cluster. Etc.
So, as I explained in my previous reply, the fact that the switch is over does not mean that the outbound CPU is no longer running. Therefore the switch done notification must not be used to hook voltage changes, as only the MCPM backend really knows if and when all CPUs in a cluster are really down.
Nicolas
On Wed, May 08, 2013 at 02:07:26PM -0400, Nicolas Pitre wrote:
On Wed, 8 May 2013, Dave Martin wrote:
On Wed, May 08, 2013 at 11:17:49AM +0800, Leo Yan wrote:
- Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
I did write some patches to solve that, by providing a way for the caller to be notified of completion, while keeping the underlying mechanism asynchronous.
There are two things to distinguish here.
First, it is true that bL_switch_request() is currently posting switch requests and not waiting for them to complete.
Nico: Can you remember whether you had any concerns about that functionality? See "ARM: bL_switcher: Add switch completion callback", posted to you on Mar 15.
I had concerns which I sent on March 13, and as far as I can see they were all addressed in the fixup patch you posted on March 15.
However ...
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
That's the second point. When bL_switch_to() returns, the switch _is_ complete. However the outbound CPU may still be alive doing some processing of its own. It may purposely stay alive for a while to allow the inbound CPU to snoop its cache. Or it could be used to pre-zero free memory pages or whatever. It may be doing extra cleanup work because it was elected as the last man in its cluster. Etc.
So, as I explained in my previous reply, the fact that the switch is over does not mean that the outbound CPU is no longer running. Therefore the switch done notification must not be used to hook voltage changes, as only the MCPM backend really knows if and when all CPUs in a cluster are really down.
Note that if we really need to be certain that the CPU is really down, MCPM is not enough either, because that only observes various levels of committing to powerdown. Ultimately, only the SoC hardware knows when a true low-power state is reached.
If you want to cut the voltage to a level which would be unsafe before true powerdown is reached, this needs to be coordinated by external means. On some SoCs, the power controller may be capable of doing this itself -- I believe this is the case on TC2. If instead software has direct control, it would be necessary to mask (or otherwise prevent) wakeups on the affected CPU and wait until the power controller tells you it is down before you start changing the voltage.
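Purely as an illustration of that sequencing, a rough sketch follows; the power controller register, status bit and regulator handle are hypothetical placeholders for whatever the SoC actually provides:

#include <linux/delay.h>
#include <linux/io.h>
#include <linux/regulator/consumer.h>

#define PWRCTL_CPU_STATUS	0x10	/* hypothetical power controller register */
#define STATUS_CPU_OFF		0x1	/* hypothetical "CPU really off" bit      */

static void __iomem *pwrctl_base;	/* hypothetical, mapped elsewhere */
static struct regulator *cluster_reg;	/* hypothetical cluster supply    */

static int sketch_drop_cluster_voltage(int low_uV)
{
	/*
	 * Wakeups for the affected CPU must already be masked or otherwise
	 * prevented at this point (not shown).  Then wait for the power
	 * controller itself to report the CPU as off; neither MCPM state
	 * nor the switcher can guarantee that on its own.
	 */
	while (!(readl_relaxed(pwrctl_base + PWRCTL_CPU_STATUS) & STATUS_CPU_OFF))
		udelay(10);

	/* Only now is it safe to take the rail below run-time levels. */
	return regulator_set_voltage(cluster_reg, low_uV, low_uV);
}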
Cheers ---Dave
On 05/08/2013 09:31 PM, Dave Martin wrote:
Hi there,
b) The inbound core will send an IPI to the outbound core for synchronization, but at this point the inbound core's GIC CPU interface is disabled; so even with its CPU interface disabled, can a core still send SGIs to other cores?
SGIs are triggered by writing to the GIC Distributor. I believe that doesn't require the triggering CPU's CPU interface to be enabled.
The destination CPU's CPU interface needs to be enabled in order for the interrupt to be received, though.
Nico can confirm whether this is correct.
I believe this does change for GICv3 though -- it may be necessary to fire up the CPU interface before an SGI can be sent in that case. This isn't an issue for v7 based platforms, but may need addressing in the future.
We will use GICv2 and Nico has confirmed this, so there should be no issue here anymore.
- Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
I did write some patches to solve that, by providing a way for the caller to be notified of completion, while keeping the underlying mechanism asynchronous.
Nico: Can you remember whether you had any concerns about that functionality? See "ARM: bL_switcher: Add switch completion callback", posted to you on Mar 15.
I hadn't pushed for merging those at the time because I was hoping to do more testing on them, but I was diverted by other activities.
Can I get the related patches?
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
I browsed the git log and learned that switching was initially synchronized using the kernel's workqueue and was later changed to use a dedicated FIFO kernel thread; do you think it is better to go ahead and add a synchronous method for switching?
Migrating from one cluster to the other has intrinsic power and time costs, caused by the time and effort needed to power CPUs up and down and migrate software and cache state across. This is more than just the time to power up a CPU.
This means that there is a limit on how rapidly it is worth switching between clusters before it leads to a net loss in terms of power and/or performance. Over shorter timescales, fine-grained CPU idling may provide better overall results.
Yes, there are two latencies that contribute to the time cost: 1. h/w latency: the time for the power controller to power on the core; 2. s/w latency: the time to save and restore the contexts for VFP/NEON, the GIC, the generic timer and the CP15 registers.
More time cost also means more power cost, and the latency depends on the SoC's power controller implementation.
According to our profiling results on TC2, the h/w latency is long and in some situations takes about 1ms, while the s/w latency takes about 80us; so on TC2 a migration takes more than 1ms.
My general expectation is that at reasonable switching speeds, the extra overhead of the asynchronous CPU power-on compared with a synchronous approach may not have a big impact on overall system behaviour.
Here I can think of one benefit of the async approach: for example, a logical core is running on the A15 and there are two switching requests, the first from A15 to A7 and the second from A7 to A15; if the async operation detects that there is no change in the cluster ID, it can skip both requests.
Is this the reason you prefer the async approach? If so, maybe it is also because switching has a long latency on TC2, so the async method gives better performance?
However, it would certainly be interesting to measure these effects on a faster platform. TC2 is the only hardware we had direct access to for our development work, and on that hardware the asynchronous power-up is a definite advantage.
- After the switcher is enabled, hotplug is disabled.
Actually, the current code could support hotplug with IKS: with IKS, each logical core maps to its corresponding physical core ID and GIC interface ID, so the system can track which physical core has been hot-unplugged and later hotplug that same physical core back in. So could you give more hints on why IKS needs to disable hotplug?
Hotplug is possible to implement, but adds some complexity. The combination of IKS and CPUidle should allow all CPUs to be powered down automatically when not needed, so enabling hotplug may not be a huge extra benefit, but that doesn't mean that functionality could not be added in the future.
Yes, we want to enable this functionality. I will explain the reason in my reply to Nico's mail.
Cheers --Dave
On Wed, 8 May 2013, Leo Yan wrote:
Hi Nico & all,
After studying the IKS code, we believe the code is general and clean and can almost fully meet our own SoC's requirements; we also have some questions we would like to confirm with you:
Good. We're certainly looking forward to applying this code to other SOCs.
- When the outbound core wakes up the inbound core, the outbound core's thread will sleep until the inbound core uses MCPM's early poke to send an IPI.
a) It looks like this method is there because the TC2 board has long latencies for powering the cluster and cores on/off; is that right? How about using a polling method instead? On our own SoC the wake-up interval takes _only_ about 10 ~ 20us.
There is no need to poll anything. If your SOC is fast enough in all cases, then the outbound may simply go ahead and let the inbound resume with the saved context whenever it is ready.
b) The inbound core will send an IPI to the outbound core for synchronization, but at this point the inbound core's GIC CPU interface is disabled; so even with its CPU interface disabled, can a core still send SGIs to other cores?
It must, otherwise the switch as implemented would never complete.
c) The MCPM patch set merged into mainline has no functions related to the early poke; will the early poke related functions be submitted to mainline later?
The early poke mechanism is only needed by the switcher. This is why it is not submitted yet.
- Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
I browsed the git log and learned that switching was initially synchronized using the kernel's workqueue and was later changed to use a dedicated FIFO kernel thread; do you think it is better to go ahead and add a synchronous method for switching?
No. This is absolutely the wrong way to look at things.
The switcher is _just_ a specialized CPU hotplug agent with a special side effect. What it does is to tell the MCPM layer to shut CPU x down, power up CPU y, etc. It happens that cpuidle may be doing the same thing in parallel, and so does the classical CPU hotplug.
So you must add your voltage policy into the MCPM backend for your platform instead, _irrespective_ of the switcher presence.
First, the switcher is aware of the state on a per logical CPU basis. It knows when its logical CPU0 switched from the A15 to the A7. That logical CPU0 instance doesn't know and doesn't have to know what is happening with logical CPU1. The switcher does not perform cluster wide switching so it does not know when the entire A7 or the entire A15 cores are down. That's the job of the MCPM layer.
Another example: suppose that logical CPU0 is running on the A7 and logical CPU1 is running on the A15, but cpuidle for the latter decides to shut it down. The cpuidle driver will ask MCPM to shut down logical CPU1 which happens to be the last A15 and therefore no more A15 will be alive at that moment, even if the switcher knows that logical CPU1 is still tied to the A15. You certainly want to lower the voltage in that case too.
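For illustration, here is a very rough sketch of where such a voltage policy could sit in a platform's MCPM backend; it does not follow the exact mcpm_platform_ops prototypes, and the my_soc_* helpers are hypothetical SoC-specific code:

#include <linux/types.h>
#include <asm/cacheflush.h>

/* hypothetical SoC-specific helpers, named here only for the sketch */
extern void my_soc_set_cluster_retention_voltage(unsigned int cluster);
extern void my_soc_power_off_cpu(unsigned int cpu, unsigned int cluster);

static void my_soc_pm_power_down(unsigned int cpu, unsigned int cluster,
				 bool last_man)
{
	if (last_man) {
		/*
		 * Whole cluster is going down: flush L1+L2, then tell the
		 * power controller/PMIC it may drop the cluster rail once
		 * all CPUs are seen in WFI.  This path is hit no matter
		 * whether the request came from the switcher, cpuidle or
		 * hotplug -- they all funnel through MCPM.
		 */
		flush_cache_all();
		my_soc_set_cluster_retention_voltage(cluster);
	} else {
		flush_cache_louis();	/* this CPU's own caches only */
	}

	my_soc_power_off_cpu(cpu, cluster);
	/* the CPU then executes WFI and is powered off by the controller */
}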
- After the switcher is enabled, hotplug is disabled.
Actually, the current code could support hotplug with IKS: with IKS, each logical core maps to its corresponding physical core ID and GIC interface ID, so the system can track which physical core has been hot-unplugged and later hotplug that same physical core back in. So could you give more hints on why IKS needs to disable hotplug?
The problem is to decide what the semantic of a hotplug request would be.
Let's use an example. The system boots with CPUs 0,1,2,3. When the switcher initializes, it itself hot-unplugs CPUs 2,3 and only CPUs 0,1 remain. Of course physical CPUs 2 and 3 are used when a switch happens, but even if physical CPU 0 is switched to physical CPU 2, the logical CPU number as far as Linux is concerned remains CPU 0. So even if CPU 2 is running, Linux thinks this is still CPU 0.
Now, when the switcher is active, we must forbid any hotplugging of logical CPUs 2 and 3, or the semantic of the switcher would be broken. So that means keeping track of CPUs that can and cannot be hotplugged, and that lack of uniformity is likely to cause confusion in user space already.
But if you really want to hot-unplug CPU 0, this might correspond to either physical CPU0 or physical CPU2. What physical CPU should be brought back in when a hotplug request comes in?
And if the switcher is disabled after hot-unplugging CPU0 when it was still active, should both physical CPUs 0 and 2 be left disabled, or should logical CPU2 be brought back online nevertheless?
There are many issues to cover, and the code needed to deal with them becomes increasingly complex. And this is not only about the switcher, as some other parts of the kernel such as the PMU code might expect to shut down physical CPU0 when its hotplug callback is invoked for logical CPU0, etc. etc.
So please tell me: why do you want CPU hotplug in combination with the switcher in the first place? Using hotplug for power management is already a bad idea to start with given the cost and overhead associated with it. The switcher does perform CPU hotplugging behind the scenes but it avoids all the extra costs from a hotplug operation of a logical CPU in the core kernel.
But if you _really_ insist on performing CPU hotplug while using the switcher, you still can disable the switcher via sysfs, hot-unplug a CPU still via sysfs, and re-enable the switcher. When the switcher reinitializes, it will go through its pairing with the available CPUs, and if there is no available pairing for a logical CPU because one of the physical CPUs has been hot-unplugged then that logical CPU won't be available with the switcher.
Nicolas
On 05/08/2013 11:40 PM, Nicolas Pitre wrote:
On Wed, 8 May 2013, Leo Yan wrote:
- When the outbound core wakes up the inbound core, the outbound core's thread will sleep until the inbound core uses MCPM's early poke to send an IPI.
a) It looks like this method is there because the TC2 board has long latencies for powering the cluster and cores on/off; is that right? How about using a polling method instead? On our own SoC the wake-up interval takes _only_ about 10 ~ 20us.
There is no need to poll anything. If your SOC is fast enough in all cases, then the outbound may simply go ahead and let the inbound resume with the saved context whenever it is ready.
Yes, I went through the code and this should be fine; let's keep the current simple code.
There is a corner case here: the outbound core sets the power controller's register for power down, then flushes its L1 cache; if it is the last man of the cluster it needs to flush the L2 cache as well, and this operation may take a long time (about 2ms for a 512KB L2 cache).
In the meantime, the inbound core is running concurrently, and it may trigger another switch and call *mcpm_cpu_power_up()* to set some power controller registers for the outbound core; so when the outbound core finally executes "WFI", it cannot really be powered off by the power controller. So the polling here means the inbound core will wait until the outbound core is really powered off.
Even if the outbound core has not been powered off, this will not introduce any issue, because if the outbound core is woken up from the "WFI" state it will run the s/w reset sequence.
There is ONLY one thing to confirm here: that the state machine of the SoC's power controller is not disturbed at all by the corner case above. :-)
- Switching is now an asynchronous operation, meaning that after the function bL_switch_request returns we cannot say the switch has completed; we have some concerns about this.
For example, when switching from an A15 core to an A7 core, we may then want to decrease the voltage to save power; if the switch is an asynchronous operation, this may introduce the following issue: after returning from bL_switch_request, software decreases the voltage, but in the meantime the real switch is still ongoing on another pair of cores.
I browsed the git log and learned that switching was initially synchronized using the kernel's workqueue and was later changed to use a dedicated FIFO kernel thread; do you think it is better to go ahead and add a synchronous method for switching?
No. This is absolutely the wrong way to look at things.
The switcher is _just_ a specialized CPU hotplug agent with a special side effect. What it does is to tell the MCPM layer to shut CPU x down, power up CPU y, etc. It happens that cpuidle may be doing the same thing in parallel, and so does the classical CPU hotplug.
So you must add your voltage policy into the MCPM backend for your platform instead, _irrespective_ of the switcher presence.
First, the switcher is aware of the state on a per logical CPU basis. It knows when its logical CPU0 switched from the A15 to the A7. That logical CPU0 instance doesn't know and doesn't have to know what is happening with logical CPU1. The switcher does not perform cluster wide switching so it does not know when the entire A7 or the entire A15 cores are down. That's the job of the MCPM layer.
Another example: suppose that logical CPU0 is running on the A7 and logical CPU1 is running on the A15, but cpuidle for the latter decides to shut it down. The cpuidle driver will ask MCPM to shut down logical CPU1 which happens to be the last A15 and therefore no more A15 will be alive at that moment, even if the switcher knows that logical CPU1 is still tied to the A15. You certainly want to lower the voltage in that case too.
MCPM is a basic framework for cpuidle/IKS/hotplug, and all low power modes should go through MCPM's general APIs; so it makes sense to add the related code into the MCPM backend.
Let's look at another scenario: at the beginning logical core 0 is running on an A7 core; if the profiling governor (such as a cpufreq governor) thinks the performance is not high enough, it will call *bL_switch_request()* to switch to the A15 core, and *bL_switch_request()* will return immediately; but from this point on the governor will think it is already running on the A15, so it will do its profiling based on the A15's frequency while it is actually still running on the A7. So the switcher's async operation may mislead the governor.
What do you think about this?
- After the switcher is enabled, hotplug is disabled.
Actually, the current code could support hotplug with IKS: with IKS, each logical core maps to its corresponding physical core ID and GIC interface ID, so the system can track which physical core has been hot-unplugged and later hotplug that same physical core back in. So could you give more hints on why IKS needs to disable hotplug?
The problem is to decide what the semantic of a hotplug request would be.
Let's use an example. The system boots with CPUs 0,1,2,3. When the switcher initializes, it itself hot-unplugs CPUs 2,3 and only CPUs 0,1 remain. Of course physical CPUs 2 and 3 are used when a switch happens, but even if physical CPU 0 is switched to physical CPU 2, the logical CPU number as far as Linux is concerned remains CPU 0. So even if CPU 2 is running, Linux thinks this is still CPU 0.
Now, when the switcher is active, we must forbid any hotplugging of logical CPUs 2 and 3, or the semantic of the switcher would be broken. So that means keeping track of CPUs that can and cannot be hotplugged, and that lack of uniformity is likely to cause confusion in user space already.
But if you really want to hot-unplug CPU 0, this might correspond to either physical CPU0 or physical CPU2. What physical CPU should be brought back in when a hotplug request comes in?
At the functional level, whichever physical CPU was hot-unplugged is the one that should be brought back.
And if the switcher is disabled after hot-unplugging CPU0 when it was still active, should both physical CPUs 0 and 2 be left disabled, or should logical CPU2 be brought back online nevertheless?
This is a hard decision for dynamically enabling/disabling IKS. Maybe when disabling IKS we need to go back to the state we were in before enabling IKS: hotplug all cores back in and then disable IKS.
There are many issues to cover, and the code needed to deal with them becomes increasingly complex. And this is not only about the switcher, as some other parts of the kernel such as the PMU code might expect to shut down physical CPU0 when its hotplug callback is invoked for logical CPU0, etc. etc.
So please tell me: why do you want CPU hotplug in combination with the switcher in the first place? Using hotplug for power management is already a bad idea to start with given the cost and overhead associated with it. The switcher does perform CPU hotplugging behind the scenes but it avoids all the extra costs from a hotplug operation of a logical CPU in the core kernel.
Sometimes the customer has strict power requirements for a phone. We found we can get some benefit from hotplug/hot-unplug when the system has a low load; the basic reason is that we can reduce the number of times the cores enter/exit low power modes.
If the system only seldom has tasks to run, but there is more than one task on a core's runqueue, the kernel will send an IPI to another core to reschedule and run the thread; if the thread has a very low workload, most of the time is spent in the low power mode enter/exit flow rather than on real work, so hot-unplugging is the better choice.
The per-core timer has the same issue: if the core is powered off its local timer can no longer be used, so the kernel needs to use the broadcast timer to wake up the core, and the core will be woken up just to handle the timer event.
So if we hot-unplug the cores we can avoid many of these IPIs, and in the end we get some power benefit when the system has a low workload.
Let's use TC2 as an example to describe the hotplug implementation: logical CPU 0 has these virtual frequency points: 175MHz/200MHz/250MHz/300MHz/350MHz/400MHz/450MHz/500MHz for the A7 core, and 600MHz/700MHz/800MHz/900MHz/1000MHz/1100MHz/1200MHz for the A15 core.
When the system can meet the performance requirement, cpufreq will call IKS to first switch to the A7 core; if the core then runs at a virtual frequency <= 200MHz, the system can hot-unplug the core. If the kernel needs to improve performance, it executes the reverse flow: hotplug the A7 core back in -> switch to the A15 core.
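To make the intended flow explicit, here is a purely illustrative sketch; apart from cpu_up()/cpu_down(), every helper and threshold name below is made up for the example:

#include <linux/cpu.h>
#include <linux/types.h>

#define A7_UNPLUG_THRESHOLD_KHZ	200000	/* virtual 200MHz point from above */

/* hypothetical helpers, named only for this sketch */
extern bool running_on_a15(unsigned int cpu);
extern bool running_on_a7(unsigned int cpu);
extern bool needs_more_performance(unsigned int cpu);
extern void switch_to_a7(unsigned int cpu);	/* wraps an IKS switch request */
extern void switch_to_a15(unsigned int cpu);

static void sketch_eval_policy(unsigned int cpu, unsigned int virt_freq_khz)
{
	if (running_on_a15(cpu) && virt_freq_khz <= 500000) {
		switch_to_a7(cpu);		/* step 1: move to the A7     */
	} else if (running_on_a7(cpu) &&
		   virt_freq_khz <= A7_UNPLUG_THRESHOLD_KHZ) {
		cpu_down(cpu);			/* step 2: hot-unplug         */
	} else if (needs_more_performance(cpu)) {
		cpu_up(cpu);			/* reverse flow: plug back in */
		switch_to_a15(cpu);		/* then switch to the A15     */
	}
}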
But if you _really_ insist on performing CPU hotplug while using the switcher, you still can disable the switcher via sysfs, hot-unplug a CPU still via sysfs, and re-enable the switcher. When the switcher reinitializes, it will go through its pairing with the available CPUs, and if there is no available pairing for a logical CPU because one of the physical CPUs has been hot-unplugged then that logical CPU won't be available with the switcher.
Nicolas
On Fri, 10 May 2013, Leo Yan wrote:
On 05/08/2013 11:40 PM, Nicolas Pitre wrote:
On Wed, 8 May 2013, Leo Yan wrote:
- When the outbound core wakes up the inbound core, the outbound core's thread will sleep until the inbound core uses MCPM's early poke to send an IPI.
a) It looks like this method is there because the TC2 board has long latencies for powering the cluster and cores on/off; is that right? How about using a polling method instead? On our own SoC the wake-up interval takes _only_ about 10 ~ 20us.
There is no need to poll anything. If your SOC is fast enough in all cases, then the outbound may simply go ahead and let the inbound resume with the saved context whenever it is ready.
Yes, I went through the code and this should be fine; let's keep the current simple code.
There is a corner case here: the outbound core sets the power controller's register for power down, then flushes its L1 cache; if it is the last man of the cluster it needs to flush the L2 cache as well, and this operation may take a long time (about 2ms for a 512KB L2 cache).
In the meantime, the inbound core is running concurrently, and it may trigger another switch and call *mcpm_cpu_power_up()* to set some power controller registers for the outbound core; so when the outbound core finally executes "WFI", it cannot really be powered off by the power controller. So the polling here means the inbound core will wait until the outbound core is really powered off.
That should be handled with the code in mcpm_head.S that waits for the CLUSTER_GOING_DOWN state to go away.
What your MCPM backend can do as an optimization is to check the inbound cluster state once in a while during its L2 flush procedure, and abort the flush halfway when it sees INBOUND_COMING_UP.
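A rough sketch of that optimization follows; the chunking helper and the inbound-state check are hypothetical placeholders for backend-specific code (the real state lives in the MCPM sync structures):

#include <linux/types.h>

#define NR_L2_CHUNKS	16	/* arbitrary flush granularity for the sketch */

/* hypothetical backend helpers */
extern bool my_soc_inbound_coming_up(unsigned int cluster);	/* INBOUND_COMING_UP? */
extern void my_soc_flush_l2_chunk(unsigned int chunk);		/* clean/inv by set/way */

static void my_soc_flush_l2_abortable(unsigned int cluster)
{
	unsigned int chunk;

	for (chunk = 0; chunk < NR_L2_CHUNKS; chunk++) {
		/* Bail out early if the inbound side wants this cluster back. */
		if (my_soc_inbound_coming_up(cluster))
			return;
		my_soc_flush_l2_chunk(chunk);
	}
}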
Even if the outbound core has not been powered off, this will not introduce any issue, because if the outbound core is woken up from the "WFI" state it will run the s/w reset sequence.
There is ONLY one thing to confirm here: that the state machine of the SoC's power controller is not disturbed at all by the corner case above. :-)
Indeed. This is not trivial to get everything right.
The switcher is _just_ a specialized CPU hotplug agent with a special side effect. What it does is to tell the MCPM layer to shut CPU x down, power up CPU y, etc. It happens that cpuidle may be doing the same thing in parallel, and so does the classical CPU hotplug.
So you must add your voltage policy into the MCPM backend for your platform instead, _irrespective_ of the switcher presence.
MCPM is a basic framework for cpuidle/IKS/hotplug, and all low power modes should go through MCPM's general APIs; so it makes sense to add the related code into the MCPM backend.
Let's look at another scenario: at the beginning logical core 0 is running on an A7 core; if the profiling governor (such as a cpufreq governor) thinks the performance is not high enough, it will call *bL_switch_request()* to switch to the A15 core, and *bL_switch_request()* will return immediately; but from this point on the governor will think it is already running on the A15, so it will do its profiling based on the A15's frequency while it is actually still running on the A7. So the switcher's async operation may mislead the governor.
What do you think about this?
This is indeed the reason why a switch completion callback facility was recently added: to notify cpufreq governors that the operation has completed. The cpufreq layer has pre and post frequency change callbacks and obviously the post callback should be invoked only when the switch is complete.
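For illustration, here is a sketch of how a cpufreq driver could use that facility; the bL_switch_request_cb() prototype is a guess at the interface from Dave's patch (not a confirmed API), and cpufreq_notify_transition() is shown with the two-argument form used by kernels of this era:

#include <linux/cpufreq.h>

/* guessed prototype for the completion-callback variant of the request */
extern int bL_switch_request_cb(unsigned int cpu, unsigned int new_cluster_id,
				void (*completer)(void *cookie), void *cookie);

static void switch_done(void *cookie)
{
	struct cpufreq_freqs *freqs = cookie;

	/* Only now is the logical CPU really running on the new cluster. */
	cpufreq_notify_transition(freqs, CPUFREQ_POSTCHANGE);
}

static int sketch_set_target(struct cpufreq_freqs *freqs, unsigned int cpu,
			     unsigned int new_cluster_id)
{
	cpufreq_notify_transition(freqs, CPUFREQ_PRECHANGE);
	return bL_switch_request_cb(cpu, new_cluster_id, switch_done, freqs);
}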
- After the switcher is enabled, hotplug is disabled.
Actually, the current code could support hotplug with IKS: with IKS, each logical core maps to its corresponding physical core ID and GIC interface ID, so the system can track which physical core has been hot-unplugged and later hotplug that same physical core back in. So could you give more hints on why IKS needs to disable hotplug?
The problem is to decide what the semantic of a hotplug request would be.
Let's use an example. The system boots with CPUs 0,1,2,3. When the switcher initializes, it itself hot-unplugs CPUs 2,3 and only CPUs 0,1 remain. Of course physical CPUs 2 and 3 are used when a switch happens, but even if physical CPU 0 is switched to physical CPU 2, the logical CPU number as far as Linux is concerned remains CPU 0. So even if CPU 2 is running, Linux thinks this is still CPU 0.
Now, when the switcher is active, we must forbid any hotplugging of logical CPUs 2 and 3, or the semantic of the switcher would be broken. So that means keeping track of CPUs that can and cannot be hotplugged, and that lack of uniformity is likely to cause confusion in user space already.
But if you really want to hot-unplug CPU 0, this might correspond to either physical CPU0 or physical CPU2. What physical CPU should be brought back in when a hotplug request comes in?
At the functional level, whichever physical CPU was hot-unplugged is the one that should be brought back.
That might be easy to implement by doing a slight tweaking of the request vetoing performed by bL_switcher_hotplug_callback().
And if the switcher is disabled after hot-unplugging CPU0 when it was still active, should both physical CPUs 0 and 2 be left disabled, or should logical CPU2 be brought back online nevertheless?
This is a hard decision for dynamically enabling/disabling IKS. Maybe when disabling IKS we need to go back to the state we were in before enabling IKS: hotplug all cores back in and then disable IKS.
Yes, but that doesn't look pretty. Hence I want to be convinced of the value of hotplug with the switcher active before making such compromises.
So please tell me: why do you want CPU hotplug in combination with the switcher in the first place? Using hotplug for power management is already a bad idea to start with given the cost and overhead associated with it. The switcher does perform CPU hotplugging behind the scenes but it avoids all the extra costs from a hotplug operation of a logical CPU in the core kernel.
Sometimes the customer has strict power requirements for a phone. We found we can get some benefit from hotplug/hot-unplug when the system has a low load; the basic reason is that we can reduce the number of times the cores enter/exit low power modes.
If the system only seldom has tasks to run, but there is more than one task on a core's runqueue, the kernel will send an IPI to another core to reschedule and run the thread; if the thread has a very low workload, most of the time is spent in the low power mode enter/exit flow rather than on real work, so hot-unplugging is the better choice.
Can't you use cgroups for this instead?
The per-core timer has the same issue: if the core is powered off its local timer can no longer be used, so the kernel needs to use the broadcast timer to wake up the core, and the core will be woken up just to handle the timer event.
Patches are being pushed forward by Viresh Kumar to prevent work queues from waking up idle CPUs. I didn't look into the details myself, but I'm sure the timer events are in the same boat.
So if we hot-unplug the cores we can avoid many of these IPIs, and in the end we get some power benefit when the system has a low workload.
Let's use TC2 as an example to describe the hotplug implementation: logical CPU 0 has these virtual frequency points: 175MHz/200MHz/250MHz/300MHz/350MHz/400MHz/450MHz/500MHz for the A7 core, and 600MHz/700MHz/800MHz/900MHz/1000MHz/1100MHz/1200MHz for the A15 core.
When the system can meet the performance requirement, cpufreq will call IKS to first switch to the A7 core; if the core then runs at a virtual frequency <= 200MHz, the system can hot-unplug the core. If the kernel needs to improve performance, it executes the reverse flow: hotplug the A7 core back in -> switch to the A15 core.
As I said earlier, CPU hotplug is a very heavyweight operation in the core of the kernel. Hot-plugging a CPU back into the system may experience significant latency if the system is loaded, and a loaded system is exactly what normally triggers the need to bring it back in. It is far better to identify bad sources of CPU wakeups and fix them so as to leave the unneeded CPU in deep idle mode.
Nicolas
linaro-kernel@lists.linaro.org