Hi Alex,
Glad to have you looking at my code as well :)
On 28/03/14 09:14, Alex Shi wrote:
> On 03/24/2014 09:47 PM, Chris Redpath wrote:
>> When a normal forced up-migration takes place, we stop the task to be migrated while the target CPU becomes available. This delay can range from 80us to 1500us on TC2 if the target CPU is in a deep idle state.
>> Instead, interrupt the target CPU and ask it to pull a task. This lets the current eligible task continue executing on the original CPU while the target CPU wakes. Use a pinned timer to prevent the pulling CPU going back into power-down with pending up-migrations.
>> If we also trigger a nohz kick, it does not matter that we triggered for an idle pull: the idle_pull flag will be set when we execute the softirq, so we will still do the idle pull.
>> If the target CPU is busy, we will not pull any tasks.
> Chris, I do not fully understand the MP feature, so correct me if I am wrong. :)
> The trade-off is one more reschedule interrupt and keeping the big CPU alive, which causes more energy cost.
It's not really an extra cost. We would have performed the reschedule anyway in order to do the migration; the difference is that previously we waited in the CPU stopper on the source CPU while the target CPU woke from sleep, whereas now we continue executing while that happens.
Since we are always waking an idle big CPU when we make this decision, we are typically paying an idle wakeup cost each time. When running the mobile workloads we are mostly interested in, that idle wakeup is frequently a wakeup from cluster shutdown mode which can be over 1ms.
The aim of this change was to try to prevent dropped frames during HMP up-migrations caused by execution stalling while waiting for the target CPU to become available.
The CPU keepalive is there to prevent entering deep idle states in the couple of hundred microseconds that the CPU stopper takes to run on the source CPU. It could be more logically expressed as a (very) temporary idle latency requirement, except that we cannot express such constraints for a single CPU in the kernel today.
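To make that concrete, the keepalive amounts to something like this (a simplified sketch only; the names and the 2ms delay are illustrative, not the exact patch code):

#include <linux/hrtimer.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct hrtimer, hmp_keepalive_timer);

static enum hrtimer_restart hmp_keepalive_fn(struct hrtimer *t)
{
	/*
	 * Nothing to do: the timer exists only so that cpuidle sees a
	 * near-future event on this CPU and avoids the deep states
	 * while the source CPU's stopper migrates the task over.
	 */
	return HRTIMER_NORESTART;
}

static void hmp_cpu_keepalive(int cpu)
{
	struct hrtimer *timer = &per_cpu(hmp_keepalive_timer, cpu);

	/*
	 * hrtimer_init() and the hmp_keepalive_fn hookup happen once
	 * per CPU at init time (omitted here). REL_PINNED keeps the
	 * expiry on this CPU so nohz cannot migrate the event away
	 * and defeat the purpose.
	 */
	hrtimer_start(timer, ns_to_ktime(2 * NSEC_PER_MSEC),
		      HRTIMER_MODE_REL_PINNED);
}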
> So do you have data showing the trade-off is worthwhile? For example, the resched interrupt cost and the CPU-alive cost vs. the cost of going idle and being woken up, or benchmark data showing a performance/power benefit.
I have traces which show the resulting improvement, but it is so small that it is lost in the noise in all the benchmarks we have. Most of the benchmarks do not actually involve much migration between clusters: typically the 'benchmark' app tasks start heavy processing and continue until complete, and with the HMP thresholds we use, our lighter workloads generally migrate once or twice per operation.
We have plenty of data showing that the change has no detrimental impact on any of the metrics for our benchmarked scenarios (power is largely unchanged, as is performance); however, I need to complete the microbenchmark I have been working on to get some numbers showing what is visible in the traces.
> As to the one extra resched interrupt, could we check the target's pending timer before adding a new one? If it already has a timer due soon, we can save the keep_alive interrupt.
We could do this, but since we are waking an idle CPU I do not expect us ever to have such a pending timer. I have wondered whether it is worth cancelling the pending timer event when we start executing the new task, but I haven't done any measurements in that area.
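If we did want that check, it would only be a couple of lines on top of the sketch above (again illustrative; hrtimer_is_queued() only covers our own keepalive timer, since cheaply inspecting the whole of a remote CPU's timer wheel is not really exposed):

static void hmp_cpu_keepalive_maybe(int cpu)
{
	struct hrtimer *timer = &per_cpu(hmp_keepalive_timer, cpu);

	/* An earlier keepalive is still pending: nothing to arm. */
	if (hrtimer_is_queued(timer))
		return;

	hrtimer_start(timer, ns_to_ktime(2 * NSEC_PER_MSEC),
		      HRTIMER_MODE_REL_PINNED);
}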
> BTW, did we check the little CPU domain to see whether it is under-utilized? If so, relieving the big CPU load would help power efficiency. I see the new idle pull is only for big CPUs, not for little CPUs.
We have a number of mechanisms in place to try to avoid ever having a situation where the load on a big CPU could be relieved by a little CPU. In the HMP patches, our little CPUs are allowed to run any task, but the big CPUs are only allowed to run tasks whose tracked (unweighted) load is above our up-threshold. We do this by removing the scheduler's cluster balancing and replacing it at a handful of key points with a load-based decision (both task load and domain loads are used).
When big tasks become runnable, we will use a big CPU if we have an idle one but we will happily use a little CPU if all the big CPUs are in use. While a big task is running, we will move it to a big CPU if there is an idle one, and if a big CPU becomes idle it will check that there are no suitable tasks on a little CPU that could be pulled before it goes idle.
This way, we mostly avoid a situation where the big cluster is overloaded and the little cluster has spare capacity.
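In sketch form, the placement rule boils down to something like this (illustrative only; the threshold value, the helpers and the exact field names are assumptions standing in for the real functions):

static unsigned int hmp_up_threshold = 700;	/* scale: 0..1024 */

static inline bool hmp_task_wants_big(struct task_struct *p)
{
	/* Unweighted tracked load from the per-entity load tracking. */
	return p->se.avg.load_avg_ratio >= hmp_up_threshold;
}

static int hmp_select_cpu(struct task_struct *p, int prev_cpu)
{
	if (hmp_task_wants_big(p)) {
		int cpu = hmp_idle_big_cpu();	/* assumed helper */

		if (cpu >= 0)
			return cpu;
	}
	/* Big CPUs all busy, or task below threshold: stay little. */
	return hmp_select_little_cpu(p, prev_cpu);	/* assumed helper */
}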
If we do get into a situation where we have multiple tasks resident on a big CPU (it can happen if we pull something and the original task wakes up again), we check to see if the overall progress can be better served by offloading one of the heavier tasks to a little CPU.
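The offload check itself is conceptually simple, something like the following (a sketch; hmp_idle_little_cpu() is an assumed helper):

static bool hmp_should_offload(struct rq *rq)
{
	/*
	 * More than one runnable CFS task on a big CPU means somebody
	 * is waiting behind the current task; overall progress may be
	 * better if one of them runs on an idle little CPU instead,
	 * even at lower capacity.
	 */
	return rq->cfs.h_nr_running > 1 && hmp_idle_little_cpu() >= 0;
}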
It is true that in really heavy load situations where you have > num_cpus busy tasks, we could do better by allowing tasks to spread according to the relative compute capacities of the cores, but this is not a situation we have seen in any mobile workload yet. Hopefully energy-aware scheduling will handle this perfectly :)
We do have a difference in the newly-idle behaviour depending on whether the CPU is big or little. Since we have disabled the CPU-level balance, CPUs do not idle pull across the clusters, only inside them. This covers spreading light load and heavy load independently, and that is the end of the story for little CPUs.
Big CPUs can idle pull from other big CPUs (as normal) and, if that doesn't find a task, they have an additional step which finds the heaviest-load task on any little CPU and pulls it if the task load is above the up-threshold.
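Roughly, that extra step looks like this (the helper names and the cpumask are illustrative; the real code also has to take the appropriate rq locks and hand the task over to the idle-pull machinery):

static struct task_struct *hmp_heaviest_little_task(void)
{
	struct task_struct *heaviest = NULL;
	unsigned long best = hmp_up_threshold;
	int cpu;

	for_each_cpu(cpu, &hmp_little_cpu_mask) {	/* assumed mask */
		struct task_struct *p = cpu_curr(cpu);

		/* Only steal tasks heavy enough to deserve a big CPU. */
		if (p->se.avg.load_avg_ratio >= best) {
			best = p->se.avg.load_avg_ratio;
			heaviest = p;
		}
	}
	return heaviest;	/* NULL means nothing worth pulling */
}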
--Chris