On 03/28/2014 07:25 PM, Chris Redpath wrote:
Hi Alex,
Glad to have you looking at my code as well :)
On 28/03/14 09:14, Alex Shi wrote:
On 03/24/2014 09:47 PM, Chris Redpath wrote:
When a normal forced up-migration takes place we stop the task to be migrated while the target CPU becomes available. This delay can range from 80us to 1500us on TC2 if the target CPU is in a deep idle state.
Instead, interrupt the target CPU and ask it to pull a task. This lets the current eligible task continue executing on the original CPU while the target CPU wakes. Use a pinned timer to prevent the pulling CPU going back into power-down with pending up-migrations.
If a nohz kick has already been triggered, it does not matter that we also trigger an idle pull: the idle_pull flag will be set when the softirq executes, so we will still do the idle pull.
If the target CPU is busy, we will not pull any tasks.
Chris, I do not fully understand the MP feature. So correct me if I am wrong. :)
The trade-off is one more reschedule interrupt plus keeping the big CPU alive, which causes more energy cost.
It's not really an extra cost. We would have performed the reschedule anyway in order to do the migration; the difference is that previously we waited in the CPU stopper on the source CPU while the target CPU woke from sleep, whereas now we continue executing while that happens.
Since we are always waking an idle big CPU when we make this decision, we are typically paying an idle wakeup cost each time. When running the mobile workloads we are mostly interested in, that idle wakeup is frequently a wakeup from cluster shutdown mode which can be over 1ms.
The aim of this change was to try and prevent dropped frames during hmp up-migrations caused by the execution stalling while waiting for the target CPU to become available.
The CPU keepalive is there to prevent entering deep idle states in the couple of hundred microseconds that the CPU stopper takes to run on the source CPU. It could be more logically expressed as a (very) temporary idle latency requirement, except that we cannot express such constraints for a single CPU in the kernel today.
Hi Chris, I really appreciate your detailed explanations! It looks like your patch makes an excellent improvement on this issue.
I am just wondering if we could use a simpler way to resolve this problem, like the following patch: let the migration destination CPU run the active load balance instead of the source CPU. That would give the source CPU time to keep running while the destination wakes, and the destination would not need a keepalive. What's your opinion on this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7984458..f30e598 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6275,7 +6275,7 @@ more_balance:
 		raw_spin_unlock_irqrestore(&busiest->lock, flags);

 		if (active_balance) {
-			stop_one_cpu_nowait(cpu_of(busiest),
+			stop_one_cpu_nowait(busiest->push_cpu,
 				active_load_balance_cpu_stop, busiest,
 				&busiest->active_balance_work);
 		}
@@ -7198,7 +7198,7 @@ static void hmp_force_up_migration(int this_cpu)
 		raw_spin_unlock_irqrestore(&target->lock, flags);

 		if (force)
-			stop_one_cpu_nowait(cpu_of(target),
+			stop_one_cpu_nowait(target->push_cpu,
 				hmp_active_task_migration_cpu_stop,
 				target, &target->active_balance_work);
 	}
@@ -7295,7 +7295,7 @@ static unsigned int hmp_idle_pull(int this_cpu)
 		if (force) {
 			/* start timer to keep us awake */
 			hmp_cpu_keepalive_trigger();
-			stop_one_cpu_nowait(cpu_of(target),
+			stop_one_cpu_nowait(target->push_cpu,
 				hmp_active_task_migration_cpu_stop,
 				target, &target->active_balance_work);
 		}
So do you have data showing the trade-off is worthwhile? For example, the reschedule interrupt cost plus the CPU keepalive cost versus the cost of going idle and being woken up, or benchmark data showing that we gain in performance/power.
I have traces which show the resulting improvement, but it's so small that it is lost in the noise in all the benchmarks we have. Most of the benchmarks do not actually involve that much migration between clusters - typically the 'benchmark' app tasks start heavy processing and continue until complete, and with the HMP thresholds we use, our lighter workloads generally migrate once or twice per operation.
Are your statistics too sensitive to share on the linaro-kernel mailing list? If not, I would be very glad to see your data. :)
We have loads of data showing that the change has no detrimental impact on any of the metrics for our benchmarked scenarios (power is largely unchanged, as is performance); however, I need to complete the microbenchmark I've been working on to get some numbers showing what is visible in the traces.
A microbenchmark would be a good way to persuade the community to accept it. :)