On 12/12/2013 09:30 PM, Vincent Guittot wrote:
On 12 December 2013 07:19, Alex Shi <alex.shi@linaro.org> wrote:
Paul & Vincent & Morten,
I came up with the following rough idea during this KS. I would like an internal review before sending it to LKML. Could you give some comments?
==========================
1, The current scheduler load balance works in bottom-up mode: each CPU has to initiate the balance by itself. Take an integrated computer system with 4 levels of scheduler domains: smt/core/cpu/numa. If there are just 2 tasks in the whole system and both are running on cpu0, the current load balance needs to pull a task to another smt in the smt domain, then to another core, then to another cpu, and finally to another numa node. In total it needs 4 task moves to get the system balanced.
I don't fully agree with your example above. Nothing prevents the scheduler from directly migrating a task to another cpu without going through the smt and core steps. Nevertheless, only one cpu in a group can pull tasks for the entire group at a given level (the cpu level, for example), and the tasks will then be spread within this group of cpus during load balance at upper level (core and smt level).
Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has an excellent summary and explanation of this in kernel/sched/fair.c:4605
Another weakness of the current LB is that every cpu needs to fetch the other cpus' load info repeatedly and try to figure out the busiest sched group/queue at every sched domain level, yet it may not end up moving any task. One of the reasons is that a cpu can only pull tasks, not push them.
2, Consider the huge cost of task moving: CS, tlb/cache refill, and the useless fetching of remote cpu load info. If we could have a better solution for load balance, e.g. one that reduces the number of task moves to O(m), m := nr_tasks,
it would be a great win for performance. In the example above, we could move the task from cpu0 directly to another numa node. That needs only 1 task move and saves 3 CS and tlb/cache refills.
That's already possible, but it's not always the case (at least up to the CPU level; I'm not sure about the numa balance behavior, especially with the latest changes). But if you can ensure that you always move a task directly onto the right cpu, it would clearly be a great improvement.
Thanks for your comments, Vincent!
I didn't see anything special for the CPU domain level except in wakeup balance, but this case is only about running tasks. So why did you say 'it's already possible'? Did I miss something in the scheduler?
The hard part for this purpose is the baby-sitter logic. But with this mode, always moving tasks correctly is possible, while with the current bottom-up mode it is impossible.
Vincent
To get to this point, a rough idea is to change the load balance behaviour to top-down mode. Say, let each cpu report its own load status in per-cpu memory, and have a baby-sitter in the system collect the cpus' load info, decide centrally how to move tasks, and finally send an IPI to each hungry cpu to let it pull its load quota from the appointed cpus.
As in the example above, the baby-sitter will fetch each cpu's load info, then send a pull-task IPI to let a cpu in another numa node pull one task from cpu0. So the task pulling still involves just 2 cpus and can reuse the move_tasks functions.
BTW, the baby-sitter can take care of all kinds of balance: regular balance, idle balance, wake up balance.
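To make the planning step concrete, here is a minimal userspace sketch in Python of what the baby-sitter's decision could look like. It is not kernel code and everything in it is an assumption made for illustration: the name plan_pulls, the use of a plain task count as "load", and the (dst, src, quota) tuples standing in for pull-task IPIs. It also ignores topology, so it does not choose which numa node the destination cpu sits in; it only shows that each surplus task is assigned exactly one move.

```python
def plan_pulls(loads):
    """Hypothetical baby-sitter planner (illustration only).

    loads: dict mapping cpu id -> number of runnable tasks, standing in
           for the per-cpu load reports.
    Returns a list of (dst_cpu, src_cpu, quota) tuples; each tuple stands
    in for one pull-task IPI sent to dst_cpu.
    """
    if not loads:
        return []

    total = sum(loads.values())
    base, extra = divmod(total, len(loads))

    # Busiest cpus first; ties broken by cpu id so the plan is deterministic.
    order = sorted(loads, key=lambda cpu: (-loads[cpu], cpu))

    # Every cpu should end up with `base` tasks; the `extra` busiest cpus
    # keep one more, which avoids moving tasks that need not move.
    target = {cpu: base + (1 if i < extra else 0) for i, cpu in enumerate(order)}

    surplus = [[cpu, loads[cpu] - target[cpu]] for cpu in order if loads[cpu] > target[cpu]]
    deficit = [(cpu, target[cpu] - loads[cpu]) for cpu in order if loads[cpu] < target[cpu]]

    plan = []
    si = 0
    for dst, need in deficit:
        # Total surplus equals total deficit, so surplus[si] always exists here.
        while need > 0:
            src, avail = surplus[si]
            quota = min(need, avail)
            plan.append((dst, src, quota))
            need -= quota
            surplus[si][1] -= quota
            if surplus[si][1] == 0:
                si += 1
    return plan
```

With the 2-tasks-on-cpu0 example, plan_pulls({0: 2, 1: 0, 2: 0, 3: 0}) yields a single pull of one task from cpu0. A real version would have to pick the destination by topology and weigh load rather than count tasks.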
3, One concern about top-down mode is that the baby-sitter needs remote access to cpu load info at the top domain level every time. But the fact is that the current load balance also needs to get remote cpu load info for top level domain balance, and even worse, such remote accesses may be useless. -- Since there is just one thread reading the info and no competing writer, Paul, do you think it is a worthy concern?
BTW, to reduce unnecessary remote info fetching, we can use the current idle_cpus_mask in nohz: we simply skip the idle cpus in this cpumask.
4, From a power saving POV, top-down gives the whole system's cpu topology info directly. So besides reducing CS, it can reduce the interference to idle cpus from a task in transit, and let idle cpus sleep better.
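The idle-mask shortcut can be sketched the same way. This is a userspace Python illustration, not the kernel's cpumask API: collect_loads, the integer bitmask, and the read_load callback are all hypothetical stand-ins for the baby-sitter's collection loop, nohz's idle_cpus_mask, and a remote read of a cpu's load.

```python
def collect_loads(nr_cpus, idle_mask, read_load):
    """Gather per-cpu load, skipping cpus marked idle (illustration only).

    nr_cpus:   number of cpus to scan.
    idle_mask: integer bitmask, bit N set means cpu N is idle
               (a stand-in for nohz's idle_cpus_mask).
    read_load: callable taking a cpu id and returning its load
               (a stand-in for one remote load read).
    Returns (loads, remote_reads).
    """
    loads = {}
    remote_reads = 0
    for cpu in range(nr_cpus):
        if (idle_mask >> cpu) & 1:
            # Idle cpu: assume zero load and do not touch its remote data.
            loads[cpu] = 0
            continue
        loads[cpu] = read_load(cpu)
        remote_reads += 1
    return loads, remote_reads
```

For a 4-cpu system with cpu1 and cpu2 idle (idle_mask = 0b0110), only two remote reads are issued instead of four.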
--
Thanks
Alex