Paul & Vincent & Morten,
The following rough idea came up during this KS. I want to have an internal review before sending it to LKML. Would you like to give some comments?
========================== 1, The current scheduler load balance works bottom-up: each CPU has to initiate the balance by itself. Take an integrated computer system with smt/core/cpu/numa, i.e. 4 levels of scheduler domains. If there are just 2 tasks in the whole system and both are running on cpu0, the current load balance first pulls a task to another smt sibling inside the smt domain, then to another core, then to another cpu, and finally to another numa node. In total it takes 4 task migrations to get the system balanced. Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has an excellent summary and explanation of this in kernel/sched/fair.c:4605
Another weakness of the current LB is that every cpu repeatedly fetches the other cpus' load info and tries to figure out the busiest sched group/queue at every sched domain level, yet may not end up moving any task; one of the reasons is that a cpu can only pull tasks, not push them.
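For reference, here is a hand-wavy sketch of the bottom-up walk described above, modelled loosely on rebalance_domains() in kernel/sched/fair.c (not the actual kernel code; try_to_pull_one_level() is a made-up placeholder for the find_busiest_group()/move_tasks() work):

/*
 * Every CPU walks its own domain hierarchy, samples remote load at each
 * level, and can only pull towards itself.
 */
static void bottom_up_balance_sketch(int this_cpu)
{
	struct sched_domain *sd;

	rcu_read_lock();
	for_each_domain(this_cpu, sd) {		/* smt -> core -> cpu -> numa */
		if (time_before(jiffies, sd->last_balance +
				msecs_to_jiffies(sd->balance_interval)))
			continue;

		/*
		 * This is where find_busiest_group()/find_busiest_queue()
		 * read every remote CPU's load within the domain, even when
		 * no task ends up being moved.
		 */
		try_to_pull_one_level(this_cpu, sd);	/* hypothetical helper */

		sd->last_balance = jiffies;
	}
	rcu_read_unlock();
}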
2, Consider the huge cost of task moving: context switches, tlb/cache refills, and the useless remote cpu load info fetching. If we can find a better solution for load balance, e.g. reduce the number of migrations to O(m), m := nr_tasks,
it would be a great win on performance. Like in the above example, we could move a task from cpu0 directly to another numa node. That needs only 1 task migration and saves 3 context switches and tlb/cache refills.
To get there, a rough idea is to change the load balance behaviour to a top-down mode. Say, let each cpu report its own load status in per-cpu memory, and have a baby-sitter in the system collect these cpus' load info, decide centrally how to move tasks, and finally send an IPI to each hungry cpu to let it pull its load quota from the appointed cpus.
Like in the above example, the baby-sitter fetches each cpu's load info, then sends a pull-task IPI to let a cpu in another numa node pull one task from cpu0. So the task pulling still involves just 2 cpus and can reuse the move_tasks functions.
BTW, the baby-sitter can take care of all kinds of balance: regular balance, idle balance, wake-up balance.
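A minimal sketch of the scheme, purely illustrative (exported_load, pull_from, babysitter_scan() and the imbalance threshold are all made-up names, not existing kernel symbols):

/* Each CPU publishes its own load in per-cpu memory. */
static DEFINE_PER_CPU(unsigned long, exported_load);
static DEFINE_PER_CPU(int, pull_from);

#define SITTER_IMBALANCE	128	/* arbitrary threshold */

static void export_self_load(void)
{
	/* weighted_cpuload() as already used by fair.c */
	this_cpu_write(exported_load, weighted_cpuload(smp_processor_id()));
}

/* The baby-sitter: one place that scans all loads and appoints pull pairs. */
static void babysitter_scan(void)
{
	unsigned long load, max_load = 0, min_load = ULONG_MAX;
	int cpu, busiest = -1, hungriest = -1;

	for_each_online_cpu(cpu) {
		load = per_cpu(exported_load, cpu);
		if (load > max_load) { max_load = load; busiest = cpu; }
		if (load < min_load) { min_load = load; hungriest = cpu; }
	}

	if (busiest >= 0 && hungriest >= 0 &&
	    max_load > min_load + SITTER_IMBALANCE) {
		/* Tell the hungry cpu whom to pull from, then kick it. */
		per_cpu(pull_from, hungriest) = busiest;
		smp_send_reschedule(hungriest);	/* or a dedicated IPI that
						 * ends up in move_tasks() */
	}
}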
3, One concern about the top-down mode is that the baby-sitter needs remote access to cpu load info at the top domain level every time. But the fact is that the current load balance also needs to fetch remote cpu load info for top-level domain balance, and worse, such remote accesses may be useless. -- Since there is just one thread reading the info and no competing writer, Paul, do you think this is a concern worth worrying about?
BTW, to reduce unnecessary remote info fetching, we can use the existing idle_cpus_mask in nohz: simply skip the idle cpus in that cpumask.
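For example, a small sketch of how the baby-sitter could skip the cpus already marked idle (busy_mask and babysitter_sample_busy() are made up; nohz.idle_cpus_mask is the existing mask in kernel/sched/fair.c):

static void babysitter_sample_busy(struct cpumask *busy_mask)
{
	int cpu;

	/* Only look at CPUs that are not currently in the nohz idle mask. */
	cpumask_andnot(busy_mask, cpu_online_mask, nohz.idle_cpus_mask);

	for_each_cpu(cpu, busy_mask) {
		/* read per_cpu(exported_load, cpu), as in the sketch above */
	}
}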
4, From a power saving POV, the top-down mode gives a direct view of the whole system's cpu topology. So besides reducing context switches, it can reduce the interference to idle cpus from transient tasks and let idle cpus sleep better.
On 12 December 2013 07:19, Alex Shi alex.shi@linaro.org wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance.
I don't fully agree with your example above. Nothing prevents the scheduler from directly migrating a task to another cpu without going through the smt and core steps. Nevertheless, only one cpu in a group can pull tasks for the entire group at a given level (the cpu level, as an example), and then the tasks are spread within this group of cpus during load balance at the levels below (core and smt level).
Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
That's already possible, but it doesn't always happen (at least up to the CPU level; I'm not sure of the numa balance behavior, especially with the latest changes). But if you can ensure that you always move a task directly onto the right cpu, it would clearly be a great improvement.
Vincent
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex
On 12/12/2013 09:30 PM, Vincent Guittot wrote:
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance.
I don't fully agree with your example above. Nothing prevent the scheduler to directly migrate a task to another cpu without going to smt and core step. Nevertheless, only one cpu in a group can pull tasks for the entire group at a level (cpu level as an example) and then the tasks will be spread in this group of cpus during load balance at upper level (core and smt level)
Yes, the task moving may start from some middle or top level. But given the balance sequence -- bottom-up -- and the longer and longer balance intervals of the higher domains, this kind of step-by-step moving is still possible.
On 12/12/2013 09:30 PM, Vincent Guittot wrote:
On 12 December 2013 07:19, Alex Shi alex.shi@linaro.org wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance.
I don't fully agree with your example above. Nothing prevent the scheduler to directly migrate a task to another cpu without going to smt and core step. Nevertheless, only one cpu in a group can pull tasks for the entire group at a level (cpu level as an example) and then the tasks will be spread in this group of cpus during load balance at upper level (core and smt level)
Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
That's already possible but that's not always the case (at least up to the CPU level but i'm not sure of the numa balance behavior especially with latest change) but if you can ensure that you always move a task directly onto the right cpu, it would clearly be a great improvement
Thanks for your comments, Vincent!
I didn't see anything special for the CPU domain level except in the wakeup balance, but this case is only about running tasks. So why did you say 'it's already possible'? Did I miss something in the scheduler?
The hard part of this proposal is the baby-sitter logic. But with this mode it is possible to always move tasks correctly, while the current bottom-up mode cannot do that.
Vincent
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex
On 13 December 2013 11:04, Alex Shi alex.shi@linaro.org wrote:
On 12/12/2013 09:30 PM, Vincent Guittot wrote:
On 12 December 2013 07:19, Alex Shi alex.shi@linaro.org wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance.
I don't fully agree with your example above. Nothing prevent the scheduler to directly migrate a task to another cpu without going to smt and core step. Nevertheless, only one cpu in a group can pull tasks for the entire group at a level (cpu level as an example) and then the tasks will be spread in this group of cpus during load balance at upper level (core and smt level)
Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
That's already possible but that's not always the case (at least up to the CPU level but i'm not sure of the numa balance behavior especially with latest change) but if you can ensure that you always move a task directly onto the right cpu, it would clearly be a great improvement
Thanks for your comments, Vincent!
I didn't see sth special for CPU domain level unless in wakeup balance. but this case is only for running tasks. So why you said 'it's already possible'? Did I miss sth in scheduler?
I haven't been clear in my explanation. I wanted to say that it's possible to move a task directly to the right CPU node without spurious migrations at the SMT level and then the MC level first.
Let's take the topology example described just below:
SMT: | 0-1 | 2-3 | 4-5 | 6-7 |
MC:  | 0-3 | 4-7 |
CPU: | 0-7 |
Now, let's imagine that there are 2 tasks (A and B) on cpu6. The next idle balance will directly move one of the 2 tasks to cpu0, which is the best final cpu.
Vincent
The hard part this purpose is the baby-sitter logical. but with this mode always moving task correct is possible. and current bottom-up mode is impossible.
Vincent
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex
-- Thanks Alex
On 12/13/2013 10:10 PM, Vincent Guittot wrote:
Thanks for your comments, Vincent!
I didn't see sth special for CPU domain level unless in wakeup balance. but this case is only for running tasks. So why you said 'it's already possible'? Did I miss sth in scheduler?
I haven't been clear in my explanation. I wanted to say that it's possible to move a task directly to the right CPU node without spurious migration in the SMT level, then MC level before
Let take the topology example describe just below:
SMT | 0-1 | 2-3 | 4-5 | 6-7 | MC | 0-3 | 4-7 | CPU | 0-7
Now, let imagine that there are 2 tasks (A and B) on cpu6. The next idle balance will move directly one of the 2 task on cpu0 which is the best final cpu
Thanks Vincent!
The nohz idle balance picks the idle cpu with the minimum cpu number. So in my example, with the tasks on cpu0, it is impossible to move a task to the other MC directly; you have to assume the tasks are on cpu6, not cpu0. :) And in fact newidle balance happens at random times and then moves tasks randomly. These mechanisms slightly ease the unnecessary task moving, but they cannot solve it.
On 14 December 2013 09:50, Alex Shi alex.shi@linaro.org wrote:
On 12/13/2013 10:10 PM, Vincent Guittot wrote:
Thanks for your comments, Vincent!
I didn't see sth special for CPU domain level unless in wakeup balance. but this case is only for running tasks. So why you said 'it's already possible'? Did I miss sth in scheduler?
I haven't been clear in my explanation. I wanted to say that it's possible to move a task directly to the right CPU node without spurious migration in the SMT level, then MC level before
Let take the topology example describe just below:
SMT | 0-1 | 2-3 | 4-5 | 6-7 | MC | 0-3 | 4-7 | CPU | 0-7
Now, let imagine that there are 2 tasks (A and B) on cpu6. The next idle balance will move directly one of the 2 task on cpu0 which is the best final cpu
Thanks Vincent!
The nohz idle balance will pick up idle cpu the minimum cpu number. So,
Yes, and that's why I'm looking for the closest idle cpu when doing idle balance in my packing patches, because I'm aiming at power saving. But it could be interesting to do the opposite from a performance POV: instead of simply selecting the 1st cpu in the idle mask, we could select an idle cpu that is far from the current one in order to make the spreading of tasks more efficient.
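For illustration, a rough sketch of such a selection (find_farthest_idle_cpu() is a made-up helper; it walks the domain hierarchy upwards and prefers idle CPUs that only become reachable at higher levels):

static int find_farthest_idle_cpu(int src)
{
	struct sched_domain *sd;
	int cpu, best = -1;

	rcu_read_lock();
	/* for_each_domain() walks from the lowest (SMT) level upwards; an
	 * idle CPU that only appears at a higher level is topologically
	 * farther from 'src', so keep overwriting 'best' on the way up. */
	for_each_domain(src, sd) {
		for_each_cpu_and(cpu, sched_domain_span(sd), nohz.idle_cpus_mask) {
			if (!sd->child ||
			    !cpumask_test_cpu(cpu, sched_domain_span(sd->child))) {
				best = cpu;
				break;	/* one candidate per level is enough */
			}
		}
	}
	rcu_read_unlock();

	return best;	/* -1 if no idle CPU was found */
}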
in my example, tasks on cpu0, it is impossible to move task to another MC. So you have to assume tasks on cpu6, not cpu0. :) And in fact newidle balance will randomly happen and then moving task randomly. They will slightly ease this unnecessary task moving, but can not solving it.
-- Thanks Alex
On 12/16/2013 06:31 PM, Vincent Guittot wrote:
On 14 December 2013 09:50, Alex Shi alex.shi@linaro.org wrote:
On 12/13/2013 10:10 PM, Vincent Guittot wrote:
Thanks for your comments, Vincent!
I didn't see sth special for CPU domain level unless in wakeup balance. but this case is only for running tasks. So why you said 'it's already possible'? Did I miss sth in scheduler?
I haven't been clear in my explanation. I wanted to say that it's possible to move a task directly to the right CPU node without spurious migration in the SMT level, then MC level before
Let take the topology example describe just below:
SMT | 0-1 | 2-3 | 4-5 | 6-7 | MC | 0-3 | 4-7 | CPU | 0-7
Now, let imagine that there are 2 tasks (A and B) on cpu6. The next idle balance will move directly one of the 2 task on cpu0 which is the best final cpu
Thanks Vincent!
The nohz idle balance will pick up idle cpu the minimum cpu number. So,
Yes, and that's why i'm looking for the closest idle cpu for doing idle balance in my packing patches because i'm looking for powersaving. Buit it could be interesting to do the opposite for a performance pov. Instead of simply selecting the 1st cpu in the idle mask, we can select a idle cpu that is far from the current one in order to to make the spread of task more efficient
At the SMT domain level, the numerically closest cpu is not the closest one in the CPU topology... :( Like on my desktop:
cor  CPU
  0    0
  0    4
  1    1
  1    5
  2    2
  2    6
  3    3
  3    7
in my example, tasks on cpu0, it is impossible to move task to another MC. So you have to assume tasks on cpu6, not cpu0. :) And in fact newidle balance will randomly happen and then moving task randomly. They will slightly ease this unnecessary task moving, but can not solving it.
-- Thanks Alex
Hi Paul,
Would you like to give some comments when you're free? Especially on the 3rd point.
I'd appreciate any input from you!
On 12/12/2013 02:19 PM, Alex Shi wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance. Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
On Fri, Dec 13, 2013 at 06:09:47PM +0800, Alex Shi wrote:
Hi Paul,
Would you like to give some comments when you'r free? Especially for the 3rd point.
Appreciate for your any input!
Hello, Alex,
I guess my main question is why we are discussing this internally rather than on LKML. Is this to be some writeup somewhere? (That would be a good thing.)
On to #3 below...
On 12/12/2013 02:19 PM, Alex Shi wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance. Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
Might be -- key thing is to measure it on a big system. If it is a problem, here are some workarounds with varying degrees of sanity:
1. Reduce cache misses by keeping three values:
a. Current load value.
b. Last exported value. (Should be in the same cache line as a.)
c. Exported value. (Should be in a separate cache line.)
Then avoid writing to c unless a has deviated sufficiently from b. If the load values stay constant for long time periods, this can reduce the number of cache misses. On the other hand, it introduces an extra compare and branch -- though this should not be a problem if the load value is computed relatively infrequently. (A small sketch of this scheme follows after the list below.)
2. As above, but provide additional information to allow the other CPU to extrapolate values. For example, if a CPU goes idle, some of its values will decay exponentially. The remote CPU could compute the decay rate, removing the need for the subject CPU to wake up to recompute its value. (Maybe you already do this?)
Similarly if a CPU remains CPU-bound with given runqueue loading for some time.
3. Allow the exported values to become inaccurate, and resample the actual values remotely if extrapolated values indicate that action is warranted.
There are probably other approaches. I am being quite general here because I don't have the full picture of the scheduler statistics in my head. It is likely possible to obtain a much better approach by considering the scheduler's specifics.
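As a concrete illustration of workaround 1 above, here is a minimal sketch (struct cpu_load_export, update_exported_load() and EXPORT_DELTA are made-up names; the deviation threshold is arbitrary):

/*
 * 'cur' and 'last_exported' stay in the CPU-local cache line; 'exported'
 * sits in its own cache line and is the only field remote CPUs read.
 */
struct cpu_load_export {
	unsigned long cur;		/* a. current load value       */
	unsigned long last_exported;	/* b. value we last pushed out */
	unsigned long exported ____cacheline_aligned;	/* c. remote-read copy */
};

static DEFINE_PER_CPU(struct cpu_load_export, load_export);

#define EXPORT_DELTA	16	/* arbitrary "sufficient deviation" */

static void update_exported_load(unsigned long load)
{
	struct cpu_load_export *e = this_cpu_ptr(&load_export);

	e->cur = load;

	/* Only dirty the remote-read cache line when the change matters. */
	if (abs((long)(load - e->last_exported)) > EXPORT_DELTA) {
		e->last_exported = load;
		ACCESS_ONCE(e->exported) = load;
	}
}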
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
Maybe... That said, you do need to take the thrashing of the idle_cpus_mask into account. In fact, I am a bit surprised that maintaining idle_cpus_mask avoids heavy memory contention.
Hmmm... It is not clear to me that this is updated on each and every idle entry and exit. Are you sure that it means what you want it to mean?
Thanx, Paul
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex
On 17 December 2013 17:32, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:
On Fri, Dec 13, 2013 at 06:09:47PM +0800, Alex Shi wrote:
Hi Paul,
Would you like to give some comments when you'r free? Especially for the 3rd point.
Appreciate for your any input!
Hello, Alex,
I guess my main question is why we are discussing this internally rather than on LKML. Is this to be some writeup somewhere? (That would be a good thing.)
On to #3 below...
On 12/12/2013 02:19 PM, Alex Shi wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance. Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
Might be -- key thing is to measure it on a big system. If it is a problem, here are some workarounds with varying degrees of sanity:
Reduce cache misses by keeping three values: a. Current load value. b. Last exported value. (Should be in same cache line as a.) c. Exported value. (Should be in separate cache line.) The avoid writing to c unless a has deviated sufficiently from a. If the load values stay constant for long time periods, this can reduce the number of cache misses. On the other hand, it introduces an extra compare and branch -- though this should not be a problem if the load value is computed relatively infrequently.
As above, but provide additional information to allow the other CPU to extrapolate values. For example, if a CPU goes idle, some of its values will decay exponentially. The remote CPU could compute the decay rate, removing the need for the subject CPU to wake up to recompute its value. (Maybe you already do this?) Similarly if a CPU remains CPU-bound with given runqueue loading for some time.
Allow the exported values to become inaccurate, and resample the actual values remotely if extrapolated values indicate that action is warranted.
There are probably other approaches. I am being quite general here because I don't have the full picture of the scheduler statistics in my head. It is likely possible to obtain a much better approach by considering the scheduler's specifics.
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
Maybe... That said, you do need to take the thrashing of the idle_cpus_mask into account. In fact, I am a bit surprised that maintaining idle_cpus_mask avoids heavy memory contention.
Hmmm... It is not clear to me that this is updated on each and every idle entry and exit. Are you sure that it means what you want it to mean?
Hi Paul,
idle_cpus_mask is set as soon as a cpu enters idle (if not already set), but it is cleared during the next tick on that cpu, not as soon as the cpu exits idle.
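For reference, a simplified sketch of that behaviour, modelled on (but not copied from) nohz_balance_enter_idle()/nohz_balance_exit_idle() in kernel/sched/fair.c:

/* Idle entry: mark the cpu once. */
void nohz_enter_idle_sketch(int cpu)
{
	if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
		return;				/* already marked idle */

	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
	set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
}

/* The mark is only cleared from the tick path, not on idle exit itself. */
static void nohz_exit_idle_sketch(int cpu)
{
	if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))) {
		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
		clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
	}
}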
Regards, Vincent
Thanx, Paul
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex
On Tue, Dec 17, 2013 at 06:22:32PM +0100, Vincent Guittot wrote:
On 17 December 2013 17:32, Paul E. McKenney paulmck@linux.vnet.ibm.com wrote:
On Fri, Dec 13, 2013 at 06:09:47PM +0800, Alex Shi wrote:
Hi Paul,
Would you like to give some comments when you'r free? Especially for the 3rd point.
Appreciate for your any input!
Hello, Alex,
I guess my main question is why we are discussing this internally rather than on LKML. Is this to be some writeup somewhere? (That would be a good thing.)
On to #3 below...
On 12/12/2013 02:19 PM, Alex Shi wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance. Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
Might be -- key thing is to measure it on a big system. If it is a problem, here are some workarounds with varying degrees of sanity:
Reduce cache misses by keeping three values: a. Current load value. b. Last exported value. (Should be in same cache line as a.) c. Exported value. (Should be in separate cache line.) The avoid writing to c unless a has deviated sufficiently from a. If the load values stay constant for long time periods, this can reduce the number of cache misses. On the other hand, it introduces an extra compare and branch -- though this should not be a problem if the load value is computed relatively infrequently.
As above, but provide additional information to allow the other CPU to extrapolate values. For example, if a CPU goes idle, some of its values will decay exponentially. The remote CPU could compute the decay rate, removing the need for the subject CPU to wake up to recompute its value. (Maybe you already do this?) Similarly if a CPU remains CPU-bound with given runqueue loading for some time.
Allow the exported values to become inaccurate, and resample the actual values remotely if extrapolated values indicate that action is warranted.
There are probably other approaches. I am being quite general here because I don't have the full picture of the scheduler statistics in my head. It is likely possible to obtain a much better approach by considering the scheduler's specifics.
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
Maybe... That said, you do need to take the thrashing of the idle_cpus_mask into account. In fact, I am a bit surprised that maintaining idle_cpus_mask avoids heavy memory contention.
Hmmm... It is not clear to me that this is updated on each and every idle entry and exit. Are you sure that it means what you want it to mean?
Hi Paul,
idle_cpus_mask is set as soon as a cpu enter idle (if not alread set) but it will be cleared during the next tick on the cpu and not as soon as the cpu exit idle
OK, that makes more sense from a memory-contention viewpoint.
Thanx, Paul
On 12/18/2013 12:32 AM, Paul E. McKenney wrote:
On Fri, Dec 13, 2013 at 06:09:47PM +0800, Alex Shi wrote:
Hi Paul,
Would you like to give some comments when you'r free? Especially for the 3rd point.
Appreciate for your any input!
Hello, Alex,
I guess my main question is why we are discussing this internally rather than on LKML. Is this to be some writeup somewhere? (That would be a good thing.)
Thanks Paul! I really appreciate your great comments and suggestions!
I thought it would be better to collect more comments before sending this to Peter and LKML. But since the main idea hasn't changed much, and Paul has given very valuable suggestions, a rewrite/resend might bury Paul's wise comments. So I'd rather forward this email thread to LKML; sorry if the format doesn't look good.
On to #3 below...
On 12/12/2013 02:19 PM, Alex Shi wrote:
Paul & Vincent & Morten,
The following rough idea get during this KS. I want to have internal review before send to LKML. Would you like to give some comments?
========================== 1, Current scheduler load balance is bottom-up mode, each CPU need initiate the balance by self. Like in a integrate computer system, it has smt/core/cpu/numa, 4 level scheduler domains. If there is just 2 tasks in whole system that both running on cpu0. Current load balance need to pull task to another smt in smt domain, then pull task to another core, then pull task to another cpu, finally pull task to another numa. Totally it is need 4 times task moving to get system balance.
As Vincent pointed out, idle balance may help a little with this task moving. But it depends on the CPU topology and on which cpu happens to be idle at the time, so it cannot help much.
Generally, the task moving complexity is O(nm log n), n := nr_cpus, m := nr_tasks
PeterZ has a excellent summary and explanation for this in kernel/sched/fair.c:4605
Another weakness of current LB is that every cpu need to get the other cpus' load info repeatedly and try to figure out busiest sched group/queue on every sched domain level. but may not conduct a task moving, one of reasons is that cpu can only pull task, not pushing.
2, Consider huge cost of task moving: CS, tlb/cache refill, and the useless remote cpu load info getting. If we can have better solution for load balance, like reduce the balance times to. O(m) m := nr_tasks
It will be a great win on performance. like above example, we can move task from cpu0 direct to another numa. that only need 1 task moving, save 3 CS and tlb/cache refill.
To get this point, a rough idea is changing the load balance behaviour to top-down mode. Say let each of cpu report self load status on per-cpu memory. And a baby-sitter in system to collect these cpus load info, then decide how to move task centralize, finally send IPI to each hungry cpu to let them pull load quota from appointed cpus.
Like in above example, the baby-sitter will fetch each cpus' load info, then send a pull task IPI to let a cpu in another numa pull one task from cpu0. So in the task pulling, we still just involved 2 cpus, can reuse move_tasks functions.
BTW, the baby-sitter can care all kind of balance, regular balance, idle balance, wake up balance.
3, One of concern of top-down mode is that baby-sitter need remote
The new mode is more like a flat-mode balance. But if the cross-numa access is really unbearable, we could give each NUMA node its own baby-sitter, reduce the inter-node balance frequency, and do more frequent balancing inside each numa node. -- I hope that's not needed.
access cpu load info on top domain level every times. But the fact is current load balance also need to get remote cpu load info for top level domain balance. and more worse, such remote accessing maybe useless. -- since there is just one thread reading the info, no competitor writer, Paul, do you think it is worthy concern?
Might be -- key thing is to measure it on a big system. If it is a problem, here are some workarounds with varying degrees of sanity:
Reduce cache misses by keeping three values:
a. Current load value. b. Last exported value. (Should be in same cache line as a.) c. Exported value. (Should be in separate cache line.)
The avoid writing to c unless a has deviated sufficiently from a. If the load values stay constant for long time periods, this can reduce the number of cache misses. On the other hand, it introduces an extra compare and branch -- though this should not be a problem if the load value is computed relatively infrequently.
As above, but provide additional information to allow the other CPU to extrapolate values. For example, if a CPU goes idle, some of its values will decay exponentially. The remote CPU could compute the decay rate, removing the need for the subject CPU to wake up to recompute its value. (Maybe you already do this?)
The idle_cpus_mask should be one piece of additional info that allows such extrapolation. As Vincent explained, this variable is set when a CPU enters idle and cleared during the next tick, so we may not need to look at the cpu load of an idle cpu at all.
And yes, the current scheduler has similar behaviour.
Similarly if a CPU remains CPU-bound with given runqueue loading for some time.
- Allow the exported values to become inaccurate, and resample the actual values remotely if extrapolated values indicate that action is warranted.
It is a very interesting heuristic idea! Could you give a few more hints/clues on getting a remote cpu's load from an extrapolated value? I know RCU uses this approach wonderfully, but I still don't have much idea how to get a live cpu load this way...
There are probably other approaches. I am being quite general here because I don't have the full picture of the scheduler statistics in my head. It is likely possible to obtain a much better approach by considering the scheduler's specifics.
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
[..]
Thanx, Paul
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
On Mon, Jan 06, 2014 at 09:44:36PM +0800, Alex Shi wrote:
On 12/18/2013 12:32 AM, Paul E. McKenney wrote:
On Fri, Dec 13, 2013 at 06:09:47PM +0800, Alex Shi wrote:
[ . . . ]
- Allow the exported values to become inaccurate, and resample the actual values remotely if extrapolated values indicate that action is warranted.
It is a very heuristic idea! Could you give a bit more hints/clues to get remote cpu load by extrapolated value? I know RCU use this way wonderfully. but still no much idea to get live cpu load...
Well, as long as the CPU continues doing the same thing, for example, being idle or running a user-mode task, the extrapolation should be exact, right? The load value was X the last time the CPU changed state, and T time has passed since then, so you can calculate it exactly.
The exact method for detecting inaccuracies depends on how and where you are calculating the load values. If you are calculating them on each state change (as is done for some values for NO_HZ_FULL), then you simply need sufficient synchronization for getting a consistent snapshot of several values. One easy way to do this is via a per-CPU seqlock. The state-change code write-acquires the seqlock, while those doing extrapolation read-acquire it and retry if changes occur. This can have problems if too many values are required and if changes occur too fast, but such problems can be addressed should they occur.
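To make that concrete, here is a minimal sketch combining a per-CPU seqcount-protected snapshot with a toy extrapolation on the reader side (struct load_snapshot, publish_load() and remote_load() are hypothetical names; the decay is only indicative, the real per-entity load tracking has its own decay constants):

/* Per-CPU snapshot published at each state change. */
struct load_snapshot {
	seqcount_t	seq;
	u64		stamp;		/* time of the last state change */
	unsigned long	load;		/* load value at that time       */
	bool		idle;		/* what the CPU has been doing   */
};

static DEFINE_PER_CPU(struct load_snapshot, load_snap);

/* Writer side: the CPU changing state (e.g. entering idle). */
static void publish_load(unsigned long load, bool idle, u64 now)
{
	struct load_snapshot *s = this_cpu_ptr(&load_snap);

	write_seqcount_begin(&s->seq);
	s->stamp = now;
	s->load  = load;
	s->idle  = idle;
	write_seqcount_end(&s->seq);
}

/* Reader side: the baby-sitter extrapolating a remote CPU's load. */
static unsigned long remote_load(int cpu, u64 now)
{
	struct load_snapshot *s = &per_cpu(load_snap, cpu);
	unsigned long load;
	unsigned int seq;
	u64 stamp;
	bool idle;

	do {
		seq   = read_seqcount_begin(&s->seq);
		stamp = s->stamp;
		load  = s->load;
		idle  = s->idle;
	} while (read_seqcount_retry(&s->seq, seq));

	if (idle) {
		/* Toy exponential decay: halve the load per elapsed "period". */
		u64 periods = (now - stamp) >> 20;

		load = periods < BITS_PER_LONG ? load >> periods : 0;
	}

	return load;
}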
Does that help?
Thanx, Paul
There are probably other approaches. I am being quite general here because I don't have the full picture of the scheduler statistics in my head. It is likely possible to obtain a much better approach by considering the scheduler's specifics.
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
[..]
Thanx, Paul
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex
On 01/20/2014 11:04 AM, Paul E. McKenney wrote:
On Mon, Jan 06, 2014 at 09:44:36PM +0800, Alex Shi wrote:
On 12/18/2013 12:32 AM, Paul E. McKenney wrote:
On Fri, Dec 13, 2013 at 06:09:47PM +0800, Alex Shi wrote:
[ . . . ]
- Allow the exported values to become inaccurate, and resample the actual values remotely if extrapolated values indicate that action is warranted.
It is a very heuristic idea! Could you give a bit more hints/clues to get remote cpu load by extrapolated value? I know RCU use this way wonderfully. but still no much idea to get live cpu load...
Well, as long as the CPU continues doing the same thing, for example, being idle or running a user-mode task, the extrapolation should be exact, right? The load value was X the last time the CPU changed state, and T time has passed since then, so you can calculated it exactly.
It's a good idea that I had never thought of before. Thanks a lot!
The exact method for detecting inaccuracies depends on how and where you are calculating the load values. If you are calculating them on each state change (as is done for some values for NO_HZ_FULL), then you simply need sufficient synchronization for geting a consistent snapshot of several values. One easy way to do this is via a per-CPU seqlock. The state-change code write-acquires the seqlock, while those doing extrapolation read-acquire it and retry if changes occur. This can have problems if too many values are required and if changes occur too fast, but such problems can be addressed should they occur.
I thought about the seqlock, but it is clearly not scalable. Anyway, load balance doesn't need to be very accurate, so maybe atomic operations on the exported per-cpu load are acceptable for balancing.
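A tiny sketch of what I mean, with made-up names, assuming each CPU writes only its own slot and the reader tolerates slightly stale values:

/* Lock-free export: one atomic word per CPU. */
static DEFINE_PER_CPU(atomic_long_t, exported_load);

static inline void export_load(unsigned long load)
{
	atomic_long_set(this_cpu_ptr(&exported_load), load);
}

static inline unsigned long read_remote_load(int cpu)
{
	return atomic_long_read(&per_cpu(exported_load, cpu));
}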
Does that help?
Yes, very helpful! :)
Thanx, Paul
There are probably other approaches. I am being quite general here because I don't have the full picture of the scheduler statistics in my head. It is likely possible to obtain a much better approach by considering the scheduler's specifics.
BTW, to reduce unnecessary remote info fetching, we can use current idle_cpus_mask in nohz, we just skip the idle cpu in this cpumask simply.
[..]
Thanx, Paul
4, From power saving POV, top-down give the whole system cpu topology info directly. So beside the CS reducing, it can reduce the idle cpu interfere by a transition task. and let idle cpu sleep better.
-- Thanks Alex