Re: [RFC PATCH v2 3/6] sched: pack small tasks

17 Dec 2012


      On 16 December 2012 08:12, Alex Shi alex.shi@intel.com wrote:
...
On 12/14/2012 05:33 PM, Vincent Guittot wrote:
...
On 14 December 2012 02:46, Alex Shi alex.shi@intel.com wrote:
...
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
...
On 13 December 2012 15:53, Vincent Guittot vincent.guittot@linaro.org wrote:
...
On 13 December 2012 15:25, Alex Shi alex.shi@intel.com wrote:
...
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
> On 13 December 2012 03:17, Alex Shi alex.shi@intel.com wrote:
>> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>>> During the creation of sched_domain, we define a pack buddy CPU for each CPU
>>> when one is available. We want to pack at all levels where a group of CPUs can
>>> be power gated independently from others.
>>> On a system that can't power gate a group of CPUs independently, the flag is
>>> set at all sched_domain levels and the buddy is set to -1. This is the default
>>> behavior.
>>> On a dual-cluster / dual-core system which can power gate each core and
>>> cluster independently, the buddy configuration will be:
>>>
>>>       | Cluster 0   | Cluster 1   |
>>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>>> -----------------------------------
>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>>
>>> Small tasks tend to slip out of the periodic load balance, so the best place
>>> to migrate them is during their wake up. The decision is O(1) as
>>> we only check against one buddy CPU.
>>
>> I'm a little worried about scalability on a big machine. On a 4-socket
>> NUMA machine with 8 cores per socket and HT, the buddy CPU for the whole
>> system has to take care of 64 LCPUs, whereas in your case CPU0 only takes
>> care of 4 LCPUs. That makes a difference for the task distribution decision.
>
> The buddy CPU should probably not be the same for all 64 LCPUs; it
> depends on where it's worth packing small tasks.
Do you have further ideas for the buddy CPU in such an example?
Yes, I have several ideas which were not really relevant for small
systems but could be interesting for larger systems.
We keep the same algorithm within a socket, but instead of packing
directly onto one LCPU we could either use another LCPU in the targeted
socket (conf0) or chain the sockets (conf1).
The scheme below tries to summarize the idea:
Socket      | socket 0 | socket 1   | socket 2   | socket 3   |
LCPU        | 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 | 0    | 1  | 16    | 2  | 32    | 3  | 48    |
buddy conf1 | 0 | 0    | 0  | 16    | 16 | 32    | 32 | 48    |
buddy conf2 | 0 | 0    | 16 | 16    | 32 | 32    | 48 | 48    |
But I don't know how this would interact with NUMA load balancing, and
the better option might be to use conf3.
I mean conf2, not conf3.
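
To make the three configurations easier to compare, here is a minimal Python sketch that transcribes the scheme above into lookup tables and follows the buddy chain from a given LCPU. It is not code from the patch: `build_buddy_table`, `migrations_to_settle` and the constants are hypothetical names chosen for illustration, and the layout (4 sockets of 16 LCPUs) is taken from the table.

```python
# Sketch only: buddy tables transcribed from the scheme above for a
# 4-socket machine with 16 LCPUs per socket (64 LCPUs in total).
# All names here are hypothetical and do not come from the patch.
LCPUS_PER_SOCKET = 16
NR_SOCKETS = 4


def build_buddy_table(first_lcpu_buddies):
    """Return a list mapping each LCPU number to its buddy LCPU.

    first_lcpu_buddies[s] is the buddy of the first LCPU of socket s;
    every other LCPU of a socket uses that socket's first LCPU as buddy.
    """
    buddy = []
    for socket in range(NR_SOCKETS):
        first = socket * LCPUS_PER_SOCKET
        buddy.append(first_lcpu_buddies[socket])
        buddy.extend([first] * (LCPUS_PER_SOCKET - 1))
    return buddy


CONF0 = build_buddy_table([0, 1, 2, 3])     # another LCPU in socket 0
CONF1 = build_buddy_table([0, 0, 16, 32])   # sockets chained together
CONF2 = build_buddy_table([0, 16, 32, 48])  # packing stays inside a socket


def migrations_to_settle(buddy, lcpu):
    """Follow the buddy chain from lcpu until an LCPU that is its own buddy.

    Returns (number of hops, final LCPU). Assumes the table has no cycles
    other than self-loops, which holds for the three tables above.
    """
    hops = 0
    while buddy[lcpu] != lcpu:
        lcpu = buddy[lcpu]
        hops += 1
    return hops, lcpu


if __name__ == "__main__":
    for name, table in (("conf0", CONF0), ("conf1", CONF1), ("conf2", CONF2)):
        worst = max(migrations_to_settle(table, cpu)[0] for cpu in range(64))
        print(name, "longest chain:", worst, "hop(s)")
```

Each hop stands for one wake-up migration of a small task, so the chain length gives a rough idea of how many wake-ups it takes before the task settles on its final LCPU under each configuration.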
So, it has 4 levels 0/16/32/ for socket 3 and 0 levels for socket 0;
that is unbalanced across the different sockets.
That is the target, because we decided to pack the small tasks in
socket 0 when we parsed the topology at boot.
We don't have to loop over sched_domain or sched_group anymore to find
the best LCPU when a small task wakes up.
Iterating over domains and groups is an advantage for power efficiency,
not a shortcoming. If some CPUs are already idle before forking,
letting another waking CPU check their load/util and then decide which
CPU is best can reduce later migrations, which saves both performance
and power.
In fact, we have already done this job once at boot, and we consider
that moving small tasks to the buddy CPU is always beneficial, so we
don't need to waste time looping over sched_domain and sched_group to
compute the current capacity of each LCPU at each wake up of each small
task. We want all small tasks and background activity to wake up on the
same buddy CPU, and we let the default behavior of the scheduler choose
the best CPU for heavy tasks or loaded CPUs.
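
The wake-up decision described here can be sketched as a single table lookup with a fallback. This is only an illustration in Python, not the patch's implementation: the function name, the two thresholds and the `default_select` callback are all assumptions, and the threshold values are arbitrary placeholders.

```python
# Hypothetical sketch of an O(1) wake-up decision. None of these names or
# values come from the patch; the thresholds are arbitrary placeholders.
SMALL_TASK_UTIL = 0.20  # assumed: below this a task counts as "small"
LIGHT_CPU_UTIL = 0.50   # assumed: below this a buddy counts as lightly loaded


def select_wake_cpu(task_util, prev_cpu, buddy, cpu_util, default_select):
    """Pick the CPU on which a waking task should run.

    buddy[cpu] is the pack buddy computed once at boot, or -1 when packing
    is disabled for that CPU. Only one buddy is inspected, so the cost does
    not depend on the number of CPUs. Anything that is not a small task on
    a lightly loaded buddy falls back to default_select, which stands in
    for the scheduler's normal CPU selection.
    """
    target = buddy[prev_cpu]
    if (target != -1
            and task_util < SMALL_TASK_UTIL
            and cpu_util[target] < LIGHT_CPU_UTIL):
        return target
    return default_select(prev_cpu)


if __name__ == "__main__":
    buddy = [0, 0, 0, 2]             # dual-cluster / dual-core example
    cpu_util = [0.10, 0.0, 0.0, 0.0]
    stay = lambda cpu: cpu           # trivial stand-in for the default path
    print(select_wake_cpu(0.05, 3, buddy, cpu_util, stay))  # small task
    print(select_wake_cpu(0.60, 3, buddy, cpu_util, stay))  # heavy task
```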
...
On the contrary, moving a task by walking through the buddies at each
level is bad not only for performance but also for power. Consider the
quite large latency of waking a CPU from deep idle: we lose too much.
My results have shown a different conclusion.
In fact, there is a much better chance that the buddy will not be in
deep idle, as all the small tasks and background activity are already
waking up on this CPU.
...
...
...
And the ground level has just one buddy for 16 LCPUs - 8 cores; that's
not a good design. Consider my previous examples: if there are 4 or 8
tasks in one socket, you have just 2 choices: spread them across all
cores, or pack them onto one LCPU. Actually, moving them onto just 2 or
4 cores may be a better solution, but the design misses this.
You speak about tasks without any notion of load. This patch only cares
about small tasks and light LCPU load, and it falls back to the default
behavior in other situations. So if there are 4 or 8 small tasks, they
will migrate to socket 0 after 1 or up to 3 migrations (it depends
on the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the
utilization of the CPU, not the load weight of tasks, right?
Yes, but not only. The number of tasks that run simultaneously is
another important input.
...
Yes, I just talked about task numbers, but it naturally extends to
task utilization on the CPU. Take 8 tasks with 25% util: they can just
fill 2 CPUs, but that is clearly beyond the capacity of the buddy, so
you need to wake up another CPU socket while the local socket has some
idle LCPUs...
8 tasks with a running period of 25ms per 100ms that wake up
simultaneously should probably run on 8 different LCPUs in order to
race to idle.
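
The arithmetic behind this example can be checked with a small Python sketch. It assumes a deliberately simplified model (tasks spread evenly, run back to back on each LCPU, no scheduler overhead); `total_demand_cpus` and `busy_window_ms` are hypothetical helper names, not anything from the patch.

```python
import math


def total_demand_cpus(n_tasks, run_ms, period_ms):
    """Aggregate CPU demand, in units of fully busy CPUs."""
    return n_tasks * run_ms / period_ms


def busy_window_ms(n_tasks, run_ms, n_cpus):
    """Time until every CPU is idle again after a simultaneous wake-up.

    Simplified model (an assumption of this sketch): tasks are spread
    evenly over n_cpus and run back to back on each CPU, with no
    scheduling overhead.
    """
    tasks_per_cpu = math.ceil(n_tasks / n_cpus)
    return tasks_per_cpu * run_ms


if __name__ == "__main__":
    # 8 tasks, each running 25ms out of every 100ms
    print("demand in CPUs:", total_demand_cpus(8, 25, 100))
    for n_cpus in (1, 2, 8):
        print(n_cpus, "LCPU(s): busy for", busy_window_ms(8, 25, n_cpus), "ms")
```

Under this model the 8 tasks add up to 2 CPUs' worth of work: a single buddy cannot finish them within the 100ms period, 2 LCPUs stay busy for the whole period, and 8 LCPUs are all idle again after 25ms.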
Regards,
Vincent
...
...
Then, if too many small tasks wake up simultaneously on the same LCPU,
the default load balance will spread them across the core/cluster/socket.
...
Obviously, more and more cores is the trend for all kinds of CPUs; the
buddy system seems hard-pressed to keep up with this.
--
Thanks
    Alex
