Re: [PATCH v5] topology: make core_mask include at least cluster_siblings

16 Sep 2022

On Fri, Sep 16, 2022 at 03:59:34PM +0800, Yicong Yang wrote:
...
On 2022/9/16 1:56, Darren Hart wrote:
...
On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote:
...
Hi Darren,
Hi Yicong,
...
...
...

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..5497c5ab7318 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 			core_mask = &cpu_topology[cpu].llc_sibling;
 	}

+	/*
+	 * For systems with no shared cpu-side LLC but with clusters defined,
+	 * extend core_mask to cluster_siblings. The sched domain builder will
+	 * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
+	 */
+	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
+	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
+		core_mask = &cpu_topology[cpu].cluster_sibling;
+
 	return core_mask;
 }
Is this patch still necessary for Ampere after Ionela's patch [1], which
will limit the cluster's span to within the coregroup's span?
Yes, see:
https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/
Both patches work together to accomplish the desired sched domains for the
Ampere Altra family.
Thanks for the link. From my understanding, on the Altra machine we'll get
the following results:
with your patch alone:
Scheduler will get a weight of 2 for both CLS and MC level and finally the
MC domain will be squashed. The lowest domain will be CLS.
with both your patch and Ionela's:
CLS will have a weight of 1 and MC will have a weight of 2. CLS won't be
built and the lowest domain will be MC.
with Ionela's patch alone:
Both CLS and MC will have a weight of 1, which is incorrect.
So your patch is still necessary for Ampere Altra. Then we need to limit the
MC span to the DIE/NODE span, according to the scheduler's definition of the
topology levels, to fix the issue below. Maybe something like this:
That seems reasonable.
What isn't clear to me is why qemu is creating a cluster layer with the
description you provide. Why is cluster_siblings being populated?
...
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 46cbe4471e78..8ebaba576836 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -713,6 +713,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
 		core_mask = &cpu_topology[cpu].cluster_sibling;

+	if (cpumask_subset(cpu_cpu_mask(cpu), core_mask))
+		core_mask = cpu_cpu_mask(cpu);
+
 	return core_mask;
 }
...
...
I found an issue: the NUMA domains are not built on qemu with:
qemu-system-aarch64 \
        -kernel ${Image} \
        -smp 8 \
        -cpu cortex-a72 \
        -m 32G \
        -object memory-backend-ram,id=node0,size=8G \
        -object memory-backend-ram,id=node1,size=8G \
        -object memory-backend-ram,id=node2,size=8G \
        -object memory-backend-ram,id=node3,size=8G \
        -numa node,memdev=node0,cpus=0-1,nodeid=0 \
        -numa node,memdev=node1,cpus=2-3,nodeid=1 \
        -numa node,memdev=node2,cpus=4-5,nodeid=2 \
        -numa node,memdev=node3,cpus=6-7,nodeid=3 \
        -numa dist,src=0,dst=1,val=12 \
        -numa dist,src=0,dst=2,val=20 \
        -numa dist,src=0,dst=3,val=22 \
        -numa dist,src=1,dst=2,val=22 \
        -numa dist,src=1,dst=3,val=24 \
        -numa dist,src=2,dst=3,val=12 \
        -machine virt,iommu=smmuv3 \
        -net none \
        -initrd ${Rootfs} \
        -nographic \
        -bios QEMU_EFI.fd \
        -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
I can see the sched domain build stops at MC level since we reach all the
CPUs in the system:
[    2.141316] CPU0 attaching sched-domain(s):
[    2.142558]  domain-0: span=0-7 level=MC
[    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
[    2.158357] CPU1 attaching sched-domain(s):
[    2.158964]  domain-0: span=0-7 level=MC
[...]
Without this the NUMA domains are built correctly:
Without which? My patch, Ionela's patch, or both?
Reverting only your patch gives the result below; sorry for the ambiguity. Before
reverting, for CPU 0, MC should span 0-1, but with your patch it's extended to 0-7 and
the sched domain build stops at MC level because it has already reached all the CPUs.
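
To make the effect concrete, here is a minimal userspace sketch of the last two steps of cpu_coregroup_mask() as they appear in the diffs in this thread. It is a model, not kernel code: masks are plain bitmasks, mask_t, mask_subset() and coregroup_tail() are names made up for illustration, and CONFIG_SCHED_CLUSTER is assumed to be enabled. The cluster_sibling value of 0-7 used in the note below is an assumption inferred from the reported MC span; the thread does not say where it comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU: bit n set means CPU n is in the mask (up to 8 CPUs). */
typedef uint8_t mask_t;

/* True if every CPU in a is also in b, like cpumask_subset(a, b). */
static bool mask_subset(mask_t a, mask_t b)
{
	return (a & ~b) == 0;
}

/*
 * Hypothetical model of the tail of cpu_coregroup_mask().
 *   core_mask:       MC span before the cluster extension
 *   cluster_sibling: cluster span (ASSUMED input, see lead-in)
 *   node_mask:       stands in for cpu_cpu_mask(cpu)
 *   clamp:           apply the proposed limit to the node span
 */
static mask_t coregroup_tail(mask_t core_mask, mask_t cluster_sibling,
			     mask_t node_mask, bool clamp)
{
	/* A CPU is always a member of its own core mask. */
	assert(core_mask != 0);

	/* Patch under discussion: extend MC to the cluster siblings. */
	if (mask_subset(core_mask, cluster_sibling))
		core_mask = cluster_sibling;

	/* Proposed follow-up: do not let MC grow past the node span. */
	if (clamp && mask_subset(node_mask, core_mask))
		core_mask = node_mask;

	return core_mask;
}
```

For CPU 0 in the qemu guest, coregroup_tail(0x03, 0xff, 0x03, false) models the current behaviour and yields 0xff (MC spans 0-7), while passing true for clamp yields 0x03 (MC spans 0-1 again), which would leave room for the NUMA levels above it.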
...
...
[    2.008885] CPU0 attaching sched-domain(s):
[    2.009764]  domain-0: span=0-1 level=MC
[    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
[    2.016532]   domain-1: span=0-3 level=NUMA
[    2.017444]    groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
[    2.019354]    domain-2: span=0-5 level=NUMA
I'm not following this topology - what in the description above should result in
a domain with span=0-5?
It emulates a 3-hop NUMA machine and the NUMA domains will be built according to the
NUMA distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
So for CPU 0 the NUMA domains will look like:
NUMA domain 0 for local nodes (squashed to MC domain), CPU 0-1
NUMA domain 1 for nodes within distance 12, CPU 0-3
NUMA domain 2 for nodes within distance 20, CPU 0-5
NUMA domain 3 for all the nodes, CPU 0-7
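
The spans listed above can be reproduced from the distance table with a small userspace sketch. This is a simplified model rather than the kernel's NUMA domain builder: it only collects, for one node, the CPUs of every node within a given distance. The helper names node_cpus() and span_within() are made up for illustration, and the two-CPUs-per-node layout is taken from the -numa node options in the qemu command line.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES 4

/* NUMA distance table as given in this thread. */
static const int numa_dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

/* Node n owns CPUs 2n and 2n+1 (cpus=0-1, 2-3, 4-5, 6-7). */
static uint8_t node_cpus(int node)
{
	assert(node >= 0 && node < NR_NODES);
	return (uint8_t)(0x3u << (2 * node));
}

/* Bitmask of the CPUs on all nodes within max_dist of the given node. */
static uint8_t span_within(int node, int max_dist)
{
	uint8_t span = 0;

	assert(node >= 0 && node < NR_NODES);
	for (int other = 0; other < NR_NODES; other++) {
		if (numa_dist[node][other] <= max_dist)
			span |= node_cpus(other);
	}
	return span;
}
```

For node 0, span_within(0, 10) gives 0x03 (CPU 0-1), span_within(0, 12) gives 0x0f (CPU 0-3), span_within(0, 20) gives 0x3f (CPU 0-5) and span_within(0, 22) gives 0xff (CPU 0-7), matching the four levels listed for CPU 0.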
Right, thanks for the explanation.
So the bit that remains unclear to me is why cluster_siblings is being
populated. Which part of your qemu topology description becomes the CLS layer
during sched domain construction?
-- 
Darren Hart
Ampere Computing / OS and Kernel

    
