SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken
     the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS:
 CLS  [0-1]
 DIE  [0-79]
 NUMA [0-159]

Without CLS:
 DIE  [0-79]
 NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com
Cc: Will Deacon will@kernel.org
Cc: Peter Zijlstra peterz@infradead.org
Cc: Vincent Guittot vincent.guittot@linaro.org
Cc: Barry Song song.bao.hua@hisilicon.com
Cc: Valentin Schneider valentin.schneider@arm.com
Cc: D. Scott Phillips scott@os.amperecomputing.com
Cc: Ilkka Koskinen ilkka@os.amperecomputing.com
Cc: stable@vger.kernel.org # 5.16.x
Signed-off-by: Darren Hart darren@os.amperecomputing.com
---
 arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 27df5c1e6baa..0a78ac5c8830 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
 	}
 }
 
+static struct sched_domain_topology_level arm64_no_mc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
 	const struct cpu_operations *ops;
+	bool use_no_mc_topology = true;
 	int err;
 	unsigned int cpu;
 	unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 
 		set_cpu_present(cpu, true);
 		numa_store_cpu_info(cpu);
+
+		/*
+		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
+		 */
+		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
+			use_no_mc_topology = false;
 	}
 
+	/*
+	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
+	 * weights=1. If they also define clusters with cpu_clustergroup_mask
+	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
+	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
+	 * to match CLS, and later discard the MC level. Avoid the bug by using
+	 * a topology without the MC if the cpu_coregroup_mask weights=1.
+	 */
+	if (use_no_mc_topology) {
+		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
+		set_sched_topology(arm64_no_mc_topology);
+	}
 }
-----Original Message-----
From: Darren Hart [mailto:darren@os.amperecomputing.com]
Sent: Friday, February 11, 2022 2:43 PM
To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org
Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D. Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org
Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darren,

What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 27df5c1e6baa..0a78ac5c8830 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
 	}
 }
 
+static struct sched_domain_topology_level arm64_no_mc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
 	const struct cpu_operations *ops;
+	bool use_no_mc_topology = true;
 	int err;
 	unsigned int cpu;
 	unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 		set_cpu_present(cpu, true);
 		numa_store_cpu_info(cpu);
+
+		/*
+		 * Only use no_mc topology if all cpu_coregroup_mask weights=1
+		 */
+		if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
+			use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, where cpu0 has cpu_coregroup_mask weight == 1 and cpu1-4 have cpu_coregroup_mask weight == 4, then for cpu0 you still need to remove MC, but for cpu1-4 you will need both CLS and MC?
This flag shouldn't be global.
 	}
 
+	/*
+	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
+	 * weights=1. If they also define clusters with cpu_clustergroup_mask
+	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
+	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
+	 * to match CLS, and later discard the MC level. Avoid the bug by using
+	 * a topology without the MC if the cpu_coregroup_mask weights=1.
+	 */
+	if (use_no_mc_topology) {
+		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
+		set_sched_topology(arm64_no_mc_topology);
+	}
 }
-- 2.31.1
Thanks Barry
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent,
Hi Barry, thanks for the review.
What kind of resources are clusters sharing on Ampere Altra?
The cluster pairs are DSU pairs (ARM DynamIQ Shared Unit). While there is no shared L3 cache, they do share an SCU (snoop control unit) and have a cache coherency latency advantage relative to non-DSU pairs.
The Anandtech Altra review illustrates this advantage: https://www.anandtech.com/show/16315/the-ampere-altra-review/3
Notably, the SCHED_CLUSTER change did result in marked improvements for some interactive workloads.
So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
Correct. On the processor side the last level is at each cpu, and then there is a memory side SLC (system level cache) that is shared by all cpus.
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
- { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
- bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
This flag shouldn't be global.
Please note that this patch is intended to maintain an identical final sched domain construction for a symmetric topology with no shared processor-side cache and with cache-advantaged clusters, and to avoid the BUG messages, since this topology is correct for this architecture.
By using a sched topology without the MC layer, this more accurately describes this architecture and does not require changes to build_sched_domain(), in particular changes to the assumptions about what a valid topology is.
The test above checks every cpu coregroup weight in order to limit the impact of this change to this specific kind of topology. It intentionally does not address, nor change existing behavior for, the asymmetrical topology you describe.
I felt this was the less invasive approach and consistent with how other architectures handled "non-default" topologies.
If there is interest in working toward a more generic topology builder, I'd be interested in working on that too, but I think this change makes sense in the near term.
Thanks,
 	}
 
+	/*
+	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
+	 * weights=1. If they also define clusters with cpu_clustergroup_mask
+	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
+	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
+	 * to match CLS, and later discard the MC level. Avoid the bug by using
+	 * a topology without the MC if the cpu_coregroup_mask weights=1.
+	 */
+	if (use_no_mc_topology) {
+		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
+		set_sched_topology(arm64_no_mc_topology);
+	}
 }
-- 2.31.1
Thanks Barry
On Tue, Feb 15, 2022 at 12:17 AM Darren Hart darren@os.amperecomputing.com wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent,
Hi Barry, thanks for the review.
What kind of resources are clusters sharing on Ampere Altra?
The cluster pairs are DSU pairs (ARM DynamIQ Shared Unit). While there is no shared L3 cache, they do share an SCU (snoop control unit) and have a cache coherency latency advantage relative to non-DSU pairs.
The Anandtech Altra review illustrates this advantage: https://www.anandtech.com/show/16315/the-ampere-altra-review/3
Notably, the SCHED_CLUSTER change did result in marked improvements for some interactive workloads.
Thanks. There is a wake_affine patchset; I also wonder if your device could benefit from it: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/
So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
Correct. On the processor side the last level is at each cpu, and then there is a memory side SLC (system level cache) that is shared by all cpus.
Thanks.
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
- { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
- bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
This flag shouldn't be global.
Please note that this patch is intended to maintain an identical final sched domain construction for a symmetric topology with no shared processor-side cache and with cache advantaged clusters and avoid the BUG messages since this topology is correct for this architecture.
By using a sched topology without the MC layer, this more accurately describes this architecture and does not require changes to build_sched_domain(), in particular changes to the assumptions about what a valid topology is.
The test above checks every cpu coregroup weight in order to limit the impact of this change to this specific kind of topology. It intentionally does not address, nor change existing behavior for, the asymmetrical topology you describe.
I felt this was the less invasive approach and consistent with how other architectures handled "non-default" topologies.
If there is interest in working toward a more generic topology builder, I'd be interested in working on that too, but I think this change makes sense in the near term.
I do agree this patch makes sense for a symmetric topology, but an asymmetrical topology is still broken. It might be better to be more generic. We had a similar fix over here for SMT before: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
Thanks,
 	}
 
+	/*
+	 * SoCs with no shared processor-side cache will have cpu_coregroup_mask
+	 * weights=1. If they also define clusters with cpu_clustergroup_mask
+	 * weights > 1, build_sched_domain() will trigger a BUG as the CLS
+	 * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
+	 * to match CLS, and later discard the MC level. Avoid the bug by using
+	 * a topology without the MC if the cpu_coregroup_mask weights=1.
+	 */
+	if (use_no_mc_topology) {
+		pr_info("cpu_coregroup_mask weights=1, skipping MC topology level");
+		set_sched_topology(arm64_no_mc_topology);
+	}
 }
-- 2.31.1
Thanks Barry
-- Darren Hart Ampere Computing / OS and Kernel
Thanks Barry
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
- { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
- bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
What is the *current* behaviour on such a system?
Will
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
- { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
- bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology and has a cpu_coregroup_mask weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above, printing the following for each CPU matching these criteria:
BUG: arch topology borken the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
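For reference, the message and the fixup described here come from build_sched_domain() in kernel/sched/topology.c. A trimmed excerpt, roughly as of v5.16 and shown only for context (not part of this patch), looks like:

	if (child) {
		sd->level = child->level + 1;
		sched_domain_level_max = max(sched_domain_level_max, sd->level);
		child->parent = sd;

		if (!cpumask_subset(sched_domain_span(child),
				    sched_domain_span(sd))) {
			pr_err("BUG: arch topology borken\n");
#ifdef CONFIG_SCHED_DEBUG
			pr_err("     the %s domain not a subset of the %s domain\n",
					child->name, sd->name);
#endif
			/* Fixup, ensure @sd has at least @child CPUs. */
			cpumask_or(sched_domain_span(sd),
				   sched_domain_span(sd),
				   sched_domain_span(child));
		}
	}

On Altra the child is CLS [0-1] and the parent MC spans a single CPU, so the subset check fails and the MC span is widened to the CLS span before the level is later dropped as redundant.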
That would still be the behavior for this type of system after this patch is applied.
Thanks,
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
- { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
- bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology which has a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above by printing the following for each CPU matching this criteria:
BUG: arch topology borken the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
Barry -- why shouldn't we take this as-is?
Will
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology which has a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above by printing the following for each CPU matching this criteria:
BUG: arch topology borken the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as an intermediate group of the MC level, and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES.
In the case of Ampere Altra, they consider that CPUA has a CLUSTER level which shares SD_SHARE_PKG_RESOURCES with another CPUB, but the next and larger MC level then says that CPUA doesn't share SD_SHARE_PKG_RESOURCES with CPUB, which seems odd because the SD_SHARE_PKG_RESOURCES has not disappeared. Looks like there is a mismatch in the topology description.
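For context, the default topology and the flag helpers referenced here look roughly like the trimmed excerpt below in v5.16 (kernel/sched/topology.c and include/linux/sched/topology.h); both cpu_cluster_flags() and cpu_core_flags() return SD_SHARE_PKG_RESOURCES, which is why a CLS level wider than its MC parent reads as contradictory:

#ifdef CONFIG_SCHED_CLUSTER
static inline int cpu_cluster_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}
#endif

#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}
#endif

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

On Altra, cpu_coregroup_mask() spans a single CPU while cpu_clustergroup_mask() spans two, so the CLS entry ends up wider than the MC entry above it.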
Barry -- why shouldn't we take this as-is?
Will
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
-----Original Message----- From: Darren Hart [mailto:darren@os.amperecomputing.com] Sent: Friday, February 11, 2022 2:43 PM To: LKML linux-kernel@vger.kernel.org; Linux Arm linux-arm-kernel@lists.infradead.org Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; Peter Zijlstra peterz@infradead.org; Vincent Guittot vincent.guittot@linaro.org; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com; Valentin Schneider valentin.schneider@arm.com; D . Scott Phillips scott@os.amperecomputing.com; Ilkka Koskinen ilkka@os.amperecomputing.com; stable@vger.kernel.org Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
SoCs such as the Ampere Altra define clusters but have no shared processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
BUG: arch topology borken the CLS domain not a subset of the MC domain
for each CPU (160 times for a 2 socket 80 core Altra system). The MC level cpu mask is then extended to that of the CLS child, and is later removed entirely as redundant.
This change detects when all cpu_coregroup_mask weights=1 and uses an alternative sched_domain_topology equivalent to the default if CONFIG_SCHED_MC were disabled.
The final resulting sched domain topology is unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
With CLS: CLS [0-1] DIE [0-79] NUMA [0-159]
Without CLS: DIE [0-79] NUMA [0-159]
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Barry Song song.bao.hua@hisilicon.com Cc: Valentin Schneider valentin.schneider@arm.com Cc: D. Scott Phillips scott@os.amperecomputing.com Cc: Ilkka Koskinen ilkka@os.amperecomputing.com Cc: stable@vger.kernel.org # 5.16.x Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 27df5c1e6baa..0a78ac5c8830 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) } }
+static struct sched_domain_topology_level arm64_no_mc_topology[] = { +#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
+};
void __init smp_prepare_cpus(unsigned int max_cpus) { const struct cpu_operations *ops;
bool use_no_mc_topology = true; int err; unsigned int cpu; unsigned int this_cpu;
@@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_present(cpu, true); numa_store_cpu_info(cpu);
/*
* Only use no_mc topology if all cpu_coregroup_mask weights=1
*/
if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology which has a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above by printing the following for each CPU matching this criteria:
BUG: arch topology borken the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as a intermediate group of the MC level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES flag
In the case of Ampere altra, they consider that CPUA have a CLUSTER level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has not disappeared Looks like there is a mismatch in topology description
Hi Vincent,
Agree. Where do you think this mismatch exists?
I'd describe this as a mismatch between the default assumptions of the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies a shared cpu-side cache) and SoCs without a shared cpu-side cache. This is encoded in properties of the MC level and the requirement that child domains be a subset of the parent domain cpumask.
The MC-less topology addresses this in a consistent way with other architectures using the provided API for non-default topologies without changing these fundamental assumptions and without changing the final resulting topology which is correct and matches the topology supplied in the ACPI PPTT.
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
> -----Original Message----- > From: Darren Hart [mailto:darren@os.amperecomputing.com] > Sent: Friday, February 11, 2022 2:43 PM > To: LKML linux-kernel@vger.kernel.org; Linux Arm > linux-arm-kernel@lists.infradead.org > Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; > Peter Zijlstra peterz@infradead.org; Vincent Guittot > vincent.guittot@linaro.org; Song Bao Hua (Barry Song) > song.bao.hua@hisilicon.com; Valentin Schneider > valentin.schneider@arm.com; D . Scott Phillips > scott@os.amperecomputing.com; Ilkka Koskinen > ilkka@os.amperecomputing.com; stable@vger.kernel.org > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache > > SoCs such as the Ampere Altra define clusters but have no shared > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and > CONFIG_SCHED_MC, build_sched_domain() will BUG() with: > > BUG: arch topology borken > the CLS domain not a subset of the MC domain > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC > level cpu mask is then extended to that of the CLS child, and is later > removed entirely as redundant. > > This change detects when all cpu_coregroup_mask weights=1 and uses an > alternative sched_domain_topology equivalent to the default if > CONFIG_SCHED_MC were disabled. > > The final resulting sched domain topology is unchanged with or without > CONFIG_SCHED_CLUSTER, and the BUG is avoided: > > For CPU0: > > With CLS: > CLS [0-1] > DIE [0-79] > NUMA [0-159] > > Without CLS: > DIE [0-79] > NUMA [0-159] > > Cc: Catalin Marinas catalin.marinas@arm.com > Cc: Will Deacon will@kernel.org > Cc: Peter Zijlstra peterz@infradead.org > Cc: Vincent Guittot vincent.guittot@linaro.org > Cc: Barry Song song.bao.hua@hisilicon.com > Cc: Valentin Schneider valentin.schneider@arm.com > Cc: D. Scott Phillips scott@os.amperecomputing.com > Cc: Ilkka Koskinen ilkka@os.amperecomputing.com > Cc: stable@vger.kernel.org # 5.16.x > Signed-off-by: Darren Hart darren@os.amperecomputing.com
Hi Darrent, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
> --- > arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ > 1 file changed, 32 insertions(+) > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index 27df5c1e6baa..0a78ac5c8830 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) > } > } > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = { > +#ifdef CONFIG_SCHED_SMT > + { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) }, > +#endif > + > +#ifdef CONFIG_SCHED_CLUSTER > + { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) }, > +#endif > + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > + { NULL, }, > +}; > + > void __init smp_prepare_cpus(unsigned int max_cpus) > { > const struct cpu_operations *ops; > + bool use_no_mc_topology = true; > int err; > unsigned int cpu; > unsigned int this_cpu; > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus) > > set_cpu_present(cpu, true); > numa_store_cpu_info(cpu); > + > + /* > + * Only use no_mc topology if all cpu_coregroup_mask weights=1 > + */ > + if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1) > + use_no_mc_topology = false;
This seems to be wrong? If you have 5 cpus, Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 has cpu_coregroup_mask(cpu)== 4, for cpu0, you still need to remove MC, but for cpu1-4, you will need CLS and MC both?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology which has a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above by printing the following for each CPU matching this criteria:
BUG: arch topology borken the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as a intermediate group of the MC level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES flag
In the case of Ampere altra, they consider that CPUA have a CLUSTER level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has not disappeared Looks like there is a mismatch in topology description
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from that the default topology order is assumed to be:
 SMT
 CLUSTER  shares pkg resources, i.e. cache
 MC
 DIE
 NUMA

but in your case, you want a topology order like:
 SMT
 MC
 CLUSTER  shares SCU
 DIE
 NUMA
IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in the PPTT table, whereas the MC level is defined by the number of cache levels. So I would say that you should compare the levels to know the ordering.
Then, there is another point: in your case, the CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES, which is used to define scheduler internal variables like sd_llc (sched domain last level of cache), which allows fast task migration between the cpus at this level at wakeup. In your case the sd_llc should not be the cluster but the MC with only one CPU. But I would not be surprised if most of the perf improvement comes from this sd_llc wrongly set to the cluster instead of the single CPU.
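For reference, sd_llc is derived from the highest domain carrying SD_SHARE_PKG_RESOURCES, roughly as in this trimmed excerpt of update_top_cache_domain() in kernel/sched/topology.c (approximate v5.16 form, shown only for context): with the single-CPU MC degenerated away, the CLS domain becomes sd_llc, which is the effect suspected above of providing much of the improvement.

	struct sched_domain *sd;
	int id = cpu, size = 1;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
	}

	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;
	per_cpu(sd_llc_id, cpu) = id;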
I'd describe this as a mismatch between the default assumptions of the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies a shared cpu-side cache) and SoCs without a shared cpu-side cache. This is encoded in properties of the MC level and the requirement that child domains be a subset of the parent domain cpumask.
The MC-less topology addresses this in a consistent way with other architectures using the provided API for non-default topologies without changing these fundamental assumptions and without changing the final resulting topology which is correct and matches the topology supplied in the ACPI PPTT.
-- Darren Hart Ampere Computing / OS and Kernel
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Decon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote: > > > > -----Original Message----- > > From: Darren Hart [mailto:darren@os.amperecomputing.com] > > Sent: Friday, February 11, 2022 2:43 PM > > To: LKML linux-kernel@vger.kernel.org; Linux Arm > > linux-arm-kernel@lists.infradead.org > > Cc: Catalin Marinas catalin.marinas@arm.com; Will Deacon will@kernel.org; > > Peter Zijlstra peterz@infradead.org; Vincent Guittot > > vincent.guittot@linaro.org; Song Bao Hua (Barry Song) > > song.bao.hua@hisilicon.com; Valentin Schneider > > valentin.schneider@arm.com; D . Scott Phillips > > scott@os.amperecomputing.com; Ilkka Koskinen > > ilkka@os.amperecomputing.com; stable@vger.kernel.org > > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache > > > > SoCs such as the Ampere Altra define clusters but have no shared > > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and > > CONFIG_SCHED_MC, build_sched_domain() will BUG() with: > > > > BUG: arch topology borken > > the CLS domain not a subset of the MC domain > > > > for each CPU (160 times for a 2 socket 80 core Altra system). The MC > > level cpu mask is then extended to that of the CLS child, and is later > > removed entirely as redundant. > > > > This change detects when all cpu_coregroup_mask weights=1 and uses an > > alternative sched_domain_topology equivalent to the default if > > CONFIG_SCHED_MC were disabled. > > > > The final resulting sched domain topology is unchanged with or without > > CONFIG_SCHED_CLUSTER, and the BUG is avoided: > > > > For CPU0: > > > > With CLS: > > CLS [0-1] > > DIE [0-79] > > NUMA [0-159] > > > > Without CLS: > > DIE [0-79] > > NUMA [0-159] > > > > Cc: Catalin Marinas catalin.marinas@arm.com > > Cc: Will Deacon will@kernel.org > > Cc: Peter Zijlstra peterz@infradead.org > > Cc: Vincent Guittot vincent.guittot@linaro.org > > Cc: Barry Song song.bao.hua@hisilicon.com > > Cc: Valentin Schneider valentin.schneider@arm.com > > Cc: D. Scott Phillips scott@os.amperecomputing.com > > Cc: Ilkka Koskinen ilkka@os.amperecomputing.com > > Cc: stable@vger.kernel.org # 5.16.x > > Signed-off-by: Darren Hart darren@os.amperecomputing.com > > Hi Darrent, > What kind of resources are clusters sharing on Ampere Altra? > So on Altra, cpus are not sharing LLC? Each LLC is separate > for each cpu? 
> > > --- > > arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++ > > 1 file changed, 32 insertions(+) > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > index 27df5c1e6baa..0a78ac5c8830 100644 > > --- a/arch/arm64/kernel/smp.c > > +++ b/arch/arm64/kernel/smp.c > > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void) > > } > > } > > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = { > > +#ifdef CONFIG_SCHED_SMT > > + { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) }, > > +#endif > > + > > +#ifdef CONFIG_SCHED_CLUSTER > > + { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) }, > > +#endif > > + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > > + { NULL, }, > > +}; > > + > > void __init smp_prepare_cpus(unsigned int max_cpus) > > { > > const struct cpu_operations *ops; > > + bool use_no_mc_topology = true; > > int err; > > unsigned int cpu; > > unsigned int this_cpu; > > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus) > > > > set_cpu_present(cpu, true); > > numa_store_cpu_info(cpu); > > + > > + /* > > + * Only use no_mc topology if all cpu_coregroup_mask weights=1 > > + */ > > + if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1) > > + use_no_mc_topology = false; > > This seems to be wrong? If you have 5 cpus, > Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4 > has cpu_coregroup_mask(cpu)== 4, for cpu0, you still > need to remove MC, but for cpu1-4, you will need > CLS and MC both?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology which has a cpus_coregroup weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above by printing the following for each CPU matching this criteria:
BUG: arch topology borken the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as a intermediate group of the MC level and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES flag
In the case of Ampere altra, they consider that CPUA have a CLUSTER level which SD_SHARE_PKG_RESOURCES with another CPUB but the next and larger MC level then says that CPUA doesn't SD_SHARE_PKG_RESOURCES with CPUB which seems to be odd because the SD_SHARE_PKG_RESOURCES has not disappeared Looks like there is a mismatch in topology description
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from that the default topology order is assumed to be:
 SMT
 CLUSTER  shares pkg resources, i.e. cache
 MC
 DIE
 NUMA

but in your case, you want a topology order like:
 SMT
 MC
 CLUSTER  shares SCU
 DIE
 NUMA
Given the fairly loose definition of some of these domains and the freedom to adjust flags with custom topologies, I think it's difficult to say it needs to be this or that. As you point out, this stems from an assumption in the default topology, so eliding the MC level within the current set of abstractions for a very targeted topology still seems reasonable to address the BUG in the very near term in a contained way.
IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in the PPTT table whereas the MC level is defined as the number of cache levels. So i would say that you should compare the level to know the ordering
Then, there is another point: In your case, CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES which is used to define some scheduler internal variable like sd_llc(sched domain last level of cache) which allows fast task migration between this cpus in this level at wakeup. In your case the sd_llc should not be the cluster but the MC with only one CPU. But I would not be surprised that most of perf improvement comes from this sd_llc wrongly set to cluster instead of the single CPU
Agree that we currently have an imperfect representation of SoCs without a shared CPU-side cache. Within this imperfect abstraction, it seems valid to consider the SCU as a valid shared resource to be described by SD_SHARE_PKG_RESOURCES, and as you say, that does result in an overall performance improvement.
Also agree that it is worth working toward a better abstraction. On the way to that, I think it makes sense to avoid BUG'ing on the current topology which is a reasonable one given the current abstraction.
Thanks,
I'd describe this as a mismatch between the default assumptions of the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies a shared cpu-side cache) and SoCs without a shared cpu-side cache. This is encoded in properties of the MC level and the requirement that child domains be a subset of the parent domain cpumask.
The MC-less topology addresses this in a consistent way with other architectures using the provided API for non-default topologies without changing these fundamental assumptions and without changing the final resulting topology which is correct and matches the topology supplied in the ACPI PPTT.
-- Darren Hart Ampere Computing / OS and Kernel
On Tue, 15 Feb 2022 at 21:05, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
...
Hi Darren, What kind of resources are clusters sharing on Ampere Altra? So on Altra, cpus are not sharing LLC? Each LLC is separate for each cpu?
...
This seems to be wrong? If you have 5 cpus, cpu0 has cpu_coregroup_mask(cpu) == 1, cpu1-4 have cpu_coregroup_mask(cpu) == 4; for cpu0 you still need to remove MC, but for cpu1-4 you will need both CLS and MC?
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology and has a cpu_coregroup_mask weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above, printing the following for each CPU matching these criteria:
BUG: arch topology borken
the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as an intermediate group of the MC level, and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES.
In the case of Ampere Altra, they consider that CPUA has a CLUSTER level which shares SD_SHARE_PKG_RESOURCES with another CPUB, but the next and larger MC level then says that CPUA doesn't share SD_SHARE_PKG_RESOURCES with CPUB, which seems odd because the SD_SHARE_PKG_RESOURCES has not disappeared. Looks like there is a mismatch in the topology description.
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from the fact that the default topology order is assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA
but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA
Given the fairly loose definition of some of these domains and the freedom to adjust flags with custom topologies, I think it's difficult to say it needs to be this or that. As you point out, this stems from an assumption in the default topology, so eliding the MC level within the current set of abstractions for a very targeted topology still seems reasonable to address the BUG in the very near term in a contained way.
But if another SoC comes with a valid MC and then a CLUSTER, this proposal will not work.
Keep in mind that the MC level will be removed/degenerated when building because it is useless in your case, so the scheduler topology will still be the same in the end, but it will support more cases. That's why I think you should keep the MC level.
IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in the PPTT table, whereas the MC level is derived from the number of cache levels. So I would say that you should compare the levels to know the ordering.
Then, there is another point: in your case, the CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES, which is used to set some scheduler internal variables like sd_llc (sched domain last level of cache), which allows fast task migration between the cpus in this level at wakeup. In your case the sd_llc should not be the cluster but the MC with only one CPU. But I would not be surprised if most of the perf improvement comes from this sd_llc being wrongly set to the cluster instead of the single CPU.
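For context, the sd_llc selection described above follows from the topology flags roughly like this (a paraphrased sketch based on include/linux/topology.h and kernel/sched/topology.c, not the exact source):

/* The CLS level gets SD_SHARE_PKG_RESOURCES via cpu_cluster_flags()... */
static inline int cpu_cluster_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}

/*
 * ...and the scheduler picks the highest domain carrying that flag as
 * sd_llc, so with no MC level above CLS, the cluster ends up being
 * treated as the "LLC" for wakeup placement.
 */
static void update_top_cache_domain(int cpu)
{
	struct sched_domain *sd;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd)
		per_cpu(sd_llc_size, cpu) = cpumask_weight(sched_domain_span(sd));
	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
}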
Agree that we currently have an imperfect representation of SoCs without a shared CPU-side cache. Within this imperfect abstraction, it seems valid to consider the SCU as a valid shared resource to be described by SD_SHARE_PKG_RESOURCES, and as you say, that does result in an overall performance improvement.
My concern is that there are ongoing discussions about making more use of the CLUSTER level than is currently done, and they assume that we have a valid LLC level after the CLUSTER one, which is not your case. I'm afraid that it will be suboptimal for you because CLUSTER and LLC are wrongly the same in your case, and then you will come back to add more exceptions in the generic scheduler code to cover this first exception.
Also agree that it is worth working toward a better abstraction. On the way to that, I think it makes sense to avoid BUG'ing on the current topology which is a reasonable one given the current abstraction.
Thanks,
I'd describe this as a mismatch between the default assumptions of the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies a shared cpu-side cache) and SoCs without a shared cpu-side cache. This is encoded in properties of the MC level and the requirement that child domains be a subset of the parent domain cpumask.
The MC-less topology addresses this in a way consistent with other architectures, using the provided API for non-default topologies, without changing these fundamental assumptions and without changing the final resulting topology, which is correct and matches the topology supplied in the ACPI PPTT.
-- Darren Hart Ampere Computing / OS and Kernel
-- Darren Hart Ampere Computing / OS and Kernel
On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 21:05, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
...
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as an intermediate group of the MC level, and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES.
In the case of Ampere Altra, they consider that CPUA has a CLUSTER level which shares SD_SHARE_PKG_RESOURCES with another CPUB, but the next and larger MC level then says that CPUA doesn't share SD_SHARE_PKG_RESOURCES with CPUB, which seems odd because the SD_SHARE_PKG_RESOURCES has not disappeared. Looks like there is a mismatch in the topology description.
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from the fact that the default topology order is assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA
but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA
Given the fairly loose definition of some of these domains and the freedom to adjust flags with custom topologies, I think it's difficult to say it needs to be this or that. As you point out, this stems from an assumption in the default topology, so eliding the MC level within the current set of abstractions for a very targeted topology still seems reasonable to address the BUG in the very near term in a contained way.
But if another SoC comes with a valid MC and then a CLUSTER, this proposal will not work.
Keep in mind that the MC level will be removed/degenerated when building because it is useless in your case, so the scheduler topology will still be the same in the end, but it will support more cases. That's why I think you should keep the MC level.
Hi Vincent,
Thanks for reiterating, I don't think I quite understood what you were suggesting before. Is the following in line with what you were thinking?
I am testing a version of this patch which uses a topology like this instead:
MC
CLS
DIE
NUMA
(I tested without an SMT domain since the trigger is still MC weight==1, so there is no valid topology that includes an SMT level under these conditions).
Which results in no BUG output and a final topology on Altra of:
CLS
DIE
NUMA
Which so far seems right (I still need to do some testing and review the sched debug data).
If we take this approach, I think that to address your concern about other systems with valid MCs, we would need a different trigger than MC weight == 1 to use this alternate topology. Do you have a suggestion on what to trigger this on?
Thanks,
On Wed, 16 Feb 2022 at 17:24, Darren Hart darren@os.amperecomputing.com wrote:
On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 21:05, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
...
CLUSTER level is normally defined as an intermediate group of the MC level, and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES.
In the case of Ampere Altra, they consider that CPUA has a CLUSTER level which shares SD_SHARE_PKG_RESOURCES with another CPUB, but the next and larger MC level then says that CPUA doesn't share SD_SHARE_PKG_RESOURCES with CPUB, which seems odd because the SD_SHARE_PKG_RESOURCES has not disappeared. Looks like there is a mismatch in the topology description.
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from the fact that the default topology order is assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA
but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA
Given the fairly loose definition of some of these domains and the freedom to adjust flags with custom topologies, I think it's difficult to say it needs to be this or that. As you point out, this stems from an assumption in the default topology, so eliding the MC level within the current set of abstractions for a very targeted topology still seems reasonable to address the BUG in the very near term in a contained way.
But if another SoC comes with a valid MC and then a CLUSTER, this proposal will not work.
Keep in mind that the MC level will be removed/degenerated when building because it is useless in your case, so the scheduler topology will still be the same in the end, but it will support more cases. That's why I think you should keep the MC level.
Hi Vincent,
Thanks for reiterating, I don't think I quite understood what you were suggesting before. Is the following in line with what you were thinking?
I am testing a version of this patch which uses a topology like this instead:
MC
CLS
DIE
NUMA
(I tested without an SMT domain since the trigger is still MC weight==1, so there is no valid topology that includes an SMT level under these conditions).
Which results in no BUG output and a final topology on Altra of:
CLS
DIE
NUMA
Which so far seems right (I still need to do some testing and review the sched debug data).
If we take this approach, I think that to address your concern about other systems with valid MCs, we would need a different trigger than MC weight == 1 to use this alternate topology. Do you have a suggestion on what to trigger this on?
AFAICT, this CLUSTER level is only supported by ACPI. In parse_acpi_topology() you should be able to know if cluster level is above or below the level returned by acpi_find_last_cache_level() and set the correct topology table accordingly
Thanks,
-- Darren Hart Ampere Computing / OS and Kernel
On Wed, Feb 16, 2022 at 06:52:29PM +0100, Vincent Guittot wrote:
On Wed, 16 Feb 2022 at 17:24, Darren Hart darren@os.amperecomputing.com wrote:
On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 21:05, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
...
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from the fact that the default topology order is assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA
but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA
Given the fairly loose definition of some of these domains and the freedom to adjust flags with custom topologies, I think it's difficult to say it needs to be this or that. As you point out, this stems from an assumption in the default topology, so eliding the MC level within the current set of abstractions for a very targeted topology still seems reasonable to address the BUG in the very near term in a contained way.
But if another SoC comes with a valid MC and then a CLUSTER, this proposal will not work.
Keep in mind that the MC level will be removed/degenerated when building because it is useless in your case, so the scheduler topology will still be the same in the end, but it will support more cases. That's why I think you should keep the MC level.
Hi Vincent,
Thanks for reiterating, I don't think I quite understood what you were suggesting before. Is the following in line with what you were thinking?
I am testing a version of this patch which uses a topology like this instead:
MC
CLS
DIE
NUMA
(I tested without an SMT domain since the trigger is still MC weight==1, so there is no valid topology that includes an SMT level under these conditions).
Which results in no BUG output and a final topology on Altra of:
CLS
DIE
NUMA
Which so far seems right (I still need to do some testing and review the sched debug data).
If we take this approach, I think that to address your concern about other systems with valid MCs, we would need a different trigger than MC weight == 1 to use this alternate topology. Do you have a suggestion on what to trigger this on?
AFAICT, this CLUSTER level is only supported by ACPI. In parse_acpi_topology() you should be able to know if cluster level is above or below the level returned by acpi_find_last_cache_level() and set the correct topology table accordingly
Thanks Vincent,
This made sense to me as a place to start. The more I dug into the ACPI PPTT code, the more I kept running into conflicts with the API which would require extending it in ways that seem contrary to its intent, e.g. the exposed API uses kernel logical CPUs rather than the topology ids (needed for working with processor containers). The cpu_topology masks haven't been populated yet, and acpi_find_last_cache_level is decoupled from the CPU topology level. So what we're really testing for is whether the cluster cpumask is a subset of the coregroup cpumask or not, and it made the most sense to me to keep that in smp.c after the cpumasks have been updated and stored.
I'll send v2 out for review shortly using this approach.
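For illustration, a minimal sketch of the kind of check described above, assuming it lives in smp.c after the cpumasks are populated; the helper name is hypothetical and this is not the actual v2 patch:

/*
 * Hypothetical sketch only: if any CPU's cluster mask is not contained
 * in its coregroup mask, the default SMT/CLS/MC ordering cannot hold,
 * so fall back to the MC-less topology table.
 */
static bool __init cluster_wider_than_coregroup(void)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		if (!cpumask_subset(cpu_clustergroup_mask(cpu),
				    cpu_coregroup_mask(cpu)))
			return true;
	}

	return false;
}

	/* ...and then, once the masks are stored: */
	if (cluster_wider_than_coregroup())
		set_sched_topology(arm64_no_mc_topology);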
On Wed, 23 Feb 2022 at 04:08, Darren Hart darren@os.amperecomputing.com wrote:
On Wed, Feb 16, 2022 at 06:52:29PM +0100, Vincent Guittot wrote:
On Wed, 16 Feb 2022 at 17:24, Darren Hart darren@os.amperecomputing.com wrote:
On Wed, Feb 16, 2022 at 09:30:49AM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 21:05, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 07:19:45PM +0100, Vincent Guittot wrote:
...
I think that the problem comes from the fact that the default topology order is assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA
but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA
Given the fairly loose definition of some of these domains and the freedom to adjust flags with custom topologies, I think it's difficult to say it needs to be this or that. As you point out, this stems from an assumption in the default topology, so eliding the MC level within the current set of abstractions for a very targeted topology still seems reasonable to address the BUG in the very near term in a contained way.
But if another SoC comes with a valid MC and then a CLUSTER, this proposal will not work.
Keep in mind that the MC level will be removed/degenerated when building because it is useless in your case, so the scheduler topology will still be the same in the end, but it will support more cases. That's why I think you should keep the MC level.
Hi Vincent,
Thanks for reiterating, I don't think I quite understood what you were suggesting before. Is the following in line with what you were thinking?
I am testing a version of this patch which uses a topology like this instead:
MC
CLS
DIE
NUMA
(I tested without an SMT domain since the trigger is still MC weight==1, so there is no valid topology that includes an SMT level under these conditions).
Which results in no BUG output and a final topology on Altra of:
CLS
DIE
NUMA
Which so far seems right (I still need to do some testing and review the sched debug data).
If we take this approach, I think that to address your concern about other systems with valid MCs, we would need a different trigger than MC weight == 1 to use this alternate topology. Do you have a suggestion on what to trigger this on?
AFAICT, this CLUSTER level is only supported by ACPI. In parse_acpi_topology() you should be able to know if cluster level is above or below the level returned by acpi_find_last_cache_level() and set the correct topology table accordingly
Thanks Vincent,
This made sense to me as a place to start. The more I dug into the ACPI PPTT code, the more I kept running into conflicts with the API which would require extending it in ways that seem contrary to its intent, e.g. the exposed API uses kernel logical CPUs rather than the topology ids (needed for working with processor containers). The cpu_topology masks haven't been populated yet, and acpi_find_last_cache_level is decoupled from the CPU topology level. So what we're really testing for is whether the cluster cpumask is a subset of the coregroup cpumask or not, and it made the most sense to me to keep that in smp.c after the cpumasks have been updated and stored.
I'm not sure why you want to compare cpumasks when you can directly compare topology levels, which is exactly what we are looking for in order to correctly order the scheduler topology. I was expecting something like the below to be enough. acpi_find_cluster_level() needs to be created and should be similar to find_acpi_cpu_topology_cluster() but return the level instead of the id. The main advantage is that everything is contained in topology.c, which makes sense as we are playing with topology.
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 9ab78ad826e2..4dac0491b7e3 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -84,6 +84,7 @@ static bool __init acpi_cpu_is_threaded(int cpu)
 int __init parse_acpi_topology(void)
 {
 	int cpu, topology_id;
+	bool default_cluster_topology = true;

 	if (acpi_disabled)
 		return 0;
@@ -119,8 +120,16 @@ int __init parse_acpi_topology(void)
 			if (cache_id > 0)
 				cpu_topology[cpu].llc_id = cache_id;
 		}
+
+		if (default_cluster_topology &&
+		    (i < acpi_find_cluster_level(cpu))) {
+			default_cluster_topology = false;
+		}
 	}

+	if (!default_cluster_topology)
+		set_sched_topology(arm64_no_mc_topology);
+
 	return 0;
 }
 #endif
I'll send v2 out for review shortly using this approach.
Ok.
-- Darren Hart Ampere Computing / OS and Kernel
On Wed, Feb 23, 2022 at 09:19:17AM +0100, Vincent Guittot wrote:
...
AFAICT, this CLUSTER level is only supported by ACPI. In parse_acpi_topology() you should be able to know if cluster level is above or below the level returned by acpi_find_last_cache_level() and set the correct topology table accordingly
Thanks Vincent,
This made sense to me as a place to start. The more I dug into the ACPI PPTT code, the more I kept running into conflicts with the API which would require extending it in ways that seem contrary to its intent, e.g. the exposed API uses kernel logical CPUs rather than the topology ids (needed for working with processor containers). The cpu_topology masks haven't been populated yet, and acpi_find_last_cache_level is decoupled from the CPU topology level. So what we're really testing for is whether the cluster cpumask is a subset of the coregroup cpumask or not, and it made the most sense to me to keep that in smp.c after the cpumasks have been updated and stored.
I'm not sure why you want to compare cpumasks when you can directly compare topology levels, which is exactly what we are looking for in order to correctly order the scheduler topology. I was expecting something like the below to be enough. acpi_find_cluster_level() needs to be created and should be similar to find_acpi_cpu_topology_cluster() but return the level instead of the id. The main advantage is that everything is contained in topology.c, which makes sense as we are playing with topology.
Hi Vincent,
This was my thinking as well before I dug into the acpi pptt interfaces.
The cpu topology levels and the cache levels are independent and assuming I've not misunderstood the implementation, acpi_find_cache_level() returns the highest *cache* level described in the PPTT for a given CPU.
For the Ampere Altra, for example:
CPU Topo Level:
 1 System
 2 Package
     |
 3 Cluster
     |
 4 Processor --- L1 I Cache \____ L2 U Cache -x
             --- L1 D Cache /

 4 Processor --- L1 I Cache \____ L2 U Cache -x
             --- L1 D Cache /

Cache Level ---->     1               2
acpi_find_cache_level() returns "2" for any logical cpu, but this doesn't tell us anything about the CPU topology level across which this cache is shared.
This is what drove me out of topology.c and up into smp.c, where the various topologies have already been populated and the masks can be compared. I'll spend a bit more time before sending a cpumask implementation to see if there is a better point to do this, where the cpu topology level with shared cache is more readily available.
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 9ab78ad826e2..4dac0491b7e3 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -84,6 +84,7 @@ static bool __init acpi_cpu_is_threaded(int cpu)
 int __init parse_acpi_topology(void)
 {
 	int cpu, topology_id;
+	bool default_cluster_topology = true;

 	if (acpi_disabled)
 		return 0;
@@ -119,8 +120,16 @@ int __init parse_acpi_topology(void)
 			if (cache_id > 0)
 				cpu_topology[cpu].llc_id = cache_id;
 		}
+
+		if (default_cluster_topology &&
+		    (i < acpi_find_cluster_level(cpu))) {

Per above, from what I understand, this is comparing cpu topology levels with cache levels, which are independent from each other.

+			default_cluster_topology = false;
+		}
 	}

+	if (!default_cluster_topology)
+		set_sched_topology(arm64_no_mc_topology);
+
 	return 0;
 }
Thanks,
On Wed, 23 Feb 2022 at 17:39, Darren Hart darren@os.amperecomputing.com wrote:
On Wed, Feb 23, 2022 at 09:19:17AM +0100, Vincent Guittot wrote:
...
AFAICT, this CLUSTER level is only supported by ACPI. In parse_acpi_topology() you should be able to know if cluster level is above or below the level returned by acpi_find_last_cache_level() and set the correct topology table accordingly
Thanks Vincent,
This made sense to me as a place to start. The more I dug into the ACPI PPTT code, the more I kept running into conflicts with the API which would require extending it in ways that seem contrary to its intent, e.g. the exposed API uses kernel logical CPUs rather than the topology ids (needed for working with processor containers). The cpu_topology masks haven't been populated yet, and acpi_find_last_cache_level is decoupled from the CPU topology level. So what we're really testing for is whether the cluster cpumask is a subset of the coregroup cpumask or not, and it made the most sense to me to keep that in smp.c after the cpumasks have been updated and stored.
I'm not sure why you want to compare cpumasks when you can directly compare topology levels, which is exactly what we are looking for in order to correctly order the scheduler topology. I was expecting something like the below to be enough. acpi_find_cluster_level() needs to be created and should be similar to find_acpi_cpu_topology_cluster() but return the level instead of the id. The main advantage is that everything is contained in topology.c, which makes sense as we are playing with topology.
Hi Vincent,
This was my thinking as well before I dug into the acpi pptt interfaces.
The cpu topology levels and the cache levels are independent and assuming I've not misunderstood the implementation, acpi_find_cache_level() returns the highest *cache* level described in the PPTT for a given CPU.
For the Ampere Altra, for example:
CPU Topo Level:
 1 System
 2 Package
     |
 3 Cluster
     |
 4 Processor --- L1 I Cache \____ L2 U Cache -x
             --- L1 D Cache /

 4 Processor --- L1 I Cache \____ L2 U Cache -x
             --- L1 D Cache /

Cache Level ---->     1               2
acpi_find_cache_level() returns "2" for any logical cpu, but this doesn't tell us anything about the CPU topology level across which this cache is shared.
Yeah, I had wrongly assumed that we could directly compare the cache level with the cluster level because find_acpi_cpu_cache_topology() returns a cpu node id, but find_acpi_cpu_cache_topology() walks the cpu topology to find the cpu node which maps the cache level. This means that we would have to walk the cpu topology until we find either the cluster node id or the cache node id to know which one belongs to the other.
This is what drove me out of topology.c and up into smp.c, where the various topologies have already been populated and the masks can be compared. I'll spend a bit more time before sending a cpumask implementation to see if there is a better point to do this, where the cpu topology level with shared cache is more readily available.
...
Thanks,
-- Darren Hart Ampere Computing / OS and Kernel
On Wed, Feb 16, 2022 at 7:30 PM Vincent Guittot vincent.guittot@linaro.org wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
On Tue, Feb 15, 2022 at 04:38:59PM +0000, Will Deacon wrote:
...
What is the *current* behaviour on such a system?
As I understand it, any system that uses the default topology and has a cpu_coregroup_mask weight of 1 and a child (cluster, smt, ...) weight > 1 will behave as described above, printing the following for each CPU matching these criteria:
BUG: arch topology borken
the [CLS,SMT,...] domain not a subset of the MC domain
And then extend the MC domain cpumask to match that of the child and continue on.
That would still be the behavior for this type of system after this patch is applied.
That's what I thought, but in that case applying your patch is a net improvement: systems either get current or better behaviour.
CLUSTER level is normally defined as an intermediate group of the MC level, and both levels have the scheduler flag SD_SHARE_PKG_RESOURCES.
In the case of Ampere Altra, they consider that CPUA has a CLUSTER level which shares SD_SHARE_PKG_RESOURCES with another CPUB, but the next and larger MC level then says that CPUA doesn't share SD_SHARE_PKG_RESOURCES with CPUB, which seems odd because the SD_SHARE_PKG_RESOURCES has not disappeared. Looks like there is a mismatch in the topology description.
Hi Vincent,
Agree. Where do you think this mismatch exists?
I think that the problem comes from the fact that the default topology order is assumed to be:
SMT
CLUSTER (shares pkg resources, i.e. cache)
MC
DIE
NUMA
but in your case, you want a topology order like:
SMT
MC
CLUSTER (shares SCU)
DIE
NUMA
IIUC, the cluster is defined as the 2nd (no SMT) or 3rd (SMT) level in the PPTT table, whereas the MC level is derived from the number of cache levels. So I would say that you should compare the levels to know the ordering.
Then, there is another point: in your case, the CLUSTER level still has the flag SD_SHARE_PKG_RESOURCES, which is used to set some scheduler internal variables like sd_llc (sched domain last level of cache), which allows fast task migration between the cpus in this level at wakeup. In your case the sd_llc should not be the cluster but the MC with only one CPU. But I would not be surprised if most of the perf improvement comes from this sd_llc being wrongly set to the cluster instead of the single CPU.
I assume this "mistake" is actually what Ampere Altra needs: while it is wrong, it gets the right result? Ampere Altra has already got both:
1. load balance between clusters
2. wake_affine selecting a sibling cpu which is sharing the SCU
I am not sure how much 1 and 2 are helping Darren's workloads respectively.
I'd describe this as a mismatch between the default assumptions of the sched domains construction code (that SD_SHARE_PKG_RESOURCES implies a shared cpu-side cache) and SoCs without a shared cpu-side cache. This is encoded in properties of the MC level and the requirement that child domains be a subset of the parent domain cpumask.
The MC-less topology addresses this in a way consistent with other architectures, using the provided API for non-default topologies, without changing these fundamental assumptions and without changing the final resulting topology, which is correct and matches the topology supplied in the ACPI PPTT.
--
Darren Hart
Ampere Computing / OS and Kernel
Thanks Barry
On Wed, Feb 16, 2022 at 08:03:40PM +1300, Barry Song wrote:
On Wed, Feb 16, 2022 at 7:30 PM Vincent Guittot vincent.guittot@linaro.org wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
On Tue, Feb 15, 2022 at 08:44:23AM -0800, Darren Hart wrote:
I assume this "mistake" is actually what Ampere Altra needs: it is wrong but gets the right result? Ampere Altra has already got both:
Hi Barry,
Generally yes - although I do think we're placing too much emphasis on the "right" or "wrong" of a heuristic, which is more fluid in definition over time (e.g. I expect this will look different in a year based on what we learn from this and other non-default topologies).
- Load Balance between clusters
- wake_affine by select sibling cpu which is sharing SCU
I am not sure how much 1 and 2 are helping Darren's workloads respectively.
We definitely see improvements with load balancing between clusters. We're running some tests with the wake_affine patchset you pointed me to (thanks for that). My initial tbench runs resulted in higher average and max latencies reported. I need to collect more results and see the impact to other benchmarks of interest before I have more to share on that.
Thanks,
On Thu, Feb 17, 2022 at 4:44 AM Darren Hart darren@os.amperecomputing.com wrote:
On Wed, Feb 16, 2022 at 08:03:40PM +1300, Barry Song wrote:
On Wed, Feb 16, 2022 at 7:30 PM Vincent Guittot vincent.guittot@linaro.org wrote:
On Tue, 15 Feb 2022 at 18:32, Darren Hart darren@os.amperecomputing.com wrote:
On Tue, Feb 15, 2022 at 06:09:08PM +0100, Vincent Guittot wrote:
On Tue, 15 Feb 2022 at 17:46, Will Deacon will@kernel.org wrote:
We definitely see improvements with load balancing between clusters. We're running some tests with the wake_affine patchset you pointed me to (thanks for that). My initial tbench runs resulted in higher average and max latencies reported. I need to collect more results and see the impact to other benchmarks of interest before I have more to share on that.
Hi Darren, if you read Vincent's comments carefully, you will find it is pointless for you to test the wake_affine patchset, as you have already got its effect. In your case, sd_llc_id is set to the cluster level because of SD_SHARE_PKG_RESOURCES, so my new wake_affine patchset is completely redundant for your machine: it works with the assumption cluster -> llc, but in your case llc = cluster, so it works as cluster -> cluster. Please read:
static void update_top_cache_domain(int cpu)
{
	struct sched_domain_shared *sds = NULL;
	struct sched_domain *sd;
	int id = cpu;
	int size = 1;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}

	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;
	per_cpu(sd_llc_id, cpu) = id;
	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

	...
}
this is also the content of Vincent's last comment: " My concern is that there are some ongoing discussion to make more usage of the CLUSTER level than what is currently done and it assumes that we have a valid LLC level after the CLUSTER one which is not your case and I'm afraid that it will be suboptimal for you because CLUSTER and LLC are wrongly the same for your case and then you will come back to add more exception in the generic scheduler code to cover this first exception"
So it is incorrect for you to say you gain performance only from load balancing :-)
Thanks Barry
On Thu, Feb 17, 2022 at 07:56:00AM +1300, Barry Song wrote: ...
Hi Darren, if you read Vincent's comments carefully, you will find it is pointless for you to test the wake_affine patchset, as you have already got its effect. In your case, sd_llc_id is set to the cluster level because of SD_SHARE_PKG_RESOURCES, so my new wake_affine patchset is completely redundant for your machine: it works with the assumption cluster -> llc, but in your case llc = cluster, so it works as cluster -> cluster.
Thanks Barry,
Makes sense as described. I did see degradation in the tests we ran with this patch applied to 5.17-rc3. I'll have to follow up with you on that when I can dig into it more. I'd be interested in the specifics of your testing to run something similar. I think you said you were reporting on tbench?