Hello Jean,
On 4/18/2025 2:38 AM, Jean-Baptiste Roquefere wrote:
Hi, We (Ateme, a video encoding company) may have found an unwanted behavior in the scheduler since 5.10 (commit 16b0a7a1a0af), then 5.16 (commit c5b0a7eefc70), then 5.19 (commit not found yet), and maybe some other commits from 5.19 to 6.12, with an IPC decrease as a consequence. The problem still appears on the latest 6.12, 6.13 and 6.14.
Looking at the commit logs, these commits solve other problems around load balancing and might not be trivial to revert without evaluating the side effects.
We have reverted both commits 16b0a7a1a0af and c5b0a7eefc70, which reduce our performance (see fair.patch attached, applicable on 6.12.17). Performance increases but still does not reach our reference on 5.4.152.
Instead of trying to find every single commit from 5.18 to 6.12 that could decrease our performance, I chose to bench 5.4.152 versus 6.12.17 with and without fair.patch.
The problem appeared clearly: a lot of CPU migrations go outside the CCX, which leads to L3 misses and in turn to an IPC decrease.
Context of our bench: a video decoder which works at a regulated speed, 1 process, 21 main threads, each of which creates 10 threads (8 of them have a fine granularity, meaning they go to sleep quite often, giving the scheduler a lot of opportunities to act). Hardware is an AMD EPYC 7702P, 128 cores, grouped by shared LLC into 4 cores + 4 hyperthreaded cores. The NUMA topology is set by the BIOS to 1 node per socket. Every pthread is created with default attributes. I use AMDuProf (-C -A system -a -m ipc,l1,l2,l3,memory) for CPU utilization (%), CPU effective frequency, IPC, L2 accesses (pti), L2 misses (pti), L3 misses (absolute) and memory bandwidth (GB/s), and perf (stat -d -d -d -a) for context switches, CPU migrations and real time (s).
We noted that upgrading from 5.4.152 to 6.12.17 without any special preempt configuration gives:
- a two-fold increase in CPU migrations
- a 30% increase in memory bandwidth
- a 20% increase in L3 cache misses
- a 10% decrease in IPC
With the attached fair.patch applied to 6.12.17 (reminder: this patch reverts one commit that appeared in 5.10 and another that appeared in 5.16) we managed to reduce CPU migrations and increase IPC, but not as much as we had on 5.4.152. Our goal is to keep the kernel "clean" without any patch (we don't want to apply and maintain fair.patch), so for the rest of my email we will consider the stock kernel 6.12.17.
I've reduced the "sub thread count" to stay below 128 threads: still 21 main threads, but instead of 10 workers per main thread I set 5 workers (4 of them with fine granularity), giving 105 pthreads. Everything goes fine in 6.12.17: no extra CPU migrations, no extra memory bandwidth...
The processor you are running on, the AMD EPYC 7702P based on the Zen 2 architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is perhaps why reducing the thread count below this limit is helping your workload.
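If it helps, the LLC grouping the kernel sees can be read directly from sysfs (cpu0 below is just an example, any CPU works; on this part, index3 should be the L3):

    cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

which should list the 8 hardware threads that share cpu0's L3 / CCX.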
What we suspect is that when running the workload, the threads that regularly sleep trigger newidle balancing, which causes them to be pulled to another CCX, leading to a higher number of L3 misses.
To confirm this, would it be possible to run the workload with the not-yet-upstream perf sched stats [1] tool and share the output of perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch, to rule out any other second-order effects?
[1] https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
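Roughly, the collection would look like this (the subcommand names below are taken from the posted series and may still change before it lands, so please double-check against [1]; the file names are just placeholders):

    # run once on v6.12.17 and once on v6.12.17 + fair.patch
    perf sched stats record -- <your workload>
    # then compare the two resulting perf.data files
    perf sched stats diff perf.data.stock perf.data.patched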
But as soon as we increase the worker thread count (10 instead of 5), the problem appears.
We know our decoder may have too many threads but that's out of our scope; it was designed like that some years ago and moving from "lots of small threads" to "a few big threads" is not possible for now.
We have a workaround: we group threads using pthread affinities. Every main thread (and, by inheritance of affinities, every worker thread) is pinned to a single CCX, so we reduce L3 misses for them, which in turn decreases memory bandwidth and finally increases IPC.
With that solution we go above our original performance on both kernels, and they perform at the same level. However, it is impractical to productize as such.
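For reference, a minimal sketch of that kind of CCX pinning (assuming, for illustration only, that CCX number ccx owns CPUs ccx*8 .. ccx*8+7, which the kernel's CPU numbering does not guarantee; the mapping of main threads to CCX indices is likewise hypothetical):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <pthread.h>
    #include <stdio.h>

    /*
     * Illustrative only: assumes the 8 hardware threads of CCX `ccx` are
     * CPUs ccx*8 .. ccx*8+7.  The real numbering should be taken from
     * /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list.
     */
    static int pin_self_to_ccx(unsigned int ccx)
    {
        cpu_set_t set;
        unsigned int cpu;

        CPU_ZERO(&set);
        for (cpu = ccx * 8; cpu < ccx * 8 + 8; cpu++)
            CPU_SET(cpu, &set);

        /* Worker threads created after this call inherit the mask. */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int err = pin_self_to_ccx(0);   /* CCX index chosen per main thread */

        if (err)
            fprintf(stderr, "pthread_setaffinity_np: %d\n", err);
        /* ... create this main thread's workers here; they stay on CCX 0 ... */
        return err ? 1 : 0;
    }

Each main thread would call pin_self_to_ccx() with its own CCX index before spawning its workers, so the workers inherit the mask through the default pthread attributes.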
I've tried many kernel build configurations (CONFIG_PREEMPT_*, CONFIG_SCHEDULER_*, tuning of fair.c:sysctl_sched_migration_cost) on 6.12.17, 6.12.21 (longterm), 6.13.9 (mainline), and 6.14.0. Nothing changes.
Q: Is there any way to tune the kernel so we can get our performance back without using the pthread affinity workaround?
Assuming you control these deployments, would it be possible to run the workload on a kernel booted with the "relax_domain_level=2" kernel cmdline, which restricts newidle balancing to within the CCX? As a side effect, it also limits task wakeups to the same LLC domain, but I would still like to know if this makes a difference to the workload you are running.
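For example, on a GRUB-based setup this could be tried by appending the parameter to the kernel command line (the exact variable and regeneration step depend on the distribution):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="... relax_domain_level=2"
    # then regenerate the grub config and reboot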
Note: This is a system-wide knob that will affect all workloads running on the system, so it is better used for debugging purposes.