 
Hello Prateek,
Thanks for your response.
Looking at the commit logs, it looks like these commits do solve other problems around load balancing and might not be trivial to revert without evaluating the damages.
It's definitely not a productizable workaround!
The processor you are running on, the AMD EPYC 7702P based on the Zen2 architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is perhaps why reducing the thread count to below this limit is helping your workload.
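The CCX (LLC domain) layout can be checked on the machine itself. Below is a minimal sketch, assuming the usual Linux sysfs cache topology files (`cache/index*/level` and `cache/index*/shared_cpu_list` under `/sys/devices/system/cpu/cpuN/`); the helper names `parse_cpu_list` and `llc_domains` are made up for this illustration and are not part of any tool mentioned in this thread.

```python
import glob
import os
from collections import defaultdict


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-3,64-67" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def llc_domains(sysfs_root="/sys/devices/system/cpu"):
    """Group CPUs by the shared_cpu_list of their level-3 cache.

    Assumption: the L3 is the last-level cache, as on Zen2.
    """
    domains = defaultdict(list)
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index*")
    for cache_dir in glob.glob(pattern):
        with open(os.path.join(cache_dir, "level")) as f:
            if f.read().strip() != "3":
                continue  # only look at level-3 caches
        with open(os.path.join(cache_dir, "shared_cpu_list")) as f:
            shared = f.read().strip()
        # cache_dir looks like <root>/cpu12/cache/index3; extract the "12"
        cpu_dir = os.path.dirname(os.path.dirname(cache_dir))
        domains[shared].append(int(os.path.basename(cpu_dir)[3:]))
    return {shared: sorted(cpus) for shared, cpus in domains.items()}


if __name__ == "__main__":
    ordered = sorted(llc_domains().items(), key=lambda kv: parse_cpu_list(kv[0]))
    for shared, cpus in ordered:
        print(f"LLC domain {shared}: {len(cpus)} CPUs")
```

If the description above holds, each domain printed on this machine should list 8 CPUs (4 cores with 2 SMT threads each).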
What we suspect is that when running the workload, the threads that regularly sleep trigger a newidle balancing which causes them to move to another CCX, leading to a higher number of L3 misses.
To confirm this, would it be possible to run the workload with the not-yet-upstream perf sched stats [1] tool and share the result from perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch, to rule out any other second-order effect?
[1] https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
I had to patch open_file_read(struct perf_data *data) in tools/perf/util/session.c to get past a "failed to open perf.data: File exists" error (it looked more like a compiler issue than a tools/perf issue).
$ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched > perf.diff
(see perf.diff attached)
Assuming you control these deployments, would it be possible to run the workload on a kernel booted with "relax_domain_level=2" on the kernel cmdline, which restricts newidle balance to only within the CCX? As a side effect, it also limits task wakeups to the same LLC domain, but I would still like to know if this makes a difference to the workload you are running.
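Before comparing numbers, it can be worth confirming that the parameter was actually picked up at boot. A minimal sketch, assuming the running kernel exposes its command line in `/proc/cmdline`; the helper name `cmdline_param` is hypothetical, and keeping the last occurrence of a repeated parameter is a choice made for this sketch.

```python
def cmdline_param(cmdline, key):
    """Return the value of `key=value` from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if sep and name == key:
            value = val  # keep the last occurrence if the key is repeated
    return value


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        level = cmdline_param(f.read(), "relax_domain_level")
    print("relax_domain_level =", level)
```

A result of `None` means the parameter is not set on the command line of the running kernel.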
On vanilla 6.12.17, relax_domain_level=2 gives the IPC we expected:
+--------------------+--------------------------+-----------------------+
|                    | relax_domain_level unset | relax_domain_level=2  |
+--------------------+--------------------------+-----------------------+
| Threads            | 210                      | 210                   |
| Utilization (%)    | 65,86                    | 52,01                 |
| CPU effective freq | 1 622,93                 | 1 294,12              |
| IPC                | 1,14                     | 1,42                  |
| L2 access (pti)    | 34,36                    | 38,18                 |
| L2 miss (pti)      | 7,34                     | 7,78                  |
| L3 miss (abs)      | 39 711 971 741           | 33 929 609 924        |
| Mem (GB/s)         | 70,68                    | 49,10                 |
| Context switches   | 109 281 524              | 107 896 729           |
+--------------------+--------------------------+-----------------------+
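To put the two columns side by side, the relative change per metric can be computed with a short sketch like the one below. The values are copied from the table (space as thousands separator, comma as decimal mark); the helper names `parse_number`, `relative_change` and `summarize` are only for illustration.

```python
def parse_number(text):
    """Parse a number written with space thousands separators and a decimal comma."""
    return float(text.replace(" ", "").replace(",", "."))


def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (relax_domain_level unset, relax_domain_level=2), copied from the table
ROWS = {
    "Utilization (%)": ("65,86", "52,01"),
    "IPC": ("1,14", "1,42"),
    "L3 miss (abs)": ("39 711 971 741", "33 929 609 924"),
    "Mem (GB/s)": ("70,68", "49,10"),
    "Context switches": ("109 281 524", "107 896 729"),
}


def summarize(rows=ROWS):
    """Map each metric name to its relative change in percent."""
    return {
        name: relative_change(parse_number(unset), parse_number(level2))
        for name, (unset, level2) in rows.items()
    }


if __name__ == "__main__":
    for name, pct in summarize().items():
        print(f"{name:<18} {pct:+6.1f} %")
```

With these values, IPC goes up by roughly 25 %, while L3 misses drop by roughly 15 % and memory bandwidth by roughly 31 %, at an almost unchanged number of context switches.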
Kind regards,
JB