Hello Vincent,
On 5/5/2025 3:58 PM, Vincent Guittot wrote:
On Wed, 30 Apr 2025 at 11:13, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation EPYC platform that has a small LLC domain (4C/8T) and very noticeable C2C latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") helps the workload. Both commits allow aggressive migrations for work conservation, but they also increase cache misses, which slows the workload quite a bit.
commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") eases the spreading of tasks inside an LLC, so it's not obvious to me how it would increase "a lot of CPU migrations go out of CCX, then L3 miss,". On the other hand, it will spread tasks across SMT siblings and within the LLC, which can prevent running at the highest frequency on some systems, but I don't know if that's relevant for this SoC.
I misspoke there. JB's workload seems to be sensitive even to core-to-core migrations: "relax_domain_level=2" actually disabled newidle balance above the CLUSTER level, which is a subset of MC on x86 and gets degenerated into the SMT domain.
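To make the effect of "relax_domain_level=2" concrete, here is a toy userspace model, not kernel code. It assumes the x86 levels are numbered SMT=0, CLUSTER=1, MC=2, PKG=3 and that domains at or above the requested level lose their idle-balance flags; the exact comparison has differed between kernel versions, and domain degeneration is not modelled. The helper names relax() and newidle_domains() are made up for illustration.

```python
# Toy model only: the flag names mirror the kernel's, everything else is illustrative.
SD_BALANCE_NEWIDLE = 1 << 0
SD_BALANCE_WAKE = 1 << 1

# Assumed x86 topology levels, lowest first (index == level).
LEVELS = ["SMT", "CLUSTER", "MC", "PKG"]


def relax(levels, request):
    """Return {domain name: flags} after applying a relax_domain_level-style request."""
    flags = {}
    for level, name in enumerate(levels):
        f = SD_BALANCE_NEWIDLE | SD_BALANCE_WAKE
        # Assumption: a negative request means "no change"; otherwise domains
        # at or above the requested level stop newidle and wakeup balancing.
        if request >= 0 and level >= request:
            f &= ~(SD_BALANCE_NEWIDLE | SD_BALANCE_WAKE)
        flags[name] = f
    return flags


def newidle_domains(flags):
    """List the domains that still have newidle balance enabled."""
    return [name for name, f in flags.items() if f & SD_BALANCE_NEWIDLE]


print(newidle_domains(relax(LEVELS, 2)))
```

Under these assumptions only SMT and CLUSTER keep newidle balance, and since CLUSTER degenerates into SMT on this platform, that leaves newidle balance between SMT siblings only.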
commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost condition") makes newly idle migration happen more often, which can then migrate tasks across LLCs. But then it's more a question of why newly idle load balance is enabled outside the LLC if it is so costly.
It seems to be a very workload-specific, and possibly platform-specific, characteristic where re-priming the cache is actually very costly. I'm not sure if there are any other uarch factors at play here that require re-priming (branch prediction, prefetcher, etc.) after a task migration to reach the same IPC.
Essentially, "relax_domain_level" gives the desired behavior, where only the periodic balance corrects long-term imbalance, but as Libo mentioned, short-term imbalances can build up, and using "relax_domain_level" might lead to other problems.
Short of pinning, or more analysis of which migrations make the workload unhappy, I couldn't think of a better way to communicate this requirement.
"relax_domain_level" helps but cannot be set at runtime and I couldn't think of any stable / debug interfaces that JB hasn't tried out already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at runtime, but given that cpusets did away with this when transitioning to cgroup-v2, I don't know what the sentiments are around its usage. Any input / feedback is greatly appreciated.