"Nicholas Piggin" npiggin@gmail.com writes:
Hmm, interesting bug. Impressive work to track it down.
Thanks Nick for taking a look at it.
On Fri Aug 1, 2025 at 8:37 PM AEST, Donet Tom wrote:
On systems using the hash MMU, there is a software SLB preload cache that mirrors the entries loaded into the hardware SLB. This preload cache is subject to periodic eviction, typically after every 256 context switches, to remove old entries.
To optimize performance, the kernel skips switch_mmu_context() in switch_mm_irqs_off() when the prev and next mm_struct are the same. However, on hash MMU systems, this can lead to inconsistencies between the hardware SLB and the software preload cache.
If an SLB entry for a process is evicted from the software cache on one CPU, and the same process later runs on another CPU without executing switch_mmu_context(), that other CPU's hardware SLB may still hold the stale entry. If the kernel then attempts to preload the same entry again, it can trigger an SLB multi-hit error.
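As a rough illustration (a user-space sketch only, not kernel code; hw_slb, preload_cache and the model_* helpers are invented names that mimic the behaviour described above), the per-CPU hardware SLB and the shared software preload cache can drift apart like this:

/*
 * Illustrative user-space model only -- not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		2
#define SLB_ENTRIES	4

static unsigned long hw_slb[NR_CPUS][SLB_ENTRIES];	/* per-CPU HW SLB, 0 == empty */
static unsigned long preload_cache[SLB_ENTRIES];	/* task's SW preload cache */
static int preload_nr;

/* Like switch_slb(): flush this CPU's SLB, then reload from the SW cache. */
static void model_switch_slb(int cpu)
{
	for (int i = 0; i < SLB_ENTRIES; i++)
		hw_slb[cpu][i] = 0;
	for (int i = 0; i < preload_nr; i++)
		hw_slb[cpu][i] = preload_cache[i];
}

/* Like preload_age(): drop the oldest SW cache entry; no SLB invalidation. */
static void model_preload_age(void)
{
	if (!preload_nr)
		return;
	for (int i = 1; i < preload_nr; i++)
		preload_cache[i - 1] = preload_cache[i];
	preload_nr--;
}

/* Insert an EA into a CPU's SLB; false means the EA is already present. */
static bool model_slb_insert(int cpu, unsigned long ea)
{
	for (int i = 0; i < SLB_ENTRIES; i++)
		if (hw_slb[cpu][i] == ea)
			return false;		/* would be an SLB multi-hit */
	for (int i = 0; i < SLB_ENTRIES; i++)
		if (!hw_slb[cpu][i]) {
			hw_slb[cpu][i] = ea;
			break;
		}
	return true;
}

int main(void)
{
	unsigned long ea = 0x7fffceb50000UL;	/* some user EA of process P */

	preload_cache[preload_nr++] = ea;
	model_switch_slb(0);	/* exec on cpu-0: HW SLB and SW cache in sync */

	/* P migrates to cpu-1: switch_slb() there ages the cache and reloads */
	model_preload_age();	/* EA gone from the SW cache ... */
	model_switch_slb(1);	/* ... and from cpu-1's HW SLB, but NOT cpu-0's */

	/* P migrates back to cpu-0: prev == next, so no switch_slb() there */
	if (!model_slb_insert(0, ea))	/* preload re-adds the evicted EA */
		puts("multi-hit: EA still present in cpu-0 HW SLB");
	return 0;
}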
The following timeline shows how stale SLB entries are created and can cause a multi-hit error when a process moves between CPUs without an MMU context switch.
 CPU 0                                     CPU 1
 ------                                    ------
 Process P exec                            swapper/1
  load_elf_binary
   begin_new_exec
    activate_mm
     switch_mm_irqs_off
      switch_mmu_context
       switch_slb
       /*
        * This invalidates all
        * the entries in the HW
        * and sets up the new HW
        * SLB entries as per the
        * preload cache.
        */
 context_switch
 sched_migrate_task migrates process P
 to cpu-1

 Process swapper/0                         context switch (to process P)
                                           (uses mm_struct of Process P)
                                           switch_mm_irqs_off()
                                            switch_slb
                                             load_slb++
                                             /*
                                              * load_slb becomes 0 here
                                              * and we evict an entry from
                                              * the preload cache with
                                              * preload_age(). We still
                                              * keep the HW SLB and preload
                                              * cache in sync, because all
                                              * HW SLB entries get evicted
                                              * anyway in switch_slb during
                                              * SLBIA. We then only add
                                              * those entries back to the
                                              * HW SLB which are currently
                                              * present in the preload
                                              * cache (after eviction).
                                              */
                                           load_elf_binary continues...
                                            setup_new_exec()
                                             slb_setup_new_exec()

                                           sched_switch event
                                           sched_migrate_task migrates
                                           process P to cpu-0

 context_switch from swapper/0 to Process P
  switch_mm_irqs_off()
  /*
   * Since both prev and next mm_structs are the same, we don't call
   * switch_mmu_context(). This will cause the HW SLB and SW preload
   * cache to go out of sync in preload_new_slb_context, because there
   * was an SLB entry which was evicted from both the HW SLB and the
   * preload cache on cpu-1. Later, in preload_new_slb_context(), when
   * we try to add the same preload entry again, we add it to the SW
   * preload cache and then to the HW SLB. Since on cpu-0 this entry
   * was never invalidated, adding it to the HW SLB causes an SLB
   * multi-hit error.
   */
 load_elf_binary continues...
  START_THREAD
   start_thread
    preload_new_slb_context
    /*
     * This tries to add a new EA to the preload cache which was earlier
     * evicted from both the cpu-1 HW SLB and the preload cache. This
     * causes the HW SLB of cpu-0 to go out of sync with the SW preload
     * cache. The reason is that when we context switched back on cpu-0,
     * we should ideally have called switch_mmu_context(), which would
     * bring the HW SLB entries on cpu-0 in sync with the SW preload
     * cache entries by setting up the mmu context properly. But we
     * didn't do that, since the prev mm_struct running on cpu-0 was the
     * same as the next mm_struct (which is true for swapper / kernel
     * threads). So when we now try to add this new entry into the HW
     * SLB of cpu-0, we hit an SLB multi-hit error.
     */
Okay, so what happens is CPU0 has SLB entries remaining from when P last ran on there, and the preload aging happens on CPU1 at a time when that CPU does clear its SLB. That slb aging step doesn't account for the fact CPU0 SLB entries still exist.
Yes, that is right.
WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62 assert_slb_presence+0x2c/0x50
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12 VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP:  c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700   Not tainted  (6.16.0-rc3-dirty)
MSR:  8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 28888482  XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
  0x7fffceb5ffff (unreliable)
  preload_new_slb_context+0x100/0x1a0
  start_thread+0x26c/0x420
  load_elf_binary+0x1b04/0x1c40
  bprm_execve+0x358/0x680
  do_execveat_common+0x1f8/0x240
  sys_execve+0x58/0x70
  system_call_exception+0x114/0x300
  system_call_common+0x160/0x2c4
To fix this issue, we now always switch the MMU context on the hash MMU when the SLB preload cache has aged. With this change, the SLB multi-hit error no longer occurs.
cc: Christophe Leroy <christophe.leroy@csgroup.eu>
cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: Michael Ellerman <mpe@ellerman.id.au>
cc: Nicholas Piggin <npiggin@gmail.com>
Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
cc: <stable@vger.kernel.org>
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
v1 -> v2 : Changed commit message and added a comment in switch_mm_irqs_off()
v1 - https://lore.kernel.org/all/20250731161027.966196-1-donettom@linux.ibm.com/
 arch/powerpc/mm/book3s64/slb.c | 2 +-
 arch/powerpc/mm/mmu_context.c  | 7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index 6b783552403c..08daac3f978c 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 	 * SLB preload cache.
 	 */
 	tsk->thread.load_slb++;
-	if (!tsk->thread.load_slb) {
+	if (tsk->thread.load_slb == U8_MAX) {
 		unsigned long pc = KSTK_EIP(tsk);

 		preload_age(ti);
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 3e3af29b4523..95455d787288 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -83,8 +83,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	/* Some subarchs need to track the PGD elsewhere */
 	switch_mm_pgdir(tsk, next);

-	/* Nothing else to do if we aren't actually switching */
-	if (prev == next)
+	/*
+	 * Nothing else to do if we aren't actually switching and
+	 * the preload slb cache has not aged
+	 */
+	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
 		return;

 	/*
I see a couple of issues with this fix. First of all, it's a bit wrong to call the subsequent switch_mm functions if prev == next; they are not all powerpc specific. We could work around that somehow with some hash-specific knowledge. But worse, I think, is that load_slb could be incremented again if we context switched P again before migrating back here, and then we would miss it.
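To make that window concrete, here is a tiny illustrative user-space snippet (not kernel code; UCHAR_MAX stands in for the kernel's U8_MAX). With the v2 check, the aging is only visible while the counter still reads 255, so a single extra context switch on cpu-1 hides it from the prev == next test on cpu-0:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned char load_slb = UCHAR_MAX - 1;	/* stand-in for thread.load_slb */

	load_slb++;	/* switch_slb() on cpu-1: counter hits 255, cache is aged */
	printf("right after aging:     seen by the prev==next check? %d\n",
	       load_slb == UCHAR_MAX);		/* 1 */

	load_slb++;	/* P context-switched on cpu-1 again before migrating back */
	printf("one more switch later: seen by the prev==next check? %d\n",
	       load_slb == UCHAR_MAX);		/* 0 -> aging missed on cpu-0 */
	return 0;
}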
Aah right. Got too involved in the issue and missed seeing that coming. You are right, if the process is context switched twice before coming back to the original cpu (while still in the load_elf_binary() path), we can still hit the same issue. Though the probability of hitting it must be very low, it is still possible.
How about removing preload_new_slb_context() and slb_setup_new_exec() entirely? Then SLB preload becomes a much simpler thing that is only loaded after the SLB has been cleared. Those functions were always a bit janky, and for performance, context switch is the most important I think; new thread/proc creation less so.
Thanks for the suggestion. Right, the above change should not affect context switch, which is the more performance-critical path.
I agree, removing these two functions should solve the reported problem, since they are the only callers where we don't invalidate the HW SLB before preloading. switch_slb(), which gets called during context switch, on the other hand takes care of that.
I see in the original commit you had measured a context_switch benchmark which showed a 27% performance improvement. Was this the context_switch test in will-it-scale? Is there any workload you think we could run to confirm that no regression is observed with these changes (including for proc/thread creation)?
Thanks, Nick
Thanks again for taking a look. Appreciate your help!
-ritesh