The kernel test robot has reported:
BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28
 lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0
CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470
Call Trace:
 <IRQ>
 __dump_stack (lib/dump_stack.c:95)
 dump_stack_lvl (lib/dump_stack.c:123)
 dump_stack (lib/dump_stack.c:130)
 spin_dump (kernel/locking/spinlock_debug.c:71)
 do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?)
 _raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138)
 __free_frozen_pages (mm/page_alloc.c:2973)
 ___free_pages (mm/page_alloc.c:5295)
 __free_pages (mm/page_alloc.c:5334)
 tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290)
 ? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289)
 ? rcu_core (kernel/rcu/tree.c:?)
 rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861)
 rcu_core_si (kernel/rcu/tree.c:2879)
 handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623)
 __irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725)
 irq_exit_rcu (kernel/softirq.c:741)
 sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052)
 </IRQ>
 <TASK>
RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
 free_pcppages_bulk (mm/page_alloc.c:1494)
 drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632)
 __drain_all_pages (mm/page_alloc.c:2731)
 drain_all_pages (mm/page_alloc.c:2747)
 kcompactd (mm/compaction.c:3115)
 kthread (kernel/kthread.c:465)
 ? __cfi_kcompactd (mm/compaction.c:3166)
 ? __cfi_kthread (kernel/kthread.c:412)
 ret_from_fork (arch/x86/kernel/process.c:164)
 ? __cfi_kthread (kernel/kthread.c:412)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
 </TASK>
Matthew has analyzed the report and identified that in drain_pages_zone() we are in a section protected by spin_lock(&pcp->lock) when an interrupt arrives and attempts spin_trylock() on the same lock. The code is designed to work this way without disabling IRQs; the trylock is expected to occasionally fail, with a fallback for that case. However, the SMP=n spinlock implementation assumes spin_trylock() can never fail, and thus makes it effectively a no-op. Here the enabled lock debugging catches the problem, but otherwise it could corrupt the pcp structure.
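For context: with SMP=n and PREEMPT_RT=n there is no other CPU that could hold the lock, so the spinlock fast paths compile down to almost nothing. A simplified sketch of what the UP implementation (include/linux/spinlock_api_up.h) effectively does — illustrative, not the verbatim kernel code:

/*
 * Illustrative sketch of the UP spinlock assumption (not verbatim kernel
 * code): locking reduces to disabling preemption, and trylock
 * unconditionally reports success, because with a single CPU and
 * preemption disabled no other task can be holding the lock.
 */
#define up_spin_lock(lock)	do { preempt_disable(); } while (0)
#define up_spin_trylock(lock)	({ preempt_disable(); 1; })	/* always "succeeds" */

That assumption breaks down when the lock holder is interrupted and the trylock is attempted from IRQ context, which is exactly what happens here.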
The problem was introduced by commit 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations"). The pcp locking scheme recognizes the need to disable IRQs to prevent nested spin_trylock() sections on SMP=n, but the need to prevent the nesting inside spin_lock() sections was not recognized. Fix it by introducing local wrappers that turn spin_lock() into spin_lock_irqsave() with SMP=n and use them in all places that do spin_lock(&pcp->lock).
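A hypothetical SMP=n interleaving matching the report above (illustrative only; function names taken from the call trace):

  kcompactd (task context)           timer IRQ -> RCU softirq
  ------------------------           ------------------------
  drain_pages_zone()
    spin_lock(&pcp->lock)            /* no-op on UP, IRQs stay enabled */
      <interrupt>
                                     tlb_remove_table_rcu()
                                       __free_frozen_pages()
                                         spin_trylock(&pcp->lock)
                                           /* cannot fail on UP, "succeeds" */
                                         ...modifies the pcp lists...
      </interrupt>
    ...modifies the same pcp lists -> corruption
    spin_unlock(&pcp->lock)

With spin_lock_irqsave() the interrupt cannot arrive inside the critical section, so the nested trylock cannot happen on UP.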
Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
This fix is intentionally self-contained and does not try to expand upon the
existing pcp[u]_spin() helpers. This is to make stable backports easier, given
the recent cleanups to those helpers.
We could follow up with a proper integration into those helpers going forward.
However, I think the SMP=n assumptions of the UP spinlock implementation are
simply wrong. It should be valid to do a spin_lock() without disabling IRQs
and rely on a nested spin_trylock() failing. I will thus first try proposing
removal of the UP implementation. That would be in line with the current trend
of removing code optimized for a minority configuration when it hurts
maintainability for the majority (c.f. the recent scheduler SMP=n removal).
---
 mm/page_alloc.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..ec3551d56cde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -167,6 +167,31 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
 	pcp_trylock_finish(UP_flags);					\
 })
 
+/*
+ * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e.
+ * a potentially remote cpu drain) and get interrupted by an operation that
+ * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
+ * spinlock assumptions making the trylock a no-op. So we have to turn that
+ * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
+ * remote cpu's so we can only be locking the only existing local one.
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+static inline void __flags_noop(unsigned long *flags) { }
+#define spin_lock_maybe_irqsave(lock, flags)				\
+({									\
+	__flags_noop(&(flags));						\
+	spin_lock(lock);						\
+})
+#define spin_unlock_maybe_irqrestore(lock, flags)			\
+({									\
+	spin_unlock(lock);						\
+	__flags_noop(&(flags));						\
+})
+#else
+#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
+#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
+#endif
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -2556,6 +2581,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
 	int high_min, to_drain, to_drain_batched, batch;
+	unsigned long UP_flags;
 	bool todo = false;
 
 	high_min = READ_ONCE(pcp->high_min);
@@ -2575,9 +2601,9 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 	to_drain = pcp->count - pcp->high;
 	while (to_drain > 0) {
 		to_drain_batched = min(to_drain, batch);
-		spin_lock(&pcp->lock);
+		spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
 		free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
-		spin_unlock(&pcp->lock);
+		spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
 		todo = true;
 
 		to_drain -= to_drain_batched;
@@ -2594,14 +2620,15 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
+	unsigned long UP_flags;
 	int to_drain, batch;
 
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0) {
-		spin_lock(&pcp->lock);
+		spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock(&pcp->lock);
+		spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
 	}
 }
 #endif
@@ -2612,10 +2639,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+	unsigned long UP_flags;
 	int count;
 
 	do {
-		spin_lock(&pcp->lock);
+		spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
 		count = pcp->count;
 		if (count) {
 			int to_drain = min(count,
@@ -2624,7 +2652,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 			free_pcppages_bulk(zone, to_drain, pcp, 0);
 			count -= to_drain;
 		}
-		spin_unlock(&pcp->lock);
+		spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
 	} while (count);
 }
 
@@ -6109,6 +6137,7 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
 {
 	struct per_cpu_pages *pcp;
 	struct cpu_cacheinfo *cci;
+	unsigned long UP_flags;
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	cci = get_cpu_cacheinfo(cpu);
@@ -6119,12 +6148,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
 	 * This can reduce zone lock contention without hurting
 	 * cache-hot pages sharing.
 	 */
-	spin_lock(&pcp->lock);
+	spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
 	if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
 		pcp->flags |= PCPF_FREE_HIGH_BATCH;
 	else
 		pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
-	spin_unlock(&pcp->lock);
+	spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
 }
 
 void setup_pcp_cacheinfo(unsigned int cpu)

---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20260105-fix-pcp-up-3c88c09752ec
Best regards,
On Mon, 05 Jan 2026 16:08:56 +0100
Vlastimil Babka <vbabka@suse.cz> wrote:
> +++ b/mm/page_alloc.c
> @@ -167,6 +167,31 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
>  	pcp_trylock_finish(UP_flags);					\
>  })
>  
> +/*
> + * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e.
> + * a potentially remote cpu drain) and get interrupted by an operation that
> + * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
> + * spinlock assumptions making the trylock a no-op. So we have to turn that
> + * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
> + * remote cpu's so we can only be locking the only existing local one.
> + */
> +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
> +static inline void __flags_noop(unsigned long *flags) { }
> +#define spin_lock_maybe_irqsave(lock, flags)				\
> +({									\
> +	__flags_noop(&(flags));						\
> +	spin_lock(lock);						\
> +})
> +#define spin_unlock_maybe_irqrestore(lock, flags)			\
> +({									\
> +	spin_unlock(lock);						\
> +	__flags_noop(&(flags));						\
> +})
> +#else
> +#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
> +#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
> +#endif
These are very generic looking names for something specific for page_alloc.c. Could you add a prefix of some kind to make it easy to see that these are specific to the mm code?
mm_spin_lock_maybe_irqsave() ?
Thanks,
-- Steve
On 1/5/26 22:40, Steven Rostedt wrote:
> On Mon, 05 Jan 2026 16:08:56 +0100
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> +++ b/mm/page_alloc.c
>> @@ -167,6 +167,31 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
>>  	pcp_trylock_finish(UP_flags);					\
>>  })
>>  
>> +/*
>> + * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e.
>> + * a potentially remote cpu drain) and get interrupted by an operation that
>> + * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
>> + * spinlock assumptions making the trylock a no-op. So we have to turn that
>> + * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
>> + * remote cpu's so we can only be locking the only existing local one.
>> + */
>> +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
>> +static inline void __flags_noop(unsigned long *flags) { }
>> +#define spin_lock_maybe_irqsave(lock, flags)				\
>> +({									\
>> +	__flags_noop(&(flags));						\
>> +	spin_lock(lock);						\
>> +})
>> +#define spin_unlock_maybe_irqrestore(lock, flags)			\
>> +({									\
>> +	spin_unlock(lock);						\
>> +	__flags_noop(&(flags));						\
>> +})
>> +#else
>> +#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
>> +#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
>> +#endif
> 
> These are very generic looking names for something specific for
> page_alloc.c. Could you add a prefix of some kind to make it easy to see
> that these are specific to the mm code?
> 
> mm_spin_lock_maybe_irqsave() ?
OK, I think it's best like this:
----8<----
From a6da5d9e3db005a2f44f3196814d7253dce21d3e Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 6 Jan 2026 09:23:37 +0100
Subject: [PATCH] mm/page_alloc: prevent pcp corruption with SMP=n - fix
Add pcp_ prefix to the spin_lock_irqsave wrappers, per Steven. With that,
make them also take the pcp pointer and reference the lock field themselves,
to match the existing pcp trylock wrappers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec3551d56cde..dd72ff39da8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -177,19 +177,21 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
  */
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
 static inline void __flags_noop(unsigned long *flags) { }
-#define spin_lock_maybe_irqsave(lock, flags)				\
+#define pcp_spin_lock_maybe_irqsave(ptr, flags)				\
 ({									\
 	__flags_noop(&(flags));						\
-	spin_lock(lock);						\
+	spin_lock(&(ptr)->lock);					\
 })
-#define spin_unlock_maybe_irqrestore(lock, flags)			\
+#define pcp_spin_unlock_maybe_irqrestore(ptr, flags)			\
 ({									\
-	spin_unlock(lock);						\
+	spin_unlock(&(ptr)->lock);					\
 	__flags_noop(&(flags));						\
 })
 #else
-#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
-#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
+#define pcp_spin_lock_maybe_irqsave(ptr, flags) \
+	spin_lock_irqsave(&(ptr)->lock, flags)
+#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \
+	spin_unlock_irqrestore(&(ptr)->lock, flags)
 #endif
 
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
@@ -2601,9 +2603,9 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 	to_drain = pcp->count - pcp->high;
 	while (to_drain > 0) {
 		to_drain_batched = min(to_drain, batch);
-		spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
+		pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
 		free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
-		spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
+		pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
 		todo = true;
 
 		to_drain -= to_drain_batched;
@@ -2626,9 +2628,9 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0) {
-		spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
+		pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
+		pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
 	}
 }
 #endif
@@ -2643,7 +2645,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	int count;
 
 	do {
-		spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
+		pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
 		count = pcp->count;
 		if (count) {
 			int to_drain = min(count,
@@ -2652,7 +2654,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 			free_pcppages_bulk(zone, to_drain, pcp, 0);
 			count -= to_drain;
 		}
-		spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
+		pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
 	} while (count);
 }
 
@@ -6148,12 +6150,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
 	 * This can reduce zone lock contention without hurting
 	 * cache-hot pages sharing.
 	 */
-	spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
+	pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
 	if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
 		pcp->flags |= PCPF_FREE_HIGH_BATCH;
 	else
 		pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
-	spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
+	pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
 }
 
 void setup_pcp_cacheinfo(unsigned int cpu)
On Tue, 6 Jan 2026 09:28:29 +0100
Vlastimil Babka <vbabka@suse.cz> wrote:

>>> + */
>>> +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
>>> +static inline void __flags_noop(unsigned long *flags) { }
>>> +#define spin_lock_maybe_irqsave(lock, flags)				\
>>> +({									\
>>> +	__flags_noop(&(flags));						\
>>> +	spin_lock(lock);						\
>>> +})
>>> +#define spin_unlock_maybe_irqrestore(lock, flags)			\
>>> +({									\
>>> +	spin_unlock(lock);						\
>>> +	__flags_noop(&(flags));						\
>>> +})
>>> +#else
>>> +#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
>>> +#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
>>> +#endif
>> 
>> These are very generic looking names for something specific for
>> page_alloc.c. Could you add a prefix of some kind to make it easy to see
>> that these are specific to the mm code?
>> 
>> mm_spin_lock_maybe_irqsave() ?
> 
> OK, I think it's best like this:
Yeah, thanks.
-- Steve