The patch below does not apply to the 5.10-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y git checkout FETCH_HEAD git cherry-pick -x 1007843a91909a4995ee78a538f62d8665705b66 # <resolve conflicts, build, test, etc.> git commit -s git send-email --to 'stable@vger.kernel.org' --in-reply-to '2023042446-gills-morality-d566@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock") 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1007843a91909a4995ee78a538f62d8665705b66 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Date: Tue, 4 Apr 2023 23:31:58 +0900 Subject: [PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
syzbot is reporting circular locking dependency which involves zonelist_update_seq seqlock [1], for this lock is checked by memory allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0 ---- __build_all_zonelists() { write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd // e.g. timer interrupt handler runs at this moment some_timer_func() { kmalloc(GFP_ATOMIC) { __alloc_pages_slowpath() { read_seqbegin(&zonelist_update_seq) { // spins forever because zonelist_update_seq.seqcount is odd } } } } // e.g. timer interrupt handler finishes write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even }
This deadlock scenario can be easily eliminated by not calling read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation requests. But Michal Hocko does not know whether we should go with this approach.
Another deadlock scenario which syzbot is reporting is a race between kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with port->lock held and printk() from __build_all_zonelists() with zonelist_update_seq held.
CPU0 CPU1 ---- ---- pty_write() { tty_insert_flip_string_and_push_buffer() { __build_all_zonelists() { write_seqlock(&zonelist_update_seq); build_zonelists() { printk() { vprintk() { vprintk_default() { vprintk_emit() { console_unlock() { console_flush_all() { console_emit_next_record() { con->write() = serial8250_console_write() { spin_lock_irqsave(&port->lock, flags); tty_insert_flip_string() { tty_insert_flip_string_fixed_flag() { __tty_buffer_request_room() { tty_buffer_alloc() { kmalloc(GFP_ATOMIC | __GFP_NOWARN) { __alloc_pages_slowpath() { zonelist_iter_begin() { read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held } } } } } } } } spin_unlock_irqrestore(&port->lock, flags); // message is printed to console spin_unlock_irqrestore(&port->lock, flags); } } } } } } } } } write_sequnlock(&zonelist_update_seq); } } }
This deadlock scenario can be eliminated by
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount being odd by disabling synchronous printk(), latency at read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e. do not record unnecessary locking dependency) from interrupt context is still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq) section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA... Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2] Reported-by: syzbot syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1] Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Acked-by: Michal Hocko mhocko@suse.com Acked-by: Mel Gorman mgorman@techsingularity.net Cc: Petr Mladek pmladek@suse.com Cc: David Hildenbrand david@redhat.com Cc: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Cc: John Ogness john.ogness@linutronix.de Cc: Patrick Daly quic_pdaly@quicinc.com Cc: Sergey Senozhatsky senozhatsky@chromium.org Cc: Steven Rostedt rostedt@goodmis.org Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7136c36c5d01..e8b4f294d763 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6632,7 +6632,21 @@ static void __build_all_zonelists(void *data) int nid; int __maybe_unused cpu; pg_data_t *self = data; + unsigned long flags;
+ /* + * Explicitly disable this CPU's interrupts before taking seqlock + * to prevent any IRQ handler from calling into the page allocator + * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock. + */ + local_irq_save(flags); + /* + * Explicitly disable this CPU's synchronous printk() before taking + * seqlock to prevent any printk() from trying to hold port->lock, for + * tty_insert_flip_string_and_push_buffer() on other CPU might be + * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. + */ + printk_deferred_enter(); write_seqlock(&zonelist_update_seq);
#ifdef CONFIG_NUMA @@ -6671,6 +6685,8 @@ static void __build_all_zonelists(void *data) }
write_sequnlock(&zonelist_update_seq); + printk_deferred_exit(); + local_irq_restore(flags); }
static noinline void __init
commit 85e3e7fbbb720b9897fba9a99659e31cbd1c082e upstream.
[This patch implements subset of original commit 85e3e7fbbb72 ("printk: remove NMI tracking") where commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock") depends on, for commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") was backported to stable.]
All NMI contexts are handled the same as the safe context: store the message and defer printing. There is no need to have special NMI context tracking for this. Using in_nmi() is enough.
There are several parts of the kernel that are manually calling into the printk NMI context tracking in order to cause general printk deferred printing:
arch/arm/kernel/smp.c arch/powerpc/kexec/crash.c kernel/trace/trace.c
For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new function pair printk_deferred_enter/exit that explicitly achieves the same objective.
For ftrace, remove the printk context manipulation completely. It was added in commit 03fc7f9c99c1 ("printk/nmi: Prevent deadlock when accessing the main log buffer in NMI"). The purpose was to enforce storing messages directly into the ring buffer even in NMI context. It really should have only modified the behavior in NMI context. There is no need for a special behavior any longer. All messages are always stored directly now. The console deferring is handled transparently in vprintk().
Signed-off-by: John Ogness john.ogness@linutronix.de [pmladek@suse.com: Remove special handling in ftrace.c completely. Signed-off-by: Petr Mladek pmladek@suse.com Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de [penguin-kernel: Copy only printk_deferred_{enter,safe}() definition ] Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp --- include/linux/printk.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/include/linux/printk.h b/include/linux/printk.h index fe7eb2351610..344f6da3d4c3 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -623,4 +623,23 @@ static inline void print_hex_dump_debug(const char *prefix_str, int prefix_type, #define print_hex_dump_bytes(prefix_str, prefix_type, buf, len) \ print_hex_dump_debug(prefix_str, prefix_type, 16, 1, buf, len, true)
+#ifdef CONFIG_PRINTK +extern void __printk_safe_enter(void); +extern void __printk_safe_exit(void); +/* + * The printk_deferred_enter/exit macros are available only as a hack for + * some code paths that need to defer all printk console printing. Interrupts + * must be disabled for the deferred duration. + */ +#define printk_deferred_enter __printk_safe_enter +#define printk_deferred_exit __printk_safe_exit +#else +static inline void printk_deferred_enter(void) +{ +} +static inline void printk_deferred_exit(void) +{ +} +#endif + #endif
On Sun, May 14, 2023 at 01:41:27PM +0900, Tetsuo Handa wrote:
commit 85e3e7fbbb720b9897fba9a99659e31cbd1c082e upstream.
[This patch implements subset of original commit 85e3e7fbbb72 ("printk: remove NMI tracking") where commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock") depends on, for commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") was backported to stable.]
All NMI contexts are handled the same as the safe context: store the message and defer printing. There is no need to have special NMI context tracking for this. Using in_nmi() is enough.
There are several parts of the kernel that are manually calling into the printk NMI context tracking in order to cause general printk deferred printing:
arch/arm/kernel/smp.c arch/powerpc/kexec/crash.c kernel/trace/trace.c
For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new function pair printk_deferred_enter/exit that explicitly achieves the same objective.
For ftrace, remove the printk context manipulation completely. It was added in commit 03fc7f9c99c1 ("printk/nmi: Prevent deadlock when accessing the main log buffer in NMI"). The purpose was to enforce storing messages directly into the ring buffer even in NMI context. It really should have only modified the behavior in NMI context. There is no need for a special behavior any longer. All messages are always stored directly now. The console deferring is handled transparently in vprintk().
Signed-off-by: John Ogness john.ogness@linutronix.de [pmladek@suse.com: Remove special handling in ftrace.c completely. Signed-off-by: Petr Mladek pmladek@suse.com Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de [penguin-kernel: Copy only printk_deferred_{enter,safe}() definition ] Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
include/linux/printk.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/include/linux/printk.h b/include/linux/printk.h index fe7eb2351610..344f6da3d4c3 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -623,4 +623,23 @@ static inline void print_hex_dump_debug(const char *prefix_str, int prefix_type, #define print_hex_dump_bytes(prefix_str, prefix_type, buf, len) \ print_hex_dump_debug(prefix_str, prefix_type, 16, 1, buf, len, true) +#ifdef CONFIG_PRINTK +extern void __printk_safe_enter(void); +extern void __printk_safe_exit(void); +/*
- The printk_deferred_enter/exit macros are available only as a hack for
- some code paths that need to defer all printk console printing. Interrupts
- must be disabled for the deferred duration.
- */
+#define printk_deferred_enter __printk_safe_enter +#define printk_deferred_exit __printk_safe_exit +#else +static inline void printk_deferred_enter(void) +{ +} +static inline void printk_deferred_exit(void) +{ +} +#endif
#endif
2.34.1
All now queued up, thanks.
greg k-h
On 2023/05/15 21:56, Greg Kroah-Hartman wrote:
On Sun, May 14, 2023 at 01:41:27PM +0900, Tetsuo Handa wrote:
commit 85e3e7fbbb720b9897fba9a99659e31cbd1c082e upstream.
[This patch implements subset of original commit 85e3e7fbbb72 ("printk: remove NMI tracking") where commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock") depends on, for commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") was backported to stable.]
All now queued up, thanks.
Thank you. Then, please also queue original "[PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq" (Message ID listed below) to stable kernels.
2023042446-gills-morality-d566@gregkh 2023042449-wobbling-putdown-13ea@gregkh 2023042452-stopper-engross-e9da@gregkh 2023042455-skinless-muzzle-1c50@gregkh
On Mon, May 15, 2023 at 10:11:33PM +0900, Tetsuo Handa wrote:
On 2023/05/15 21:56, Greg Kroah-Hartman wrote:
On Sun, May 14, 2023 at 01:41:27PM +0900, Tetsuo Handa wrote:
commit 85e3e7fbbb720b9897fba9a99659e31cbd1c082e upstream.
[This patch implements subset of original commit 85e3e7fbbb72 ("printk: remove NMI tracking") where commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock") depends on, for commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") was backported to stable.]
All now queued up, thanks.
Thank you. Then, please also queue original "[PATCH] mm/page_alloc: fix potential deadlock on zonelist_update_seq" (Message ID listed below) to stable kernels.
2023042446-gills-morality-d566@gregkh 2023042449-wobbling-putdown-13ea@gregkh 2023042452-stopper-engross-e9da@gregkh 2023042455-skinless-muzzle-1c50@gregkh
Now done, thanks.
greg k-h
linux-stable-mirror@lists.linaro.org