Before 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first"), the ALSA non-contiguous allocator always called the ALSA fallback allocator in the non-IOMMU case. This allocated non-contiguous memory consisting of progressively smaller contiguous chunks. Allocation was fast because __GFP_NORETRY was OR-ed into the gfp flags.
After 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first"), the code tries the DMA non-contiguous allocator first and only then falls back to the ALSA fallback allocator. In the non-IOMMU case, the former supports only a single contiguous chunk.
We have observed experimentally that, under heavy memory fragmentation, allocating a large-ish contiguous chunk with __GFP_RETRY_MAYFAIL triggers an indefinite hang in the DMA non-contiguous allocator. This is high-impact: an occurrence triggers a device reboot, resulting in loss of user state.
Fix the non-IOMMU path by letting dma_alloc_noncontiguous() fail quickly, so it does not get stuck searching for that elusive large contiguous chunk; in that case we fall back to the ALSA fallback allocator.
Note that the IOMMU DMA non-contiguous allocator is not affected: while assembling an array of pages, it tries successively smaller contiguous allocations and lets higher-order chunk allocations fail quickly.
Suggested-by: Sven van Ashbrook <svenva@chromium.org>
Suggested-by: Brian Geffon <bgeffon@google.com>
Fixes: 9d8e536d36e7 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
Cc: stable@vger.kernel.org
Cc: Sven van Ashbrook <svenva@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Curtis Malainey <cujomalainey@chromium.org>
Signed-off-by: Karthikeyan Ramasubramanian <kramasub@chromium.org>
---
 sound/core/memalloc.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
index f901504b5afc1..5f6526a0d731c 100644
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -540,13 +540,18 @@ static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
 {
 	struct sg_table *sgt;
 	void *p;
+	gfp_t gfp_flags = DEFAULT_GFP;
 
 #ifdef CONFIG_SND_DMA_SGBUF
 	if (cpu_feature_enabled(X86_FEATURE_XENPV))
 		return snd_dma_sg_fallback_alloc(dmab, size);
+
+	/* Non-IOMMU case: prevent allocator from searching forever */
+	if (!get_dma_ops(dmab->dev.dev))
+		gfp_flags |= __GFP_NORETRY;
 #endif
 	sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
-				      DEFAULT_GFP, 0);
+				      gfp_flags, 0);
 #ifdef CONFIG_SND_DMA_SGBUF
 	if (!sgt && !get_dma_ops(dmab->dev.dev))
 		return snd_dma_sg_fallback_alloc(dmab, size);
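As a side note for readers who have not looked at the fallback path: the "progressively smaller contiguous chunks" strategy described above behaves roughly like the minimal userspace model below. This is a sketch only, with an invented failure threshold; the real implementation is snd_dma_sg_fallback_alloc() in sound/core/memalloc.c.

/*
 * Minimal userspace model of the fallback strategy: satisfy a large
 * buffer from progressively smaller contiguous chunks, giving up
 * quickly on each chunk size (the kernel uses __GFP_NORETRY for this).
 * The order-2 failure threshold is invented for illustration, and
 * chunks are simply leaked here; the real code records them in an
 * sg_table and frees them on error.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* Stand-in for a quick-fail allocation of 2^order contiguous pages. */
static void *try_alloc_chunk(unsigned int order)
{
	if (order > 2)		/* pretend fragmentation kills high orders */
		return NULL;
	return malloc(PAGE_SIZE << order);
}

/* Fill 'size' bytes from whatever chunk sizes can still be found. */
static bool alloc_noncontig(size_t size, unsigned int max_order)
{
	unsigned int order = max_order;
	size_t got = 0;

	while (got < size) {
		void *chunk = try_alloc_chunk(order);

		if (!chunk) {
			if (order == 0)
				return false;	/* even single pages failed */
			order--;		/* retry with smaller chunks */
			continue;
		}
		got += PAGE_SIZE << order;
	}
	return true;
}

int main(void)
{
	/* A 512 KiB buffer ends up as many order-2 (16 KiB) chunks. */
	printf("%s\n", alloc_noncontig(512 * 1024, 10) ? "ok" : "failed");
	return 0;
}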
On Thu, 15 Feb 2024 01:07:25 +0100, Karthikeyan Ramasubramanian wrote:
I chatted with Vlastimil, and he recommended getting the linux-mm people involved, as such a problem basically shouldn't happen on the page allocator side. So let's widen the audience.
To recap the thread: the problem is that a dma_alloc_noncontiguous() call with high-order pages and __GFP_RETRY_MAYFAIL leads to an indefinite stall. (It was __GFP_NORETRY beforehand.) This looks like the code path with direct page allocation, where no IOMMU is involved.
Karthikeyan, Sven, and co: could you guys show the stack trace at the stall? This may shed more light.
Also, Vlastimil suggested that tracepoints would be helpful if that's really in the page allocator, too.
Thanks!
Takashi
On 2/19/24 12:36, Takashi Iwai wrote:
> Karthikeyan, Sven, and co: could you guys show the stack trace at the stall? This may shed more light.
Yeah, if the infinite loop with __GFP_RETRY_MAYFAIL happens in a call to __alloc_pages() and not in some retry loop around it in an upper layer (I tried to check the dma functions but got lost quickly, so the exact call stack would be useful), we definitely want to know the details. It shouldn't happen for costly orders (>3), because the retries are hard-limited for those despite apparent progress of reclaim or compaction.
Takashi, Vlastimil: thanks so much for the engagement! See below.
On 2/19/24 12:36, Takashi Iwai wrote:
> Karthikeyan, Sven, and co: could you guys show the stack trace at the stall? This may shed more light.
Here are two typical stack traces at the stall. Note that the timer interrupt is just a software watchdog that fires to generate the stack trace. This is running the v6.1 kernel. We should be able to reproduce this on v6.6 as well if need be.
<4>[310289.546429] <TASK>
<4>[310289.546431] asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[310289.546434] RIP: 0010:super_cache_count+0xc/0xea
<4>[310289.546438] Code: ff ff e8 48 ac e3 ff 4c 89 e0 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 1f 44 00 00 f6 87 23 fc ff ff 20 <75> 08 31 c0 c3 cc cc cc cc cc 55 48 89 e5 41 57 41 56 41 54 53 49
<4>[310289.546440] RSP: 0018:ffffa64e8aed35c0 EFLAGS: 00000202
<4>[310289.546443] RAX: 0000000000000080 RBX: 0000000000000400 RCX: 0000000000000000
<4>[310289.546445] RDX: ffffffffa6d66bc8 RSI: ffffa64e8aed3610 RDI: ffff9fd2873dbc30
<4>[310289.546447] RBP: ffffa64e8aed3660 R08: 0000000000000064 R09: 0000000000000000
<4>[310289.546449] R10: ffffffffa6e3b260 R11: ffffffffa5163a52 R12: ffff9fd2873dbc50
<4>[310289.546451] R13: 0000000000046c00 R14: 0000000000000000 R15: 0000000000000000
<4>[310289.546453]  ? super_cache_scan+0x199/0x199
<4>[310289.546457]  shrink_slab+0xb3/0x37e
<4>[310289.546460]  shrink_node+0x377/0x110e
<4>[310289.546464]  ? sysvec_apic_timer_interrupt+0x17/0x80
<4>[310289.546467]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[310289.546471]  try_to_free_pages+0x46e/0x857
<4>[310289.546475]  ? psi_task_change+0x7f/0x9c
<4>[310289.546478]  __alloc_pages_slowpath+0x4e2/0xe5c
<4>[310289.546482]  __alloc_pages+0x225/0x2a2
<4>[310289.546486]  __dma_direct_alloc_pages+0xed/0x1cb
<4>[310289.546489]  dma_direct_alloc_pages+0x21/0xa3
<4>[310289.546493]  dma_alloc_noncontiguous+0xd1/0x144
<4>[310289.546496]  snd_dma_noncontig_alloc+0x45/0xe3
<4>[310289.546499]  snd_dma_alloc_dir_pages+0x4f/0x81
<4>[310289.546502]  hda_cl_stream_prepare+0x66/0x15e [snd_sof_intel_hda_common (HASH:1255 1)]
<4>[310289.546510]  hda_dsp_cl_boot_firmware+0xc4/0x2ca [snd_sof_intel_hda_common (HASH:1255 1)]
<4>[310289.546518]  snd_sof_run_firmware+0xca/0x2d7 [snd_sof (HASH:ecd9 2)]
<4>[310289.546526]  ? hda_dsp_resume+0x97/0x1a7 [snd_sof_intel_hda_common (HASH:1255 1)]
<4>[310289.546534]  sof_resume+0x155/0x251 [snd_sof (HASH:ecd9 2)]
<4>[310289.546542]  ? pci_pm_suspend+0x1e7/0x1e7
<4>[310289.546546]  dpm_run_callback+0x3c/0x132
<4>[310289.546549]  device_resume+0x1f7/0x282
<4>[310289.546552]  ? dpm_watchdog_set+0x54/0x54
<4>[310289.546555]  async_resume+0x1f/0x5b
<4>[310289.546558]  async_run_entry_fn+0x2b/0xc5
<4>[310289.546561]  process_one_work+0x1be/0x381
<4>[310289.546564]  worker_thread+0x20b/0x35b
<4>[310289.546568]  kthread+0xde/0xf7
<4>[310289.546571]  ? pr_cont_work+0x54/0x54
<4>[310289.546574]  ? kthread_blkcg+0x32/0x32
<4>[310289.546577]  ret_from_fork+0x1f/0x30
<4>[310289.546580] </TASK>
<4>[171032.151834] <TASK>
<4>[171032.151835] asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[171032.151839] RIP: 0010:_raw_spin_unlock_irq+0x10/0x28
<4>[171032.151842] Code: 2c 70 74 06 c3 cc cc cc cc cc 55 48 89 e5 e8 7e 30 2b ff 5d c3 cc cc cc cc cc 0f 1f 44 00 00 c6 07 00 fb 65 ff 0d af b1 2c 70 <74> 06 c3 cc cc cc cc cc 55 48 89 e5 e8 56 30 2b ff 5d c3 cc cc cc
<4>[171032.151844] RSP: 0018:ffff942447b334d8 EFLAGS: 00000286
<4>[171032.151847] RAX: 0000000000000031 RBX: 0000000000000001 RCX: 0000000000000034
<4>[171032.151849] RDX: 0000000000000031 RSI: 0000000000000002 RDI: ffffffff9103b1b0
<4>[171032.151851] RBP: ffff942447b33660 R08: 0000000000000032 R09: 0000000000000010
<4>[171032.151853] R10: ffffffff9103b370 R11: 00000000ffffffff R12: ffffffff9103b160
<4>[171032.151855] R13: ffffd055000111c8 R14: 0000000000000000 R15: 0000000000000031
<4>[171032.151858]  evict_folios+0xf9e/0x1307
<4>[171032.151861]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[171032.151866]  shrink_node+0x2e8/0x110e
<4>[171032.151870]  ? common_interrupt+0x1c/0x95
<4>[171032.151872]  ? common_interrupt+0x1c/0x95
<4>[171032.151875]  ? asm_common_interrupt+0x22/0x40
<4>[171032.151878]  ? __compaction_suitable+0x7c/0x9d
<4>[171032.151882]  try_to_free_pages+0x46e/0x857
<4>[171032.151885]  ? psi_task_change+0x7f/0x9c
<4>[171032.151889]  __alloc_pages_slowpath+0x4e2/0xe5c
<4>[171032.151893]  __alloc_pages+0x225/0x2a2
<4>[171032.151896]  __dma_direct_alloc_pages+0xed/0x1cb
<4>[171032.151900]  dma_direct_alloc_pages+0x21/0xa3
<4>[171032.151903]  dma_alloc_noncontiguous+0xd1/0x144
<4>[171032.151907]  snd_dma_noncontig_alloc+0x45/0xe3
<4>[171032.151910]  snd_dma_alloc_dir_pages+0x4f/0x81
<4>[171032.151913]  hda_cl_stream_prepare+0x66/0x15e [snd_sof_intel_hda_common (HASH:7df0 1)]
<4>[171032.151921]  hda_dsp_cl_boot_firmware+0xc4/0x2ca [snd_sof_intel_hda_common (HASH:7df0 1)]
<4>[171032.151929]  snd_sof_run_firmware+0xca/0x2d7 [snd_sof (HASH:9f20 2)]
<4>[171032.151937]  ? hda_dsp_resume+0x97/0x1a7 [snd_sof_intel_hda_common (HASH:7df0 1)]
<4>[171032.151945]  sof_resume+0x155/0x251 [snd_sof (HASH:9f20 2)]
<4>[171032.151953]  ? pci_pm_suspend+0x1e7/0x1e7
<4>[171032.151957]  dpm_run_callback+0x3c/0x132
<4>[171032.151960]  device_resume+0x1f7/0x282
<4>[171032.151962]  ? dpm_watchdog_set+0x54/0x54
<4>[171032.151965]  async_resume+0x1f/0x5b
<4>[171032.151968]  async_run_entry_fn+0x2b/0xc5
<4>[171032.151971]  process_one_work+0x1be/0x381
<4>[171032.151975]  worker_thread+0x20b/0x35b
<4>[171032.151978]  kthread+0xde/0xf7
<4>[171032.151981]  ? pr_cont_work+0x54/0x54
<4>[171032.151984]  ? kthread_blkcg+0x32/0x32
<4>[171032.151987]  ret_from_fork+0x1f/0x30
<4>[171032.151991] </TASK>
On Mon, Feb 19, 2024 at 6:40 AM Vlastimil Babka vbabka@suse.cz wrote:
> Yeah, if the infinite loop with __GFP_RETRY_MAYFAIL happens in a call to __alloc_pages() and not in some retry loop around it in an upper layer (I tried to check the dma functions but got lost quickly, so the exact call stack would be useful), we definitely want to know the details. It shouldn't happen for costly orders (>3), because the retries are hard-limited for those despite apparent progress of reclaim or compaction.
Here are our notes of the indefinite stall we saw on v5.10 with iommu SoCs. We did not pursue debugging the stall at the time, in favour of a work-around with the gfp flags. Therefore we only have partial confidence in the notes below. Take them with a block of salt, but they may point in a useful direction.
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath [1] tries to get a page from the freelist. This fails because there is nothing free of that costly order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out [2] because a zone is ready to be compacted; it pretends to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early [3] because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway).
5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic [4] thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
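To make the loop concrete, here is a minimal userspace model of steps 1-6 above. The function names are illustrative stand-ins, not the real page allocator entry points, and the iteration cap exists only so the demo terminates instead of stalling for real.

/* Model of the reclaim/compaction retry loop described in steps 1-6,
 * under the stated assumptions: costly order, __GFP_RETRY_MAYFAIL set,
 * __GFP_IO cleared (as during suspend/resume), heavy fragmentation. */
#include <stdbool.h>
#include <stdio.h>

static bool gfp_io_allowed;	/* false: __GFP_IO masked out */

static bool get_page_from_freelist(void) { return false; }	/* step 2 */

/* Step 3: reclaim bails out because a zone looks ready to be
 * compacted, and pretends it made a single page of progress. */
static long direct_reclaim(void) { return 1; }

/* Step 4: compaction bails out early when __GFP_IO is not set. */
static bool try_compaction(void) { return gfp_io_allowed; }

/* Step 5: skipped compaction is (mis)read as "reclaim more and try
 * again", so the bounded retry counter never kicks in. */
static bool should_compact_retry(bool compaction_ran)
{
	return !compaction_ran;
}

int main(void)
{
	unsigned long loops = 0;

	for (;;) {
		if (get_page_from_freelist())
			break;					/* step 2 */
		long progress = direct_reclaim();		/* step 3 */
		bool compacted = try_compaction();		/* step 4 */
		if (progress > 0 && should_compact_retry(compacted)) {
			if (++loops == 1000000) {	/* demo escape hatch */
				printf("stalled after %lu retries\n", loops);
				return 1;
			}
			continue;			/* step 6: goto 2 */
		}
		return 0;	/* would fail the allocation cleanly */
	}
	return 0;
}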
> Also, Vlastimil suggested that tracepoints would be helpful if that's really in the page allocator, too.
We might be able to generate traces by bailing out of the indefinite stall using a timer, which should hopefully give us a device that's "alive enough" to read the traces.
Can you advise which tracepoints you'd like to see? Is trace-cmd [5] suitable to capture this?
[1] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/thir...
[2] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/thir...
[3] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/thir...
[4] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/thir...
[5] https://chromium.googlesource.com/chromiumos/docs/+/HEAD/kernel_development....
On Tue, 20 Feb 2024 16:52:14 +0100, Sven van Ashbrook wrote:
> Here are two typical stack traces at the stall. Note that the timer interrupt is just a software watchdog that fires to generate the stack trace. This is running the v6.1 kernel. We should be able to reproduce this on v6.6 as well if need be.
Thanks!
Both look like the code path via async PM resume. Were both from the runtime PM resume? Or the system resume?
Takashi
Hi Takashi,
On Wed, Feb 21, 2024 at 4:21 AM Takashi Iwai tiwai@suse.de wrote:
> Both look like the code path via async PM resume. Were both from the runtime PM resume? Or the system resume?
The large firmware allocation that triggers the stall happens in runtime resume.
This means both runtime resume and system resume are affected.
Due to the way runtime PM works, the system suspend path is also affected, because system suspend first wakes up any runtime-suspended devices before executing their system suspend callbacks.
So we get: system active, SOF runtime suspended (audio not active) -> system suspend request -> SOF runtime resume callback runs (stall happens here) -> SOF system suspend callback runs.
On 2/20/24 16:52, Sven van Ashbrook wrote:
> Here are our notes of the indefinite stall we saw on v5.10 with iommu SoCs. We did not pursue debugging the stall at the time, in favour of a work-around with the gfp flags. Therefore we only have partial confidence in the notes below. Take them with a block of salt, but they may point in a useful direction.
Thanks a lot, seems this can indeed happen even in 6.8-rc5. We're mishandling the case where compaction is skipped due to lack of __GFP_IO, which is indeed cleared in suspend/resume. I'll create a fix. Please don't hesitate to report such issues the next time, even if not fully debugged :)
On Wed, Feb 21, 2024 at 4:58 AM Vlastimil Babka vbabka@suse.cz wrote:
> Thanks a lot, seems this can indeed happen even in 6.8-rc5. We're mishandling the case where compaction is skipped due to lack of __GFP_IO, which is indeed cleared in suspend/resume. I'll create a fix.
I'm really grateful for the engagement here!
> Please don't hesitate to report such issues the next time, even if not fully debugged :)
Will do :)
Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such a combination can happen in a suspend/resume context, where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask.
Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway).
5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is mistaking the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4, due to lack of __GFP_IO) for an indication of a lack of order-0 pages, and in step 5 evaluating that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are, however, other places that wrongly assume that compaction can happen while we lack __GFP_IO.
To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it.
Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order
- in_reclaim_compaction(), which will make should_continue_reclaim() return false so that we don't over-reclaim unnecessarily
- __alloc_pages_slowpath(), to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact, and also to skip the early compaction attempt that we do in some cases
Reported-by: Sven van Ashbrook <svenva@chromium.org>
Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzP...
Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
Cc: stable@vger.kernel.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h | 9 +++++++++
 mm/compaction.c     | 7 +------
 mm/page_alloc.c     | 10 ++++++----
 mm/vmscan.c         | 5 ++++-
 4 files changed, 20 insertions(+), 11 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..e2a916cf29c4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -353,6 +353,15 @@ static inline bool gfp_has_io_fs(gfp_t gfp)
 	return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS);
 }
 
+/*
+ * Check if the gfp flags allow compaction - GFP_NOIO is a really
+ * tricky context because the migration might require IO.
+ */
+static inline bool gfp_compaction_allowed(gfp_t gfp_mask)
+{
+	return IS_ENABLED(CONFIG_COMPACTION) && (gfp_mask & __GFP_IO);
+}
+
 extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
 
 #ifdef CONFIG_CONTIG_ALLOC
diff --git a/mm/compaction.c b/mm/compaction.c
index 4add68d40e8d..b961db601df4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2723,16 +2723,11 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum compact_priority prio, struct page **capture)
 {
-	int may_perform_io = (__force int)(gfp_mask & __GFP_IO);
 	struct zoneref *z;
 	struct zone *zone;
 	enum compact_result rc = COMPACT_SKIPPED;
 
-	/*
-	 * Check if the GFP flags allow compaction - GFP_NOIO is really
-	 * tricky context because the migration might require IO
-	 */
-	if (!may_perform_io)
+	if (!gfp_compaction_allowed(gfp_mask))
 		return COMPACT_SKIPPED;
 
 	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..a663202045dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4041,6 +4041,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+	bool can_compact = gfp_compaction_allowed(gfp_mask);
 	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
 	struct page *page = NULL;
 	unsigned int alloc_flags;
@@ -4111,7 +4112,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * Don't try this for allocations that are allowed to ignore
 	 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
 	 */
-	if (can_direct_reclaim &&
+	if (can_direct_reclaim && can_compact &&
 			(costly_order ||
 			   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
 			&& !gfp_pfmemalloc_allowed(gfp_mask)) {
@@ -4209,9 +4210,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * Do not retry costly high order allocations unless they are
-	 * __GFP_RETRY_MAYFAIL
+	 * __GFP_RETRY_MAYFAIL and we can compact
 	 */
-	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+	if (costly_order && (!can_compact ||
+			     !(gfp_mask & __GFP_RETRY_MAYFAIL)))
 		goto nopage;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
@@ -4224,7 +4226,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * implementation of the compaction depends on the sufficient amount
 	 * of free memory (see __compaction_suitable)
 	 */
-	if (did_some_progress > 0 &&
+	if (did_some_progress > 0 && can_compact &&
 			should_compact_retry(ac, order, alloc_flags,
 				compact_result, &compact_priority,
 				&compaction_retries))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..4255619a1a31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5753,7 +5753,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 /* Use reclaim/compaction for costly allocs or under memory pressure */
 static bool in_reclaim_compaction(struct scan_control *sc)
 {
-	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+	if (gfp_compaction_allowed(sc->gfp_mask) && sc->order &&
 			(sc->order > PAGE_ALLOC_COSTLY_ORDER ||
 			 sc->priority < DEF_PRIORITY - 2))
 		return true;
@@ -5998,6 +5998,9 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long watermark;
 
+	if (!gfp_compaction_allowed(sc->gfp_mask))
+		return false;
+
 	/* Allocation can already succeed, nothing to do */
 	if (zone_watermark_ok(zone, sc->order,
 			      min_wmark_pages(zone), sc->reclaim_idx, 0))
Thanks so much! We will stress-test this on our side.
We do this by exhausting memory and triggering many suspend/resume cycles. This reliably reproduces the problem (before this patch).
Of course, as we all know, absence of evidence (no more stalls in stress tests) does not equal evidence of absence (stalls are gone in all code paths).
With this patch, the test results have been promising so far. After 10 hours of stress testing, we have not hit the reported problem yet. We will keep testing and report here if we hit the problem again. Thanks for engaging with us.
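For the curious, the gist of such a low-memory suspend/resume stress loop is sketched below. This is an illustration of the idea, not the actual test harness; it assumes root and a machine that suspends via writing "mem" to /sys/power/state, and it deliberately provokes the OOM killer.

/* Sketch of a low-memory suspend/resume stress loop: fork a child
 * that hogs and touches memory, then suspend and resume the system.
 * Illustration only -- not the actual harness used for testing. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (64UL << 20)	/* allocate in 64 MiB steps */

static void hog_memory(void)
{
	for (;;) {
		char *p = malloc(CHUNK);

		if (!p)
			break;
		for (size_t i = 0; i < CHUNK; i += 4096)
			p[i] = 1;	/* fault pages in for real pressure */
	}
}

static int suspend_once(void)
{
	FILE *f = fopen("/sys/power/state", "w");

	if (!f)
		return -1;
	fputs("mem", f);
	return fclose(f);	/* the flushed write returns after resume */
}

int main(void)
{
	for (int cycle = 0; ; cycle++) {
		if (fork() == 0) {
			hog_memory();	/* child; eventually OOM-killed */
			pause();
		}
		sleep(5);		/* let memory pressure build up */
		if (suspend_once() < 0) {
			perror("suspend");
			return 1;
		}
		printf("suspend/resume cycle %d done\n", cycle);
		sleep(5);
	}
}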
On Wed, Feb 21, 2024 at 8:53 AM Sven van Ashbrook svenva@chromium.org wrote:
> Thanks so much! We will stress-test this on our side.
> We do this by exhausting memory and triggering many suspend/resume cycles. This reliably reproduces the problem (before this patch).
> Of course, as we all know, absence of evidence (no more stalls in stress tests) does not equal evidence of absence (stalls are gone in all code paths).
Vlastimil,
We noticed that this patch is now added to Andrew Morton's mm-hotfixes-unstable branch.
How can we help to get this into mm-hotfixes-stable and, from there, into mainline?
We are still stress-testing using low memory suspend. Anything else that is required, and we can help with?
Sven
On 2/26/24 17:09, Sven van Ashbrook wrote:
> Vlastimil,
> We noticed that this patch is now added to Andrew Morton's mm-hotfixes-unstable branch.
> How can we help to get this into mm-hotfixes-stable and, from there, into mainline?
A Tested-by: can't hurt, but of course you need to finish the testing first. It's encouraging that you haven't hit the bug anymore so far, though!
> We are still stress-testing using low memory suspend. Anything else that is required, and we can help with?
I think that's already great enough, thanks!
> Sven
On Wed, Feb 21, 2024 at 4:44 AM Vlastimil Babka vbabka@suse.cz wrote:
> Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such a combination can happen in a suspend/resume context, where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask.
Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org>