page_cache_ra_order() and page_cache_ra_unbounded() read mapping minimum folio constraints before taking the invalidate lock, allowing concurrent changes to violate page cache invariants.
Move the lookups under filemap_invalidate_lock_shared() to ensure readahead allocations respect the mapping constraints.
Fixes: 47dd67532303 ("block/bdev: lift block size restrictions to 64k")
Reported-by: syzbot+4d3cc33ef7a77041efa6@syzkaller.appspotmail.com
Reported-by: syzbot+fdba5cca73fee92c69d6@syzkaller.appspotmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
---
 mm/readahead.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index b415c9969176..74acd6c4f87c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -214,7 +214,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	unsigned long mark = ULONG_MAX, i = 0;
-	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_nrpages;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -232,6 +232,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			lookahead_size);
 	filemap_invalidate_lock_shared(mapping);
 	index = mapping_align_index(mapping, index);
+	min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * As iterator `i` is aligned to min_nrpages, round_up the
@@ -467,7 +468,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t start = readahead_index(ractl);
 	pgoff_t index = start;
-	unsigned int min_order = mapping_min_folio_order(mapping);
+	unsigned int min_order;
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
@@ -485,13 +486,16 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
-	new_order = max(new_order, min_order);
 
 	ra->order = new_order;
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
 	filemap_invalidate_lock_shared(mapping);
+
+	min_order = mapping_min_folio_order(mapping);
+	new_order = max(new_order, min_order);
+
 	/*
 	 * If the new_order is greater than min_order and index is
 	 * already aligned to new_order, then this will be noop as index
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH] mm/readahead: read min folio constraints under invalidate lock
Link: https://lore.kernel.org/stable/20251215141936.1045907-1-wangjinchao600%40gma...
On Mon, Dec 15, 2025 at 10:19:00PM +0800, Jinchao Wang wrote:
> page_cache_ra_order() and page_cache_ra_unbounded() read mapping minimum
> folio constraints before taking the invalidate lock, allowing concurrent
> changes to violate page cache invariants.
> 
> Move the lookups under filemap_invalidate_lock_shared() to ensure
> readahead allocations respect the mapping constraints.
Why are the mapping folio size constraints being changed? They're supposed to be set at inode instantiation and then never changed.
On Mon, Dec 15, 2025 at 02:22:23PM +0000, Matthew Wilcox wrote:
> On Mon, Dec 15, 2025 at 10:19:00PM +0800, Jinchao Wang wrote:
> > page_cache_ra_order() and page_cache_ra_unbounded() read mapping minimum
> > folio constraints before taking the invalidate lock, allowing concurrent
> > changes to violate page cache invariants.
> > 
> > Move the lookups under filemap_invalidate_lock_shared() to ensure
> > readahead allocations respect the mapping constraints.
> 
> Why are the mapping folio size constraints being changed? They're
> supposed to be set at inode instantiation and then never changed.
They can change after instantiation for block devices. In the syzbot
repro:

  blkdev_ioctl()
    -> blkdev_bszset()
      -> set_blocksize()
        -> mapping_set_folio_min_order()
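
To make the window concrete, the interleaving looks roughly like this
(a sketch of the logic in mm/readahead.c and block/bdev.c, not verbatim
code):

	/* readahead side, before this patch */
	min_nrpages = mapping_min_folio_nrpages(mapping); /* old minimum */

	/* meanwhile, BLKBSZSET on another fd */
	filemap_invalidate_lock(mapping);
	kill_bdev(bdev);
	mapping_set_folio_min_order(mapping, get_order(size));
	filemap_invalidate_unlock(mapping);

	/* readahead side continues */
	filemap_invalidate_lock_shared(mapping);
	/*
	 * Folios are now allocated for the stale min_nrpages, and
	 * __filemap_add_folio() trips its order check against the
	 * new, larger mapping minimum.
	 */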
On Tue, Dec 16, 2025 at 09:37:51AM +0800, Jinchao Wang wrote:
> On Mon, Dec 15, 2025 at 02:22:23PM +0000, Matthew Wilcox wrote:
> > On Mon, Dec 15, 2025 at 10:19:00PM +0800, Jinchao Wang wrote:
> > > page_cache_ra_order() and page_cache_ra_unbounded() read mapping
> > > minimum folio constraints before taking the invalidate lock, allowing
> > > concurrent changes to violate page cache invariants.
> > > 
> > > Move the lookups under filemap_invalidate_lock_shared() to ensure
> > > readahead allocations respect the mapping constraints.
> > 
> > Why are the mapping folio size constraints being changed? They're
> > supposed to be set at inode instantiation and then never changed.
> 
> They can change after instantiation for block devices. In the syzbot
> repro:
> 
>   blkdev_ioctl()
>     -> blkdev_bszset()
>       -> set_blocksize()
>         -> mapping_set_folio_min_order()
Oh, this is just syzbot doing stupid things. We should probably make blkdev_bszset() fail if somebody else has an fd open.
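
Something like this, perhaps (an untested sketch only; the exact locking
around the openers count would need checking):

	/* in blkdev_bszset(), before calling set_blocksize() */
	mutex_lock(&bdev->bd_disk->open_mutex);
	if (atomic_read(&bdev->bd_openers) > 1) {
		mutex_unlock(&bdev->bd_disk->open_mutex);
		return -EBUSY;	/* somebody else has the bdev open */
	}
	mutex_unlock(&bdev->bd_disk->open_mutex);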
On Tue, Dec 16, 2025 at 02:42:06AM +0000, Matthew Wilcox wrote:
> On Tue, Dec 16, 2025 at 09:37:51AM +0800, Jinchao Wang wrote:
> > On Mon, Dec 15, 2025 at 02:22:23PM +0000, Matthew Wilcox wrote:
> > > On Mon, Dec 15, 2025 at 10:19:00PM +0800, Jinchao Wang wrote:
> > > > page_cache_ra_order() and page_cache_ra_unbounded() read mapping
> > > > minimum folio constraints before taking the invalidate lock,
> > > > allowing concurrent changes to violate page cache invariants.
> > > > 
> > > > Move the lookups under filemap_invalidate_lock_shared() to ensure
> > > > readahead allocations respect the mapping constraints.
> > > 
> > > Why are the mapping folio size constraints being changed? They're
> > > supposed to be set at inode instantiation and then never changed.
> > 
> > They can change after instantiation for block devices. In the syzbot
> > repro:
> > 
> >   blkdev_ioctl()
> >     -> blkdev_bszset()
> >       -> set_blocksize()
> >         -> mapping_set_folio_min_order()
> 
> Oh, this is just syzbot doing stupid things.  We should probably make
> blkdev_bszset() fail if somebody else has an fd open.
Thanks, that makes sense. Tightening blkdev_bszset() would avoid the race entirely. This change is meant as a defensive fix to prevent BUGs.
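
For reference, the check being tripped, as I read mm/filemap.c
(paraphrased):

	/* __filemap_add_folio() */
	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
			folio);

With a stale min_order/min_nrpages, readahead can add a folio smaller
than the mapping's new minimum and hit this.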
On Tue, Dec 16, 2025 at 11:12:21AM +0800, Jinchao Wang wrote:
> On Tue, Dec 16, 2025 at 02:42:06AM +0000, Matthew Wilcox wrote:
> > On Tue, Dec 16, 2025 at 09:37:51AM +0800, Jinchao Wang wrote:
> > > On Mon, Dec 15, 2025 at 02:22:23PM +0000, Matthew Wilcox wrote:
> > > > On Mon, Dec 15, 2025 at 10:19:00PM +0800, Jinchao Wang wrote:
> > > > > page_cache_ra_order() and page_cache_ra_unbounded() read mapping
> > > > > minimum folio constraints before taking the invalidate lock,
> > > > > allowing concurrent changes to violate page cache invariants.
> > > > > 
> > > > > Move the lookups under filemap_invalidate_lock_shared() to ensure
> > > > > readahead allocations respect the mapping constraints.
> > > > 
> > > > Why are the mapping folio size constraints being changed? They're
> > > > supposed to be set at inode instantiation and then never changed.
> > > 
> > > They can change after instantiation for block devices. In the syzbot
> > > repro:
> > > 
> > >   blkdev_ioctl()
> > >     -> blkdev_bszset()
> > >       -> set_blocksize()
> > >         -> mapping_set_folio_min_order()
> > 
> > Oh, this is just syzbot doing stupid things.  We should probably make
> > blkdev_bszset() fail if somebody else has an fd open.
> 
> Thanks, that makes sense. Tightening blkdev_bszset() would avoid the
> race entirely. This change is meant as a defensive fix to prevent BUGs.
Yes, but the point is that there's a lot of code which relies on the AS_FOLIO bits not changing in the middle. Syzbot found one of them, but there are others.
Hello,
syzbot has tested the proposed patch but the reproducer is still triggering an issue: kernel BUG in __filemap_add_folio
 reset_page_owner include/linux/page_owner.h:25 [inline]
 free_pages_prepare mm/page_alloc.c:1395 [inline]
 __free_frozen_pages+0x7df/0x1170 mm/page_alloc.c:2943
 rcu_do_batch kernel/rcu/tree.c:2605 [inline]
 rcu_core+0x79c/0x15f0 kernel/rcu/tree.c:2857
 handle_softirqs+0x219/0x950 kernel/softirq.c:622
 run_ksoftirqd kernel/softirq.c:1063 [inline]
 run_ksoftirqd+0x3a/0x60 kernel/softirq.c:1055
 smpboot_thread_fn+0x3f7/0xae0 kernel/smpboot.c:160
 kthread+0x3c5/0x780 kernel/kthread.c:463
 ret_from_fork+0x983/0xb10 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
------------[ cut here ]------------
kernel BUG at mm/filemap.c:858!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 6821 Comm: syz.1.76 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:__filemap_add_folio+0xf29/0x11b0 mm/filemap.c:858
Code: 9b c6 ff 48 c7 c6 c0 e9 99 8b 4c 89 ef e8 0f 74 11 00 90 0f 0b e8 47 9b c6 ff 48 c7 c6 20 ea 99 8b 4c 89 ef e8 f8 73 11 00 90 <0f> 0b e8 30 9b c6 ff 90 0f 0b 90 e9 1c fc ff ff e8 22 9b c6 ff 48
RSP: 0018:ffffc900033af840 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8880737fc980 RSI: ffffffff81f7ebf8 RDI: ffff8880737fce04
RBP: 0000000000112cc0 R08: 0000000000000001 R09: 0000000000000001
R10: ffffffff908689d7 R11: 0000000000000000 R12: 0000000000000002
R13: ffffea0001ce4980 R14: 0000000000000000 R15: 0000000000000000
FS:  000055557770b500(0000) GS:ffff888124a48000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9aef15c000 CR3: 000000002ee4c000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 filemap_add_folio+0x19a/0x610 mm/filemap.c:966
 ra_alloc_folio mm/readahead.c:453 [inline]
 page_cache_ra_order+0x637/0xed0 mm/readahead.c:512
 do_sync_mmap_readahead mm/filemap.c:3400 [inline]
 filemap_fault+0x16ac/0x29d0 mm/filemap.c:3549
 __do_fault+0x10d/0x490 mm/memory.c:5320
 do_shared_fault mm/memory.c:5819 [inline]
 do_fault+0x302/0x1ad0 mm/memory.c:5893
 do_pte_missing mm/memory.c:4401 [inline]
 handle_pte_fault mm/memory.c:6273 [inline]
 __handle_mm_fault+0x1919/0x2bb0 mm/memory.c:6411
 handle_mm_fault+0x3fe/0xad0 mm/memory.c:6580
 do_user_addr_fault+0x60c/0x1370 arch/x86/mm/fault.c:1336
 handle_page_fault arch/x86/mm/fault.c:1476 [inline]
 exc_page_fault+0x64/0xc0 arch/x86/mm/fault.c:1532
 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618
RIP: 0033:0x7f8af1a55171
Code: 48 8b 54 24 08 48 85 d2 74 17 8b 44 24 18 0f c8 89 c0 48 89 44 24 18 48 83 fa 01 0f 85 b3 01 00 00 48 8b 44 24 10 8b 54 24 18 <89> 10 e9 15 fd ff ff 48 8b 44 24 10 8b 10 48 8b 44 24 08 48 85 c0
RSP: 002b:00007ffc7d678bf0 EFLAGS: 00010246
RAX: 0000200000000980 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000004000 RSI: 0000000000000000 RDI: 000055557770b3c8
RBP: 00007ffc7d678cf8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: 00007f8af1dd5fac
R13: 00007f8af1dd5fa0 R14: fffffffffffffffe R15: 00007ffc7d678d40
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__filemap_add_folio+0xf29/0x11b0 mm/filemap.c:858
Code: 9b c6 ff 48 c7 c6 c0 e9 99 8b 4c 89 ef e8 0f 74 11 00 90 0f 0b e8 47 9b c6 ff 48 c7 c6 20 ea 99 8b 4c 89 ef e8 f8 73 11 00 90 <0f> 0b e8 30 9b c6 ff 90 0f 0b 90 e9 1c fc ff ff e8 22 9b c6 ff 48
RSP: 0018:ffffc900033af840 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8880737fc980 RSI: ffffffff81f7ebf8 RDI: ffff8880737fce04
RBP: 0000000000112cc0 R08: 0000000000000001 R09: 0000000000000001
R10: ffffffff908689d7 R11: 0000000000000000 R12: 0000000000000002
R13: ffffea0001ce4980 R14: 0000000000000000 R15: 0000000000000000
FS:  000055557770b500(0000) GS:ffff888124a48000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f772b5d7dac CR3: 000000002ee4c000 CR4: 00000000003526f0
Tested on:
commit:         40fbbd64 Merge tag 'pull-fixes' of git://git.kernel.or..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10715dc2580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=495547a782e37c4f
dashboard link: https://syzkaller.appspot.com/bug?extid=4d3cc33ef7a77041efa6
compiler:       gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
Note: no patches were applied.
On Tue, Dec 16, 2025 at 03:53:17AM +0000, Matthew Wilcox wrote:
> On Tue, Dec 16, 2025 at 11:12:21AM +0800, Jinchao Wang wrote:
> > On Tue, Dec 16, 2025 at 02:42:06AM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 16, 2025 at 09:37:51AM +0800, Jinchao Wang wrote:
> > > > On Mon, Dec 15, 2025 at 02:22:23PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Dec 15, 2025 at 10:19:00PM +0800, Jinchao Wang wrote:
> > > > > > page_cache_ra_order() and page_cache_ra_unbounded() read mapping
> > > > > > minimum folio constraints before taking the invalidate lock,
> > > > > > allowing concurrent changes to violate page cache invariants.
> > > > > > 
> > > > > > Move the lookups under filemap_invalidate_lock_shared() to ensure
> > > > > > readahead allocations respect the mapping constraints.
> > > > > 
> > > > > Why are the mapping folio size constraints being changed? They're
> > > > > supposed to be set at inode instantiation and then never changed.
> > > > 
> > > > They can change after instantiation for block devices. In the syzbot
> > > > repro:
> > > > 
> > > >   blkdev_ioctl()
> > > >     -> blkdev_bszset()
> > > >       -> set_blocksize()
> > > >         -> mapping_set_folio_min_order()
> > > 
> > > Oh, this is just syzbot doing stupid things.  We should probably make
> > > blkdev_bszset() fail if somebody else has an fd open.
> > 
> > Thanks, that makes sense. Tightening blkdev_bszset() would avoid the
> > race entirely. This change is meant as a defensive fix to prevent BUGs.
> 
> Yes, but the point is that there's a lot of code which relies on the
> AS_FOLIO bits not changing in the middle.  Syzbot found one of them,
> but there are others.
I've been thinking about this more, and I wanted to share another perspective if that's okay.
Rather than tracking down every place that might change AS_FOLIO bits (like blkdev_bszset() and potentially others), what if we make the page cache layer itself robust against such changes?
The invalidate_lock was introduced for exactly this kind of protection (commit 730633f0b7f9: "mm: Protect operations adding pages to page cache with invalidate_lock"). This way, the page cache doesn't need to rely on assumptions about what upper layers might do.
The readahead functions already hold filemap_invalidate_lock_shared(), so moving the constraint reads under the lock adds no overhead. It would protect against AS_FOLIO changes regardless of their source.
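
Concretely, the protocol would look roughly like this (a sketch only;
the writer side is more or less what set_blocksize() does today):

	/* writer side, e.g. set_blocksize() changing the min order */
	filemap_invalidate_lock(mapping);
	kill_bdev(bdev);		/* drop folios of the old size */
	mapping_set_folio_min_order(mapping, get_order(size));
	filemap_invalidate_unlock(mapping);

	/* reader side, e.g. readahead with this patch applied */
	filemap_invalidate_lock_shared(mapping);
	min_order = mapping_min_folio_order(mapping); /* stable until unlock */
	/* ... allocate and add folios of at least min_order ... */
	filemap_invalidate_unlock_shared(mapping);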
I think this separates concerns nicely: upper layers can change constraints through the invalidate_lock protocol, and page cache operations are automatically safe. But I'd really value your thoughts on this approach - you have much more experience with these tradeoffs than I do.
Thanks again for taking the time to discuss this.