RE: [PATCH] fs/ceph/addr: always call ceph_shift_unused_folios_left()

28 Aug 2025

On Thu, 2025-08-28 at 21:05 +0200, Ilya Dryomov wrote:
...
On Thu, Aug 28, 2025 at 8:55 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
...
On Wed, 2025-08-27 at 20:17 +0200, Max Kellermann wrote:
...
The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state.  Before folio_batch_release() crashes
due to this API violation, the function
ceph_shift_unused_folios_left() is supposed to remove those NULLs from
the array.
However, since commit ce80b76dd327 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to
ceph_process_folio_batch(), and now the `i` variable that remains in
ceph_writepages_start() doesn't get incremented anymore, making the
shifting effectively unreachable much of the time.
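
To make the intended clean-up concrete, here is a minimal userspace sketch of the compaction step described above: non-NULL entries are moved to the front of the array and the count is reduced, so that a later release never sees a NULL slot. The names `batch_model`, `MODEL_BATCH_SIZE` and `shift_unused_left_model()` are hypothetical stand-ins chosen for illustration; this is not the kernel's folio_batch or the actual body of ceph_shift_unused_folios_left().

```c
#include <stddef.h>

/* Hypothetical stand-in for a folio batch: a count plus a small array. */
#define MODEL_BATCH_SIZE 8

struct batch_model {
    unsigned nr;                     /* number of slots in use */
    void *slots[MODEL_BATCH_SIZE];   /* entries; NULL means "already consumed" */
};

/*
 * Move every non-NULL entry to the front, preserving order, and
 * shrink the count so that no NULL slot remains within [0, nr).
 */
static void shift_unused_left_model(struct batch_model *b)
{
    unsigned j, n = 0;

    for (j = 0; j < b->nr; j++) {
        if (!b->slots[j])
            continue;            /* skip consumed entries */
        if (n < j)
            b->slots[n] = b->slots[j];
        n++;
    }
    b->nr = n;
}
```

After this runs, every slot below `nr` is non-NULL, which is the state a release function is entitled to expect.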
Later, commit 1551ec61dc55 ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for some of various
  reasons), shifting never happens

- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:
BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410
The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.
I cannot reproduce the crash. If ceph_check_page_before_write() returns
`-E2BIG`, nothing crashes; the file system driver simply cannot process any
further write operations. So this does not look like a recipe for reproducing
the issue, and without a clear way to reproduce it I cannot confirm that the
patch fixes it.
Could you please provide a clearer explanation of the reproduction path?
Hi Slava,
Was this bit taken into account?
(Interestingly, the crash happens only if `huge_zero_folio` has
  already been allocated; without `huge_zero_folio`,
  is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
  entries instead of dereferencing them.  That makes reproducing the bug
  somewhat unreliable.  See
  https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com  
  for a discussion of this detail.)
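
The detail quoted above can be illustrated with a small userspace model: as long as the global pointer is still NULL, comparing it against a NULL entry yields true, so the NULL entry is skipped as if it were the huge zero folio. The `_model` names below are hypothetical, and the release loop only counts the NULL entries it would have dereferenced instead of actually faulting; it is a sketch of the comparison, not the kernel's folios_put_refs().

```c
#include <stdbool.h>
#include <stddef.h>

struct folio_model {
    int refcount;
};

/* Stand-in for the global; stays NULL until the huge zero folio exists. */
static struct folio_model *huge_zero_folio_model;

/* Pointer comparison only: NULL == NULL is true before allocation. */
static bool is_huge_zero_folio_model(const struct folio_model *folio)
{
    return huge_zero_folio_model == folio;
}

/*
 * Drop one reference per entry, skipping the huge zero folio.
 * Returns how many NULL entries would have been dereferenced.
 */
static unsigned put_refs_model(struct folio_model **folios, unsigned nr)
{
    unsigned i, null_derefs = 0;

    for (i = 0; i < nr; i++) {
        struct folio_model *folio = folios[i];

        if (is_huge_zero_folio_model(folio))
            continue;            /* also skips NULL while the global is NULL */
        if (!folio) {
            null_derefs++;       /* the real code would oops here */
            continue;
        }
        folio->refcount--;
    }
    return null_derefs;
}
```

With the global still NULL the model reports no NULL dereference; once it points at an allocated object, the same batch reports one. That matches the observation that reproduction depends on whether `huge_zero_folio` has already been allocated.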
Hi Ilya,
And which practical steps do you see for reproducing it? :)
Thanks,
Slava.
