This is based on mm-unstable and was cross-compiled heavily.
I should probably have already dropped the RFC label, but I first want to hear whether I ignored some corner case (SG entries?), and I need to do at least a bit more testing.
I will only CC non-MM folks on the cover letter and the respective patch to not flood too many inboxes (the lists receive all patches).
---
As discussed recently with Linus, nth_page() is just nasty and we would like to remove it.
To recap, the reason we currently need nth_page() within a folio is because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing. Instead, on these problematic configs, nth_page() always translates from page to PFN, to then go from (pfn + n) back to the page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when it might be dropped.
We refer to such problematic PFN ranges as "non-contiguous pages". If we only deal with "contiguous pages", there is no need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page() within CMA allocations (again, could span memory sections), and in one corner case (kfence) when processing memblock allocations (again, could span memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5  : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #12 : disallow folios to have non-contiguous pages
Patch #13 -> #20: remove nth_page() usage within folios
Patch #21       : disallow CMA allocations of non-contiguous pages
Patch #22 -> #31: sanity-check + remove nth_page() usage within SG entry
Patch #32       : sanity-check + remove nth_page() usage in
                  unpin_user_page_range_dirty_lock()
Patch #33       : remove nth_page() in kfence
Patch #34       : adjust stale comment regarding nth_page
Patch #35       : mm: remove nth_page()
A lot of this is inspired by the discussion at [1] between Linus, Jason and me, so kudos to them.
[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-Gy...
Cc: Andrew Morton akpm@linux-foundation.org
Cc: Linus Torvalds torvalds@linux-foundation.org
Cc: Jason Gunthorpe jgg@nvidia.com
Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Cc: "Liam R. Howlett" Liam.Howlett@oracle.com
Cc: Vlastimil Babka vbabka@suse.cz
Cc: Mike Rapoport rppt@kernel.org
Cc: Suren Baghdasaryan surenb@google.com
Cc: Michal Hocko mhocko@suse.com
Cc: Jens Axboe axboe@kernel.dk
Cc: Marek Szyprowski m.szyprowski@samsung.com
Cc: Robin Murphy robin.murphy@arm.com
Cc: John Hubbard jhubbard@nvidia.com
Cc: Peter Xu peterx@redhat.com
Cc: Alexander Potapenko glider@google.com
Cc: Marco Elver elver@google.com
Cc: Dmitry Vyukov dvyukov@google.com
Cc: Brendan Jackman jackmanb@google.com
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: Zi Yan ziy@nvidia.com
Cc: Dennis Zhou dennis@kernel.org
Cc: Tejun Heo tj@kernel.org
Cc: Christoph Lameter cl@gentwo.org
Cc: Muchun Song muchun.song@linux.dev
Cc: Oscar Salvador osalvador@suse.de
Cc: x86@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-ide@vger.kernel.org
Cc: intel-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-arm-kernel@axis.com
Cc: linux-scsi@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
Cc: linux-mm@kvack.org
Cc: io-uring@vger.kernel.org
Cc: iommu@lists.linux.dev
Cc: kasan-dev@googlegroups.com
Cc: wireguard@lists.zx2c4.com
Cc: netdev@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
David Hildenbrand (35):
  mm: stop making SPARSEMEM_VMEMMAP user-selectable
  arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel
    config
  mm/page_alloc: reject unreasonable folio/compound page sizes in
    alloc_contig_range_noprof()
  mm/memremap: reject unreasonable folio/compound page sizes in
    memremap_pages()
  mm/hugetlb: check for unreasonable folio sizes when registering hstate
  mm/mm_init: make memmap_init_compound() look more like
    prep_compound_page()
  mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  mm: sanity-check maximum folio size in folio_set_order()
  mm: limit folio/compound page sizes in problematic kernel configs
  mm: simplify folio_page() and folio_page_idx()
  mm/percpu-km: drop nth_page() usage within single allocation
  fs: hugetlbfs: remove nth_page() usage within folio in
    adjust_range_hwpoison()
  mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
  mm/gup: drop nth_page() usage within folio when recording subpages
  io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
  io_uring/zcrx: remove nth_page() usage within folio
  mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages()
  mm/cma: refuse handing out non-contiguous page ranges
  dma-remap: drop nth_page() in dma_common_contiguous_remap()
  scatterlist: disallow non-contiguous page ranges in a single SG entry
  ata: libata-eh: drop nth_page() usage within SG entry
  drm/i915/gem: drop nth_page() usage within SG entry
  mspro_block: drop nth_page() usage within SG entry
  memstick: drop nth_page() usage within SG entry
  mmc: drop nth_page() usage within SG entry
  scsi: core: drop nth_page() usage within SG entry
  vfio/pci: drop nth_page() usage within SG entry
  crypto: remove nth_page() usage within SG entry
  mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
  kfence: drop nth_page() usage
  block: update comment of "struct bio_vec" regarding nth_page()
  mm: remove nth_page()
 arch/arm64/Kconfig                                   |  1 -
 arch/mips/include/asm/cacheflush.h                   | 11 +++--
 arch/mips/mm/cache.c                                 |  8 ++--
 arch/s390/Kconfig                                    |  1 -
 arch/x86/Kconfig                                     |  1 -
 crypto/ahash.c                                       |  4 +-
 crypto/scompress.c                                   |  8 ++--
 drivers/ata/libata-sff.c                             |  6 +--
 drivers/gpu/drm/i915/gem/i915_gem_pages.c            |  2 +-
 drivers/memstick/core/mspro_block.c                  |  3 +-
 drivers/memstick/host/jmb38x_ms.c                    |  3 +-
 drivers/memstick/host/tifm_ms.c                      |  3 +-
 drivers/mmc/host/tifm_sd.c                           |  4 +-
 drivers/mmc/host/usdhi6rol0.c                        |  4 +-
 drivers/scsi/scsi_lib.c                              |  3 +-
 drivers/scsi/sg.c                                    |  3 +-
 drivers/vfio/pci/pds/lm.c                            |  3 +-
 drivers/vfio/pci/virtio/migrate.c                    |  3 +-
 fs/hugetlbfs/inode.c                                 | 25 ++++------
 include/crypto/scatterwalk.h                         |  4 +-
 include/linux/bvec.h                                 |  7 +--
 include/linux/mm.h                                   | 48 +++++++++++++++----
 include/linux/page-flags.h                           |  5 +-
 include/linux/scatterlist.h                          |  4 +-
 io_uring/zcrx.c                                      | 34 ++++---------
 kernel/dma/remap.c                                   |  2 +-
 mm/Kconfig                                           |  3 +-
 mm/cma.c                                             | 36 +++++++++-----
 mm/gup.c                                             | 13 +++--
 mm/hugetlb.c                                         | 23 ++++-----
 mm/internal.h                                        |  1 +
 mm/kfence/core.c                                     | 17 ++++---
 mm/memremap.c                                        |  3 ++
 mm/mm_init.c                                         | 13 ++---
 mm/page_alloc.c                                      |  5 +-
 mm/pagewalk.c                                        |  2 +-
 mm/percpu-km.c                                       |  2 +-
 mm/util.c                                            | 33 +++++++++++++
 tools/testing/scatterlist/linux/mm.h                 |  1 -
 .../selftests/wireguard/qemu/kernel.config           |  1 -
 40 files changed, 203 insertions(+), 150 deletions(-)
base-commit: c0e3b3f33ba7b767368de4afabaf7c1ddfdc3872
In an ideal world, we wouldn't have to deal with SPARSEMEM without SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is considered too costly and consequently not supported.
However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just like we already do for arm64, s390 and x86.
So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without SPARSEMEM_VMEMMAP.
This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone for loongarch, powerpc, riscv and sparc. All architectures only enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big downside to using the VMEMMAP (quite the contrary).
This is a preparation for not supporting
(1) folio sizes that exceed a single memory section (2) CMA allocations of non-contiguous page ranges
in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit the possible impact as much as possible (e.g., gigantic hugetlb page allocations suddenly failing).
Cc: Huacai Chen chenhuacai@kernel.org
Cc: WANG Xuerui kernel@xen0n.name
Cc: Madhavan Srinivasan maddy@linux.ibm.com
Cc: Michael Ellerman mpe@ellerman.id.au
Cc: Nicholas Piggin npiggin@gmail.com
Cc: Christophe Leroy christophe.leroy@csgroup.eu
Cc: Paul Walmsley paul.walmsley@sifive.com
Cc: Palmer Dabbelt palmer@dabbelt.com
Cc: Albert Ou aou@eecs.berkeley.edu
Cc: Alexandre Ghiti alex@ghiti.fr
Cc: "David S. Miller" davem@davemloft.net
Cc: Andreas Larsson andreas@gaisler.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 4108bcd967848..330d0e698ef96 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE
 	bool

 config SPARSEMEM_VMEMMAP
-	bool "Sparse Memory virtual memmap"
+	def_bool y
 	depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
-	default y
 	help
 	  SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
 	  pfn_to_page and page_to_pfn operations.  This is the most
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
[...]
Sounds like a good idea.
Acked-by: Zi Yan ziy@nvidia.com
Best Regards, Yan, Zi
On Thu, Aug 21, 2025 at 10:06:27PM +0200, David Hildenbrand wrote:
[...]
Acked-by: Mike Rapoport (Microsoft) rppt@kernel.org
On Thu, 21 Aug 2025 22:06:27 +0200 David Hildenbrand david@redhat.com wrote:
[...]
Acked-by: SeongJae Park sj@kernel.org
Thanks, SJ
[...]
Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE is selected.
Cc: Catalin Marinas catalin.marinas@arm.com
Cc: Will Deacon will@kernel.org
Signed-off-by: David Hildenbrand david@redhat.com
---
 arch/arm64/Kconfig | 1 -
 1 file changed, 1 deletion(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e9bbfacc35a64..b1d1f2ff2493b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz"
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_VMEMMAP_ENABLE
-	select SPARSEMEM_VMEMMAP

 config HW_PERF_EVENTS
 	def_bool y
On Thu, Aug 21, 2025 at 10:06:28PM +0200, David Hildenbrand wrote:
[...]
Reviewed-by: Mike Rapoport (Microsoft) rppt@kernel.org
Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE is selected.
Cc: Heiko Carstens hca@linux.ibm.com
Cc: Vasily Gorbik gor@linux.ibm.com
Cc: Alexander Gordeev agordeev@linux.ibm.com
Cc: Christian Borntraeger borntraeger@linux.ibm.com
Cc: Sven Schnelle svens@linux.ibm.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 arch/s390/Kconfig | 1 -
 1 file changed, 1 deletion(-)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index bf680c26a33cf..145ca23c2fff6 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -710,7 +710,6 @@ menu "Memory setup"
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_VMEMMAP_ENABLE
-	select SPARSEMEM_VMEMMAP

 config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
On Thu, Aug 21, 2025 at 10:06:29PM +0200, David Hildenbrand wrote:
[...]
Reviewed-by: Mike Rapoport (Microsoft) rppt@kernel.org
Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE is selected.
Cc: Thomas Gleixner tglx@linutronix.de
Cc: Ingo Molnar mingo@redhat.com
Cc: Borislav Petkov bp@alien8.de
Cc: Dave Hansen dave.hansen@linux.intel.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 58d890fe2100e..e431d1c06fecd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_STATIC if X86_32
 	select SPARSEMEM_VMEMMAP_ENABLE if X86_64
-	select SPARSEMEM_VMEMMAP if X86_64

 config ARCH_SPARSEMEM_DEFAULT
 	def_bool X86_64 || (NUMA && X86_32)
On Thu, Aug 21, 2025 at 10:06:30PM +0200, David Hildenbrand wrote:
[...]
Reviewed-by: Mike Rapoport (Microsoft) rppt@kernel.org
It's no longer user-selectable (and the default was already "y"), so let's just drop it.
Cc: "Jason A. Donenfeld" Jason@zx2c4.com
Cc: Shuah Khan shuah@kernel.org
Signed-off-by: David Hildenbrand david@redhat.com
---
 tools/testing/selftests/wireguard/qemu/kernel.config | 1 -
 1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config
index 0a5381717e9f4..1149289f4b30f 100644
--- a/tools/testing/selftests/wireguard/qemu/kernel.config
+++ b/tools/testing/selftests/wireguard/qemu/kernel.config
@@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y
 CONFIG_FUTEX=y
 CONFIG_SHMEM=y
 CONFIG_SLUB=y
-CONFIG_SPARSEMEM_VMEMMAP=y
 CONFIG_SMP=y
 CONFIG_SCHED_SMT=y
 CONFIG_SCHED_MC=y
On Thu, Aug 21, 2025 at 10:06:31PM +0200, David Hildenbrand wrote:
It's no longer user-selectable (and the default was already "y"), so let's just drop it.
and it should not matter for wireguard selftest anyway
Acked-by: Mike Rapoport (Microsoft) rppt@kernel.org
Let's reject them early, which in turn makes folio_alloc_gigantic() reject them properly.
To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER and calculate MAX_FOLIO_NR_PAGES based on that.
Signed-off-by: David Hildenbrand david@redhat.com
---
 include/linux/mm.h | 6 ++++--
 mm/page_alloc.c    | 5 ++++-
 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00c8a54127d37..77737cbf2216a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio)

 /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
+#define MAX_FOLIO_ORDER		PUD_ORDER
 #else
-#define MAX_FOLIO_NR_PAGES	MAX_ORDER_NR_PAGES
+#define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
 #endif

+#define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
+
 /*
  * compound_nr() returns the number of pages in this potentially compound
  * page. compound_nr() can be called on a tail page, and is defined to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca9e6b9633f79..1e6ae4c395b30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
 int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 			      acr_flags_t alloc_flags, gfp_t gfp_mask)
 {
+	const unsigned int order = ilog2(end - start);
 	unsigned long outer_start, outer_end;
 	int ret = 0;

@@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 			PB_ISOLATE_MODE_CMA_ALLOC : PB_ISOLATE_MODE_OTHER;

+	if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER))
+		return -EINVAL;
+
 	gfp_mask = current_gfp_context(gfp_mask);
 	if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask))
 		return -EINVAL;
@@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 		free_contig_range(end, outer_end - end);
 	} else if (start == outer_start && end == outer_end &&
 		   is_power_of_2(end - start)) {
 		struct page *head = pfn_to_page(start);
-		int order = ilog2(end - start);

 		check_new_pages(head, order);
 		prep_new_page(head, order, gfp_mask, 0);
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
[...]
LGTM. Reviewed-by: Zi Yan ziy@nvidia.com
Best Regards, Yan, Zi
On Thu, 21 Aug 2025 22:06:32 +0200 David Hildenbrand david@redhat.com wrote:
Let's reject them early,
I like early failures. :)
which in turn makes folio_alloc_gigantic() reject them properly.
[...]
Acked-by: SeongJae Park sj@kernel.org
Thanks, SJ
[...]
Let's reject unreasonable folio sizes early, where we can still fail. We'll add sanity checks to prep_compound_head()/prep_compound_page() next.
Is there a way to configure a system such that unreasonable folio sizes would be possible? It would already be rather questionable.
If so, we'd probably want to bail out even earlier, where we can avoid a WARN and just report a proper error message that indicates where something went wrong.
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/memremap.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd8..a2d4bb88f64b6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)

 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
 		return ERR_PTR(-EINVAL);
+	if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER,
+		      "requested folio size unsupported\n"))
+		return ERR_PTR(-EINVAL);

 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
On Thu, 21 Aug 2025 22:06:33 +0200 David Hildenbrand david@redhat.com wrote:
[...]
Acked-by: SeongJae Park sj@kernel.org
Thanks, SJ
[...]
Let's check that no hstate corresponding to an unreasonable folio size is registered by an architecture. If we were to succeed in registering one, we could later try allocating an unsupported gigantic folio size.
Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have to use a BUILD_BUG_ON_INVALID() to make it compile.
No existing kernel configuration should be able to trigger this check: either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or gigantic folios will not exceed a memory section (the case on sparc).
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/hugetlb.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 514fab5a20ef8..d12a9d5146af4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)

 	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
 		     __NR_HPAGEFLAGS);
+	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);

 	if (!hugepages_supported()) {
 		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
@@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	}
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
+	WARN_ON(order > MAX_FOLIO_ORDER);
 	h = &hstates[hugetlb_max_hstate++];
 	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
 	h->order = order;
Grepping for "prep_compound_page" leaves one clueless as to how devdax gets its compound pages initialized.
Let's add a comment that might help finding this open-coded prep_compound_page() initialization more easily.
Further, let's be less smart about the ordering of initialization and just perform the prep_compound_head() call after all tail pages were initialized: just like prep_compound_page() does.
No need for a lengthy comment then: again, just like prep_compound_page().
Note that prep_compound_head() already initializes stuff in page[2] that successive tail page initialization will overwrite: _deferred_list, and on 32bit, _entire_mapcount and _pincount. Very likely the 32bit case does not apply, and likely nobody ever ends up testing whether the _deferred_list is empty.
So it shouldn't be a fix at this point, but certainly something to clean up.
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/mm_init.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c21b3af216b2..708466c5b2cc9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
 	unsigned long pfn, end_pfn = head_pfn + nr_pages;
 	unsigned int order = pgmap->vmemmap_shift;

+	/*
+	 * This is an open-coded prep_compound_page() whereby we avoid
+	 * walking pages twice by initializing them in the same go.
+	 */
 	__SetPageHead(head);
 	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
@@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head,
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 		prep_compound_tail(head, pfn - head_pfn);
 		set_page_count(page, 0);
-
-		/*
-		 * The first tail page stores important compound page info.
-		 * Call prep_compound_head() after the first tail page has
-		 * been initialized, to not have the data overwritten.
-		 */
-		if (pfn == head_pfn + 1)
-			prep_compound_head(head, order);
 	}
+	prep_compound_head(head, order);
 }

 void __ref memmap_init_zone_device(struct zone *zone,
On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote:
[...]
+	/*
+	 * This is an open-coded prep_compound_page() whereby we avoid
+	 * walking pages twice by initializing them in the same go.
+	 */
While on it, can you also mention that prep_compound_page() is not used to properly set page zone link?
With this
Reviewed-by: Mike Rapoport (Microsoft) rppt@kernel.org
On 22.08.25 17:27, Mike Rapoport wrote:
On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote:
Grepping for "prep_compound_page" leaves on clueless how devdax gets its compound pages initialized.
Let's add a comment that might help finding this open-coded prep_compound_page() initialization more easily.
Further, let's be less smart about the ordering of initialization and just perform the prep_compound_head() call after all tail pages were initialized: just like prep_compound_page() does.
No need for a lengthy comment then: again, just like prep_compound_page().
Note that prep_compound_head() already initializes fields in page[2] that successive tail page initialization will overwrite: _deferred_list, and on 32bit _entire_mapcount and _pincount. Very likely 32bit does not apply, and likely nobody ever ends up testing whether the _deferred_list is empty.
So it shouldn't be a fix at this point, but certainly something to clean up.
Signed-off-by: David Hildenbrand david@redhat.com
mm/mm_init.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c index 5c21b3af216b2..708466c5b2cc9 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head, unsigned long pfn, end_pfn = head_pfn + nr_pages; unsigned int order = pgmap->vmemmap_shift;
+	/*
+	 * This is an open-coded prep_compound_page() whereby we avoid
+	 * walking pages twice by initializing them in the same go.
+	 */
While on it, can you also mention that prep_compound_page() is not used to properly set page zone link?
Sure, thanks!
All pages were already initialized and set to PageReserved() with a refcount of 1 by MM init code.
In fact, by using __init_single_page(), we will be setting the refcount to 1 just to freeze it again immediately afterwards.
So drop the __init_single_page() and use __ClearPageReserved() instead. Adjust the comments to highlight that we are dealing with an open-coded prep_compound_page() variant.
Further, as we can now safely iterate over all pages in a folio, let's avoid the page-pfn dance and just iterate the pages directly.
Note that the current code was likely problematic, but we never ran into it: prep_compound_tail() would have been called with an offset that might exceed a memory section, and prep_compound_tail() would have simply added that offset to the page pointer -- which would not have done the right thing on sparsemem without vmemmap.
Signed-off-by: David Hildenbrand david@redhat.com --- mm/hugetlb.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d12a9d5146af4..ae82a845b14ad 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, unsigned long start_page_number, unsigned long end_page_number) { - enum zone_type zone = zone_idx(folio_zone(folio)); - int nid = folio_nid(folio); - unsigned long head_pfn = folio_pfn(folio); - unsigned long pfn, end_pfn = head_pfn + end_page_number; + struct page *head_page = folio_page(folio, 0); + struct page *page = folio_page(folio, start_page_number); + unsigned long i; int ret;
- for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { - struct page *page = pfn_to_page(pfn); - - __init_single_page(page, pfn, zone, nid); - prep_compound_tail((struct page *)folio, pfn - head_pfn); + for (i = start_page_number; i < end_page_number; i++, page++) { + __ClearPageReserved(page); + prep_compound_tail(head_page, i); ret = page_ref_freeze(page, 1); VM_BUG_ON(!ret); } @@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, { int ret;
- /* Prepare folio head */ + /* + * This is an open-coded prep_compound_page() whereby we avoid + * walking pages twice by preparing+freezing them in the same go. + */ __folio_clear_reserved(folio); __folio_set_head(folio); ret = folio_ref_freeze(folio, 1); VM_BUG_ON(!ret); - /* Initialize the necessary tail struct pages */ hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); prep_compound_head((struct page *)folio, huge_page_order(h)); }
On 8/21/25 23:06, David Hildenbrand wrote:
All pages were already initialized and set to PageReserved() with a refcount of 1 by MM init code.
Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to initialize struct pages?
In fact, by using __init_single_page(), we will be setting the refcount to 1 just to freeze it again immediately afterwards.
So drop the __init_single_page() and use __ClearPageReserved() instead. Adjust the comments to highlight that we are dealing with an open-coded prep_compound_page() variant.
Further, as we can now safely iterate over all pages in a folio, let's avoid the page-pfn dance and just iterate the pages directly.
Note that the current code was likely problematic, but we never ran into it: prep_compound_tail() would have been called with an offset that might exceed a memory section, and prep_compound_tail() would have simply added that offset to the page pointer -- which would not have done the right thing on sparsemem without vmemmap.
Signed-off-by: David Hildenbrand david@redhat.com
mm/hugetlb.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d12a9d5146af4..ae82a845b14ad 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, unsigned long start_page_number, unsigned long end_page_number) {
-	enum zone_type zone = zone_idx(folio_zone(folio));
-	int nid = folio_nid(folio);
-	unsigned long head_pfn = folio_pfn(folio);
-	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+	struct page *head_page = folio_page(folio, 0);
+	struct page *page = folio_page(folio, start_page_number);
+	unsigned long i;
 	int ret;
-	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
-		struct page *page = pfn_to_page(pfn);
-
-		__init_single_page(page, pfn, zone, nid);
-		prep_compound_tail((struct page *)folio, pfn - head_pfn);
+	for (i = start_page_number; i < end_page_number; i++, page++) {
+		__ClearPageReserved(page);
+		prep_compound_tail(head_page, i);
 		ret = page_ref_freeze(page, 1);
 		VM_BUG_ON(!ret);
 	}
@@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, { int ret;
-	/* Prepare folio head */
+	/*
+	 * This is an open-coded prep_compound_page() whereby we avoid
+	 * walking pages twice by preparing+freezing them in the same go.
+	 */
 	__folio_clear_reserved(folio);
 	__folio_set_head(folio);
 	ret = folio_ref_freeze(folio, 1);
 	VM_BUG_ON(!ret);
-	/* Initialize the necessary tail struct pages */
 	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
 	prep_compound_head((struct page *)folio, huge_page_order(h));
}
--Mika
On 22.08.25 06:09, Mika Penttilä wrote:
On 8/21/25 23:06, David Hildenbrand wrote:
All pages were already initialized and set to PageReserved() with a refcount of 1 by MM init code.
Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to initialize struct pages?
Excellent point, I did not know about that one.
Spotting that we don't do the same for the head page made me assume that it's just a misuse of __init_single_page().
But the nasty thing is that we use memblock_reserved_mark_noinit() to only mark the tail pages ...
Let me revert back to __init_single_page() and add a big fat comment why this is required.
Thanks!
On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
On 22.08.25 06:09, Mika Penttilä wrote:
On 8/21/25 23:06, David Hildenbrand wrote:
All pages were already initialized and set to PageReserved() with a refcount of 1 by MM init code.
Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to initialize struct pages?
Excellent point, I did not know about that one.
Spotting that we don't do the same for the head page made me assume that it's just a misuse of __init_single_page().
But the nasty thing is that we use memblock_reserved_mark_noinit() to only mark the tail pages ...
And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled struct pages are initialized regardless of memblock_reserved_mark_noinit().
I think this patch should go in before your updates:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 753f99b4c718..1c51788339a5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3230,6 +3230,22 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid) return 1; }
+/* + * Tail pages in a huge folio allocated from memblock are marked as 'noinit', + * which means that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled their + * struct page won't be initialized + */ +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +static void __init hugetlb_init_tail_page(struct page *page, unsigned long pfn, + enum zone_type zone, int nid) +{ + __init_single_page(page, pfn, zone, nid); +} +#else +static inline void hugetlb_init_tail_page(struct page *page, unsigned long pfn, + enum zone_type zone, int nid) {} +#endif + /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, unsigned long start_page_number, @@ -3244,7 +3260,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { struct page *page = pfn_to_page(pfn);
- __init_single_page(page, pfn, zone, nid); + hugetlb_init_tail_page(page, pfn, zone, nid); prep_compound_tail((struct page *)folio, pfn - head_pfn); ret = page_ref_freeze(page, 1); VM_BUG_ON(!ret);
Let me revert back to __init_single_page() and add a big fat comment why this is required.
Thanks!
Let's sanity-check in folio_set_order() whether we would be trying to create a folio with an order that would make it exceed MAX_FOLIO_ORDER.
This will enable the check whenever a folio/compound page is initialized through prep_compound_head() / prep_compound_page().
Signed-off-by: David Hildenbrand david@redhat.com --- mm/internal.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc030..946ce97036d67 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) { if (WARN_ON_ONCE(!order || !folio_test_large(folio))) return; + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER);
folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; #ifdef NR_PAGES_IN_LARGE_FOLIO
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
Let's sanity-check in folio_set_order() whether we would be trying to create a folio with an order that would make it exceed MAX_FOLIO_ORDER.
This will enable the check whenever a folio/compound page is initialized through prep_compound_head() / prep_compound_page().
Signed-off-by: David Hildenbrand david@redhat.com
mm/internal.h | 1 + 1 file changed, 1 insertion(+)
Reviewed-by: Zi Yan ziy@nvidia.com
Best Regards, Yan, Zi
On problematic kernel configs where the memmap is allocated per memory section (SPARSEMEM without SPARSEMEM_VMEMMAP), let's limit the maximum folio size to a single memory section.
Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE but not SPARSEMEM_VMEMMAP: sh.
Fortunately, the biggest hugetlb size sh supports is 64 MiB (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB (SECTION_SIZE_BITS == 26), so their use case is not degraded.
As folios and memory sections are naturally aligned to their power-of-two size in memory, a single folio can consequently no longer span multiple memory sections on these problematic kernel configs.
nth_page() is no longer required when operating within a single compound page / folio.
Signed-off-by: David Hildenbrand david@redhat.com --- include/linux/mm.h | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 77737cbf2216a..48a985e17ef4e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) return folio_large_nr_pages(folio); }
-/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE -#define MAX_FOLIO_ORDER PUD_ORDER -#else +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) +/* + * We don't expect any folios that exceed buddy sizes (and consequently + * memory sections). + */ #define MAX_FOLIO_ORDER MAX_PAGE_ORDER +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +/* + * Only pages within a single memory section are guaranteed to be + * contiguous. By limiting folios to a single memory section, all folio + * pages are guaranteed to be contiguous. + */ +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT +#else +/* + * There is no real limit on the folio size. We limit them to the maximum we + * currently expect. + */ +#define MAX_FOLIO_ORDER PUD_ORDER #endif
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
On problematic kernel configs where the memmap is allocated per memory section (SPARSEMEM without SPARSEMEM_VMEMMAP), let's limit the maximum folio size to a single memory section.
Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE but not SPARSEMEM_VMEMMAP: sh.
Fortunately, the biggest hugetlb size sh supports is 64 MiB (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB (SECTION_SIZE_BITS == 26), so their use case is not degraded.
As folios and memory sections are naturally aligned to their power-of-two size in memory, a single folio can consequently no longer span multiple memory sections on these problematic kernel configs.
nth_page() is no longer required when operating within a single compound page / folio.
Signed-off-by: David Hildenbrand david@redhat.com
include/linux/mm.h | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 77737cbf2216a..48a985e17ef4e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) return folio_large_nr_pages(folio); }
-/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_ORDER	PUD_ORDER
-#else
+#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
 #define MAX_FOLIO_ORDER	MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER	PFN_SECTION_SHIFT
+#else
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect.
The comment about hugetlbfs is helpful here, since the other folios are still limited by buddy allocator’s MAX_ORDER.
+ */
+#define MAX_FOLIO_ORDER	PUD_ORDER
 #endif
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
2.50.1
Otherwise, Reviewed-by: Zi Yan ziy@nvidia.com
Best Regards, Yan, Zi
On 21.08.25 22:46, Zi Yan wrote:
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
On problematic kernel configs where the memmap is allocated per memory section (SPARSEMEM without SPARSEMEM_VMEMMAP), let's limit the maximum folio size to a single memory section.
Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE but not SPARSEMEM_VMEMMAP: sh.
Fortunately, the biggest hugetlb size sh supports is 64 MiB (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB (SECTION_SIZE_BITS == 26), so their use case is not degraded.
As folios and memory sections are naturally aligned to their power-of-two size in memory, a single folio can consequently no longer span multiple memory sections on these problematic kernel configs.
nth_page() is no longer required when operating within a single compound page / folio.
Signed-off-by: David Hildenbrand david@redhat.com
include/linux/mm.h | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 77737cbf2216a..48a985e17ef4e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) return folio_large_nr_pages(folio); }
-/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_ORDER	PUD_ORDER
-#else
+#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
 #define MAX_FOLIO_ORDER	MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER	PFN_SECTION_SHIFT
+#else
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect.
The comment about hugetlbfs is helpful here, since the other folios are still limited by buddy allocator’s MAX_ORDER.
Yeah, but the old comment was wrong (there is DAX).
I can add here "currently expect (e.g., hugetlbfs, dax)."
On 21 Aug 2025, at 16:49, David Hildenbrand wrote:
On 21.08.25 22:46, Zi Yan wrote:
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
On problematic kernel configs where the memmap is allocated per memory section (SPARSEMEM without SPARSEMEM_VMEMMAP), let's limit the maximum folio size to a single memory section.
Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE but not SPARSEMEM_VMEMMAP: sh.
Fortunately, the biggest hugetlb size sh supports is 64 MiB (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB (SECTION_SIZE_BITS == 26), so their use case is not degraded.
As folios and memory sections are naturally aligned to their power-of-two size in memory, a single folio can consequently no longer span multiple memory sections on these problematic kernel configs.
nth_page() is no longer required when operating within a single compound page / folio.
Signed-off-by: David Hildenbrand david@redhat.com
include/linux/mm.h | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 77737cbf2216a..48a985e17ef4e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) return folio_large_nr_pages(folio); }
-/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_ORDER	PUD_ORDER
-#else
+#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
 #define MAX_FOLIO_ORDER	MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER	PFN_SECTION_SHIFT
+#else
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect.
The comment about hugetlbfs is helpful here, since the other folios are still limited by buddy allocator’s MAX_ORDER.
Yeah, but the old comment was wrong (there is DAX).
I can add here "currently expect (e.g., hugetlbfs, dax)."
Sounds good.
Best Regards, Yan, Zi
Now that a single folio/compound page can no longer span memory sections in problematic kernel configurations, we can stop using nth_page().
While at it, turn both macros into static inline functions and add kernel doc for folio_page_idx().
Signed-off-by: David Hildenbrand david@redhat.com --- include/linux/mm.h | 16 ++++++++++++++-- include/linux/page-flags.h | 5 ++++- 2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 48a985e17ef4e..ef360b72cb05c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) #else #define nth_page(page,n) ((page) + (n)) -#define folio_page_idx(folio, p) ((p) - &(folio)->page) #endif
/* to align the pointer to the (next) page boundary */ @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
+/** + * folio_page_idx - Return the number of a page in a folio. + * @folio: The folio. + * @page: The folio page. + * + * This function expects that the page is actually part of the folio. + * The returned number is relative to the start of the folio. + */ +static inline unsigned long folio_page_idx(const struct folio *folio, + const struct page *page) +{ + return page - &folio->page; +} + static inline struct folio *lru_to_folio(struct list_head *head) { return list_entry((head)->prev, struct folio, lru); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d53a86e68c89b..080ad10c0defc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) * check that the page number lies within @folio; the caller is presumed * to have a reference to the page. */ -#define folio_page(folio, n) nth_page(&(folio)->page, n) +static inline struct page *folio_page(struct folio *folio, unsigned long nr) +{ + return &folio->page + nr; +}
static __always_inline int PageTail(const struct page *page) {
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
Now that a single folio/compound page can no longer span memory sections in problematic kernel configurations, we can stop using nth_page().
While at it, turn both macros into static inline functions and add kernel doc for folio_page_idx().
Signed-off-by: David Hildenbrand david@redhat.com
include/linux/mm.h | 16 ++++++++++++++-- include/linux/page-flags.h | 5 ++++- 2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 48a985e17ef4e..ef360b72cb05c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) #else #define nth_page(page,n) ((page) + (n)) -#define folio_page_idx(folio, p) ((p) - &(folio)->page) #endif
/* to align the pointer to the (next) page boundary */ @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
+/**
+ * folio_page_idx - Return the number of a page in a folio.
+ * @folio: The folio.
+ * @page: The folio page.
+ *
+ * This function expects that the page is actually part of the folio.
+ * The returned number is relative to the start of the folio.
+ */
+static inline unsigned long folio_page_idx(const struct folio *folio,
+		const struct page *page)
+{
+	return page - &folio->page;
+}
static inline struct folio *lru_to_folio(struct list_head *head) { return list_entry((head)->prev, struct folio, lru); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d53a86e68c89b..080ad10c0defc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
  * check that the page number lies within @folio; the caller is presumed
  * to have a reference to the page.
  */
-#define folio_page(folio, n)	nth_page(&(folio)->page, n)
+static inline struct page *folio_page(struct folio *folio, unsigned long nr)
+{
+	return &folio->page + nr;
+}
Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio.
Since you have added kernel doc for folio_page_idx(), it does not hurt to have something similar for folio_page(). :)
+/**
+ * folio_page - Return the nth page in a folio.
+ * @folio: The folio.
+ * @n: Page index within the folio.
+ *
+ * This function expects that n does not exceed folio_nr_pages(folio).
+ * The returned page is relative to the first page of the folio.
+ */
static __always_inline int PageTail(const struct page *page) { -- 2.50.1
Otherwise, Reviewed-by: Zi Yan ziy@nvidia.com
Best Regards, Yan, Zi
On 21.08.25 22:55, Zi Yan wrote:
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
Now that a single folio/compound page can no longer span memory sections in problematic kernel configurations, we can stop using nth_page().
While at it, turn both macros into static inline functions and add kernel doc for folio_page_idx().
Signed-off-by: David Hildenbrand david@redhat.com
include/linux/mm.h | 16 ++++++++++++++-- include/linux/page-flags.h | 5 ++++- 2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 48a985e17ef4e..ef360b72cb05c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) #else #define nth_page(page,n) ((page) + (n)) -#define folio_page_idx(folio, p) ((p) - &(folio)->page) #endif
/* to align the pointer to the (next) page boundary */ @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
+/**
+ * folio_page_idx - Return the number of a page in a folio.
+ * @folio: The folio.
+ * @page: The folio page.
+ *
+ * This function expects that the page is actually part of the folio.
+ * The returned number is relative to the start of the folio.
+ */
+static inline unsigned long folio_page_idx(const struct folio *folio,
+		const struct page *page)
+{
+	return page - &folio->page;
+}
- static inline struct folio *lru_to_folio(struct list_head *head) { return list_entry((head)->prev, struct folio, lru);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d53a86e68c89b..080ad10c0defc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
  * check that the page number lies within @folio; the caller is presumed
  * to have a reference to the page.
  */
-#define folio_page(folio, n)	nth_page(&(folio)->page, n)
+static inline struct page *folio_page(struct folio *folio, unsigned long nr)
+{
+	return &folio->page + nr;
+}
Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio.
Yeah, it's even called "n" in the kernel docs ...
Since you have added kernel doc for folio_page_idx(), it does not hurt to have something similar for folio_page(). :)
... which we already have! (see above the macro) :)
Thanks!
We're allocating a higher-order page from the buddy. For these pages (that are guaranteed to not exceed a single memory section) there is no need to use nth_page().
Signed-off-by: David Hildenbrand david@redhat.com --- mm/percpu-km.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu-km.c b/mm/percpu-km.c index fe31aa19db81a..4efa74a495cb6 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) }
for (i = 0; i < nr_pages; i++) - pcpu_set_page_chunk(nth_page(pages, i), chunk); + pcpu_set_page_chunk(pages + i, chunk);
chunk->data = pages; chunk->base_addr = page_address(pages);
The nth_page() is not really required anymore, so let's remove it. While at it, cleanup and simplify the code a bit.
Signed-off-by: David Hildenbrand david@redhat.com --- fs/hugetlbfs/inode.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 34d496a2b7de6..dc981509a7717 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -198,31 +198,22 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, size_t bytes) { - struct page *page; - size_t n = 0; - size_t res = 0; + struct page *page = folio_page(folio, offset / PAGE_SIZE); + size_t n, safe_bytes;
- /* First page to start the loop. */ - page = folio_page(folio, offset / PAGE_SIZE); offset %= PAGE_SIZE; - while (1) { + for (safe_bytes = 0; safe_bytes < bytes; safe_bytes += n) { + if (is_raw_hwpoison_page_in_hugepage(page)) break;
/* Safe to read n bytes without touching HWPOISON subpage. */ - n = min(bytes, (size_t)PAGE_SIZE - offset); - res += n; - bytes -= n; - if (!bytes || !n) - break; - offset += n; - if (offset == PAGE_SIZE) { - page = nth_page(page, 1); - offset = 0; - } + n = min(bytes - safe_bytes, (size_t)PAGE_SIZE - offset); + offset = 0; + page++; }
- return res; + return safe_bytes; }
/*
It's no longer required to use nth_page() within a folio, so let's just drop the nth_page() in folio_walk_start().
Signed-off-by: David Hildenbrand david@redhat.com --- mm/pagewalk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c index c6753d370ff4e..9e4225e5fcf5c 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, found: if (expose_page) /* Note: Offset from the mapped page, not the folio start. */ - fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT); + fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT); else fw->page = NULL; fw->ptl = ptl;
nth_page() is no longer required when iterating over pages within a single folio, so let's just drop it when recording subpages.
Signed-off-by: David Hildenbrand david@redhat.com --- mm/gup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c index b2a78f0291273..f017ff6d7d61a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -491,9 +491,9 @@ static int record_subpages(struct page *page, unsigned long sz, struct page *start_page; int nr;
- start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT); + start_page = page + ((addr & (sz - 1)) >> PAGE_SHIFT); for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) - pages[nr] = nth_page(start_page, nr); + pages[nr] = start_page + nr;
return nr; } @@ -1512,7 +1512,7 @@ static long __get_user_pages(struct mm_struct *mm, }
for (j = 0; j < page_increm; j++) { - subpage = nth_page(page, j); + subpage = page + j; pages[i + j] = subpage; flush_anon_page(vma, subpage, start + j * PAGE_SIZE); flush_dcache_page(subpage);
We always provide a single dst page, it's unclear why the io_copy_cache complexity is required.
So let's simplify and get rid of "struct io_copy_cache", simply working on the single page.
... which immediately allows us to drop one "nth_page" usage, because it's really just a single page.
Cc: Jens Axboe axboe@kernel.dk Signed-off-by: David Hildenbrand david@redhat.com --- io_uring/zcrx.c | 32 +++++++------------------------- 1 file changed, 7 insertions(+), 25 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index e5ff49f3425e0..f29b2a4867516 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -954,29 +954,18 @@ static struct net_iov *io_zcrx_alloc_fallback(struct io_zcrx_area *area) return niov; }
-struct io_copy_cache { - struct page *page; - unsigned long offset; - size_t size; -}; - -static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, +static ssize_t io_copy_page(struct page *dst_page, struct page *src_page, unsigned int src_offset, size_t len) { - size_t copied = 0; + size_t dst_offset = 0;
- len = min(len, cc->size); + len = min(len, PAGE_SIZE);
while (len) { void *src_addr, *dst_addr; - struct page *dst_page = cc->page; - unsigned dst_offset = cc->offset; size_t n = len;
- if (folio_test_partial_kmap(page_folio(dst_page)) || - folio_test_partial_kmap(page_folio(src_page))) { - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); - dst_offset = offset_in_page(dst_offset); + if (folio_test_partial_kmap(page_folio(src_page))) { src_page = nth_page(src_page, src_offset / PAGE_SIZE); src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); @@ -991,12 +980,10 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, kunmap_local(src_addr); kunmap_local(dst_addr);
- cc->size -= n; - cc->offset += n; + dst_offset += n; len -= n; - copied += n; } - return copied; + return dst_offset; }
static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, @@ -1011,7 +998,6 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, return -EFAULT;
while (len) { - struct io_copy_cache cc; struct net_iov *niov; size_t n;
@@ -1021,11 +1007,7 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, break; }
- cc.page = io_zcrx_iov_page(niov); - cc.offset = 0; - cc.size = PAGE_SIZE; - - n = io_copy_page(&cc, src_page, src_offset, len); + n = io_copy_page(io_zcrx_iov_page(niov), src_page, src_offset, len);
if (!io_zcrx_queue_cqe(req, niov, ifq, 0, n)) { io_zcrx_return_niov(niov);
On 8/21/25 21:06, David Hildenbrand wrote:
We always provide a single dst page, so it's unclear why the io_copy_cache complexity is required.
Because it'll need to be pulled outside the loop to reuse the page for multiple copies, i.e. packing multiple fragments of the same skb into it. Not finished, and currently it's wasting memory.
Why not do as below? Pages there never cross boundaries of their folios.
Do you want it to be taken into the io_uring tree?
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index e5ff49f3425e..18c12f4b56b6 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
if (folio_test_partial_kmap(page_folio(dst_page)) || folio_test_partial_kmap(page_folio(src_page))) { - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); + dst_page += dst_offset / PAGE_SIZE; dst_offset = offset_in_page(dst_offset); - src_page = nth_page(src_page, src_offset / PAGE_SIZE); + src_page += src_offset / PAGE_SIZE; src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); n = min(n, len);
On 22.08.25 13:32, Pavel Begunkov wrote:
On 8/21/25 21:06, David Hildenbrand wrote:
We always provide a single dst page, so it's unclear why the io_copy_cache complexity is required.
Because it'll need to be pulled outside the loop to reuse the page for multiple copies, i.e. packing multiple fragments of the same skb into it. Not finished, and currently it's wasting memory.
Okay, so what you're saying is that there will be follow-up work that will actually make this structure useful.
Why not do as below? Pages there never cross boundaries of their folios.

Do you want it to be taken into the io_uring tree?
This should better all go through the MM tree where we actually guarantee contiguous pages within a folio. (see the cover letter)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index e5ff49f3425e..18c12f4b56b6 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, if (folio_test_partial_kmap(page_folio(dst_page)) || folio_test_partial_kmap(page_folio(src_page))) {
- dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
+ dst_page += dst_offset / PAGE_SIZE;
dst_offset = offset_in_page(dst_offset);
- src_page = nth_page(src_page, src_offset / PAGE_SIZE);
+ src_page += src_offset / PAGE_SIZE;
Yeah, I can do that in the next version given that you have plans on extending that code soon.
Within a folio/compound page, nth_page() is no longer required. Given that we call folio_test_partial_kmap()+kmap_local_page(), the code would already be problematic if the src pages spanned multiple folios.
So let's just assume that all src pages belong to a single folio/compound page and can be iterated ordinarily.
Cc: Jens Axboe axboe@kernel.dk Signed-off-by: David Hildenbrand david@redhat.com --- io_uring/zcrx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index f29b2a4867516..107b2a1b31c1c 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -966,7 +966,7 @@ static ssize_t io_copy_page(struct page *dst_page, struct page *src_page, size_t n = len;
if (folio_test_partial_kmap(page_folio(src_page))) { - src_page = nth_page(src_page, src_offset / PAGE_SIZE); + src_page += src_offset / PAGE_SIZE; src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); n = min(n, len);
Let's make it clearer that we are operating within a single folio by providing both the folio and the page.
This implies that for flush_dcache_folio() we'll now avoid one more page->folio lookup, and that we can safely drop the "nth_page" usage.
Cc: Thomas Bogendoerfer tsbogend@alpha.franken.de Signed-off-by: David Hildenbrand david@redhat.com --- arch/mips/include/asm/cacheflush.h | 11 +++++++---- arch/mips/mm/cache.c | 8 ++++---- 2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h index 1f14132b3fc98..8a2de28936e07 100644 --- a/arch/mips/include/asm/cacheflush.h +++ b/arch/mips/include/asm/cacheflush.h @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); extern void (*flush_cache_range)(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); -extern void __flush_dcache_pages(struct page *page, unsigned int nr); +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr);
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 static inline void flush_dcache_folio(struct folio *folio) { if (cpu_has_dc_aliases) - __flush_dcache_pages(&folio->page, folio_nr_pages(folio)); + __flush_dcache_folio_pages(folio, folio_page(folio, 0), + folio_nr_pages(folio)); else if (!cpu_has_ic_fills_f_dc) folio_set_dcache_dirty(folio); } @@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio)
static inline void flush_dcache_page(struct page *page) { + struct folio *folio = page_folio(page); + if (cpu_has_dc_aliases) - __flush_dcache_pages(page, 1); + __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio)); else if (!cpu_has_ic_fills_f_dc) - folio_set_dcache_dirty(page_folio(page)); + folio_set_dcache_dirty(folio); }
#define flush_dcache_mmap_lock(mapping) do { } while (0) diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c index bf9a37c60e9f0..e3b4224c9a406 100644 --- a/arch/mips/mm/cache.c +++ b/arch/mips/mm/cache.c @@ -99,9 +99,9 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes, return 0; }
-void __flush_dcache_pages(struct page *page, unsigned int nr) +void __flush_dcache_folio_pages(struct folio *folio, struct page *page, + unsigned int nr) { - struct folio *folio = page_folio(page); struct address_space *mapping = folio_flush_mapping(folio); unsigned long addr; unsigned int i; @@ -117,12 +117,12 @@ void __flush_dcache_pages(struct page *page, unsigned int nr) * get faulted into the tlb (and thus flushed) anyways. */ for (i = 0; i < nr; i++) { - addr = (unsigned long)kmap_local_page(nth_page(page, i)); + addr = (unsigned long)kmap_local_page(page + i); flush_data_cache_page(addr); kunmap_local((void *)addr); } } -EXPORT_SYMBOL(__flush_dcache_pages); +EXPORT_SYMBOL(__flush_dcache_folio_pages);
void __flush_anon_page(struct page *page, unsigned long vmaddr) {
Let's disallow handing out PFN ranges with non-contiguous pages, so we can remove the nth_page() usage in __cma_alloc(), and so callers don't have to worry about that either when wanting to blindly iterate pages.
This is really only a problem in configs with SPARSEMEM but without SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some cases.
Will this cause harm? Probably not, because it's mostly 32-bit that does not support SPARSEMEM_VMEMMAP. If this ever becomes a problem, we could look into allocating the memmap for all memory sections spanned by a single CMA region in one go from memblock.
Signed-off-by: David Hildenbrand david@redhat.com --- include/linux/mm.h | 6 ++++++ mm/cma.c | 36 +++++++++++++++++++++++------------- mm/util.c | 33 +++++++++++++++++++++++++++++++++ 3 files changed, 62 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360b72cb05c..f59ad1f9fc792 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #else #define nth_page(page,n) ((page) + (n)) +static inline bool page_range_contiguous(const struct page *page, + unsigned long nr_pages) +{ + return true; +} #endif
/* to align the pointer to the (next) page boundary */ diff --git a/mm/cma.c b/mm/cma.c index 2ffa4befb99ab..1119fa2830008 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, unsigned long count, unsigned int align, struct page **pagep, gfp_t gfp) { - unsigned long mask, offset; - unsigned long pfn = -1; - unsigned long start = 0; unsigned long bitmap_maxno, bitmap_no, bitmap_count; + unsigned long start, pfn, mask, offset; int ret = -EBUSY; struct page *page = NULL;
@@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, if (bitmap_count > bitmap_maxno) goto out;
- for (;;) { + for (start = 0; ; start = bitmap_no + mask + 1) { spin_lock_irq(&cma->lock); /* * If the request is larger than the available number @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, spin_unlock_irq(&cma->lock); break; } + + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); + page = pfn_to_page(pfn); + + /* + * Do not hand out page ranges that are not contiguous, so + * callers can just iterate the pages without having to worry + * about these corner cases. + */ + if (!page_range_contiguous(page, count)) { + spin_unlock_irq(&cma->lock); + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", + __func__, cma->name, pfn, pfn + count - 1); + continue; + } + bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); cma->available_count -= count; /* @@ -821,29 +835,25 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, */ spin_unlock_irq(&cma->lock);
- pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); mutex_lock(&cma->alloc_mutex); ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); mutex_unlock(&cma->alloc_mutex); - if (ret == 0) { - page = pfn_to_page(pfn); + if (!ret) break; - }
cma_clear_bitmap(cma, cmr, pfn, count); if (ret != -EBUSY) break;
pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", - __func__, pfn, pfn_to_page(pfn)); + __func__, pfn, page);
trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), count, align); - /* try again with a bit different memory target */ - start = bitmap_no + mask + 1; } out: - *pagep = page; + if (!ret) + *pagep = page; return ret; }
@@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, */ if (page) { for (i = 0; i < count; i++) - page_kasan_tag_reset(nth_page(page, i)); + page_kasan_tag_reset(page + i); }
if (ret && !(gfp & __GFP_NOWARN)) { diff --git a/mm/util.c b/mm/util.c index d235b74f7aff7..0bf349b19b652 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, { return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); } + +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +/** + * page_range_contiguous - test whether the page range is contiguous + * @page: the start of the page range. + * @nr_pages: the number of pages in the range. + * + * Test whether the page range is contiguous, such that they can be iterated + * naively, corresponding to iterating a contiguous PFN range. + * + * This function should primarily only be used for debug checks, or when + * working with page ranges that are not naturally contiguous (e.g., pages + * within a folio are). + * + * Returns true if contiguous, otherwise false. + */ +bool page_range_contiguous(const struct page *page, unsigned long nr_pages) +{ + const unsigned long start_pfn = page_to_pfn(page); + const unsigned long end_pfn = start_pfn + nr_pages; + unsigned long pfn; + + /* + * The memmap is allocated per memory section. We need to check + * each involved memory section once. + */ + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); + pfn < end_pfn; pfn += PAGES_PER_SECTION) + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) + return false; + return true; +} +#endif #endif /* CONFIG_MMU */
dma_common_contiguous_remap() is used to remap an "allocated contiguous region". Within a single allocation, there is no need to use nth_page() anymore.
Neither the buddy, nor hugetlb, nor CMA will hand out problematic page ranges.
Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Signed-off-by: David Hildenbrand david@redhat.com --- kernel/dma/remap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 9e2afad1c6152..b7c1c0c92d0c8 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size, if (!pages) return NULL; for (i = 0; i < count; i++) - pages[i] = nth_page(page, i); + pages[i] = page++; vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); kvfree(pages);
On 21.08.2025 22:06, David Hildenbrand wrote:
dma_common_contiguous_remap() is used to remap an "allocated contiguous region". Within a single allocation, there is no need to use nth_page() anymore.
Neither the buddy, nor hugetlb, nor CMA will hand out problematic page ranges.
Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Signed-off-by: David Hildenbrand david@redhat.com
Acked-by: Marek Szyprowski m.szyprowski@samsung.com
kernel/dma/remap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 9e2afad1c6152..b7c1c0c92d0c8 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size, if (!pages) return NULL; for (i = 0; i < count; i++)
pages[i] = nth_page(page, i);
pages[i] = page++;

vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); kvfree(pages);
Best regards
The expectation is that there is currently no user that would pass in non-contiguous page ranges: no allocator, not even VMA, will hand these out.
The only problematic part would be if someone would provide a range obtained directly from memblock, or manually merge problematic ranges. If we find such cases, we should fix them to create separate SG entries.
Let's check in sg_set_page() that this is really the case. No need to check in sg_set_folio(), as pages in a folio are guaranteed to be contiguous.
We can now drop the nth_page() usage in sg_page_iter_page().
Signed-off-by: David Hildenbrand david@redhat.com --- include/linux/scatterlist.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f8a4965f9b98..8196949dfc82c 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -6,6 +6,7 @@ #include <linux/types.h> #include <linux/bug.h> #include <linux/mm.h> +#include <linux/mm_inline.h> #include <asm/io.h>
struct scatterlist { @@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) static inline void sg_set_page(struct scatterlist *sg, struct page *page, unsigned int len, unsigned int offset) { + VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE)); sg_assign_page(sg, page); sg->offset = offset; sg->length = len; @@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter, */ static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) { - return nth_page(sg_page(piter->sg), piter->sg_pgoffset); + return sg_page(piter->sg) + piter->sg_pgoffset; }
/**
On 21.08.2025 22:06, David Hildenbrand wrote:
The expectation is that there is currently no user that would pass in non-contiguous page ranges: no allocator, not even VMA, will hand these out.
The only problematic part would be if someone would provide a range obtained directly from memblock, or manually merge problematic ranges. If we find such cases, we should fix them to create separate SG entries.
Let's check in sg_set_page() that this is really the case. No need to check in sg_set_folio(), as pages in a folio are guaranteed to be contiguous.
We can now drop the nth_page() usage in sg_page_iter_page().
Signed-off-by: David Hildenbrand david@redhat.com
Acked-by: Marek Szyprowski m.szyprowski@samsung.com
include/linux/scatterlist.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f8a4965f9b98..8196949dfc82c 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -6,6 +6,7 @@ #include <linux/types.h> #include <linux/bug.h> #include <linux/mm.h> +#include <linux/mm_inline.h> #include <asm/io.h> struct scatterlist { @@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) static inline void sg_set_page(struct scatterlist *sg, struct page *page, unsigned int len, unsigned int offset) {
+ VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE));
sg_assign_page(sg, page); sg->offset = offset; sg->length = len;
@@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter, */ static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) {
- return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
+ return sg_page(piter->sg) + piter->sg_pgoffset;
}
/**
Best regards
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Damien Le Moal dlemoal@kernel.org Cc: Niklas Cassel cassel@kernel.org Signed-off-by: David Hildenbrand david@redhat.com --- drivers/ata/libata-sff.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c index 7fc407255eb46..9f5d0f9f6d686 100644 --- a/drivers/ata/libata-sff.c +++ b/drivers/ata/libata-sff.c @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) offset = qc->cursg->offset + qc->cursg_ofs;
/* get the current page and offset */ - page = nth_page(page, (offset >> PAGE_SHIFT)); + page += offset / PAGE_SHIFT; offset %= PAGE_SIZE;
/* don't overrun current sg */ @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) unsigned int split_len = PAGE_SIZE - offset;
ata_pio_xfer(qc, page, offset, split_len); - ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len); + ata_pio_xfer(qc, page + 1, 0, count - split_len); } else { ata_pio_xfer(qc, page, offset, count); } @@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes) offset = sg->offset + qc->cursg_ofs;
/* get the current page and offset */ - page = nth_page(page, (offset >> PAGE_SHIFT)); + page += offset / PAGE_SIZE; offset %= PAGE_SIZE;
/* don't overrun current sg */
On 8/22/25 05:06, David Hildenbrand wrote:
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Damien Le Moal dlemoal@kernel.org Cc: Niklas Cassel cassel@kernel.org Signed-off-by: David Hildenbrand david@redhat.com
drivers/ata/libata-sff.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c index 7fc407255eb46..9f5d0f9f6d686 100644 --- a/drivers/ata/libata-sff.c +++ b/drivers/ata/libata-sff.c @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) offset = qc->cursg->offset + qc->cursg_ofs; /* get the current page and offset */
- page = nth_page(page, (offset >> PAGE_SHIFT));
+ page += offset / PAGE_SHIFT;
Shouldn't this be "offset >> PAGE_SHIFT" ?
offset %= PAGE_SIZE; /* don't overrun current sg */ @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) unsigned int split_len = PAGE_SIZE - offset; ata_pio_xfer(qc, page, offset, split_len);
ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len);
ata_pio_xfer(qc, page + 1, 0, count - split_len);

} else { ata_pio_xfer(qc, page, offset, count); }
@@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes) offset = sg->offset + qc->cursg_ofs; /* get the current page and offset */
- page = nth_page(page, (offset >> PAGE_SHIFT));
+ page += offset / PAGE_SIZE;
Same here, though this seems correct too.
offset %= PAGE_SIZE; /* don't overrun current sg */
On 22.08.25 03:59, Damien Le Moal wrote:
On 8/22/25 05:06, David Hildenbrand wrote:
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Damien Le Moal dlemoal@kernel.org Cc: Niklas Cassel cassel@kernel.org Signed-off-by: David Hildenbrand david@redhat.com
drivers/ata/libata-sff.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c index 7fc407255eb46..9f5d0f9f6d686 100644 --- a/drivers/ata/libata-sff.c +++ b/drivers/ata/libata-sff.c @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) offset = qc->cursg->offset + qc->cursg_ofs; /* get the current page and offset */
- page = nth_page(page, (offset >> PAGE_SHIFT));
+ page += offset / PAGE_SHIFT;
Shouldn't this be "offset >> PAGE_SHIFT" ?
Thanks for taking a look!
Yeah, I already reverted to "offset >> PAGE_SHIFT" after Linus mentioned in another mail in this thread that ">> PAGE_SHIFT" is generally preferred, because the compiler cannot optimize the division as well when offset is a signed variable.
So the next version will have the shift again.
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Jani Nikula jani.nikula@linux.intel.com Cc: Joonas Lahtinen joonas.lahtinen@linux.intel.com Cc: Rodrigo Vivi rodrigo.vivi@intel.com Cc: Tvrtko Ursulin tursulin@ursulin.net Cc: David Airlie airlied@gmail.com Cc: Simona Vetter simona@ffwll.ch Signed-off-by: David Hildenbrand david@redhat.com --- drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c index c16a57160b262..031d7acc16142 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c @@ -779,7 +779,7 @@ __i915_gem_object_get_page(struct drm_i915_gem_object *obj, pgoff_t n) GEM_BUG_ON(!i915_gem_object_has_struct_page(obj));
sg = i915_gem_object_get_sg(obj, n, &offset); - return nth_page(sg_page(sg), offset); + return sg_page(sg) + offset; }
/* Like i915_gem_object_get_page(), but mark the returned page dirty */
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Maxim Levitsky maximlevitsky@gmail.com Cc: Alex Dubov oakad@yahoo.com Cc: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: David Hildenbrand david@redhat.com --- drivers/memstick/core/mspro_block.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/memstick/core/mspro_block.c b/drivers/memstick/core/mspro_block.c index c9853d887d282..985cfca3f6944 100644 --- a/drivers/memstick/core/mspro_block.c +++ b/drivers/memstick/core/mspro_block.c @@ -560,8 +560,7 @@ static int h_mspro_block_transfer_data(struct memstick_dev *card, t_offset += msb->current_page * msb->page_size;
sg_set_page(&t_sg, - nth_page(sg_page(&(msb->req_sg[msb->current_seg])), - t_offset >> PAGE_SHIFT), + sg_page(&(msb->req_sg[msb->current_seg])) + t_offset / PAGE_SIZE, msb->page_size, offset_in_page(t_offset));
memstick_init_req_sg(*mrq, msb->data_dir == READ
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Maxim Levitsky maximlevitsky@gmail.com Cc: Alex Dubov oakad@yahoo.com Cc: Ulf Hansson ulf.hansson@linaro.org Signed-off-by: David Hildenbrand david@redhat.com --- drivers/memstick/host/jmb38x_ms.c | 3 +-- drivers/memstick/host/tifm_ms.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/memstick/host/jmb38x_ms.c b/drivers/memstick/host/jmb38x_ms.c index cddddb3a5a27f..c5e71d39ffd51 100644 --- a/drivers/memstick/host/jmb38x_ms.c +++ b/drivers/memstick/host/jmb38x_ms.c @@ -317,8 +317,7 @@ static int jmb38x_ms_transfer_data(struct jmb38x_ms_host *host) unsigned int p_off;
if (host->req->long_data) { - pg = nth_page(sg_page(&host->req->sg), - off >> PAGE_SHIFT); + pg = sg_page(&host->req->sg) + off / PAGE_SIZE; p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, length); diff --git a/drivers/memstick/host/tifm_ms.c b/drivers/memstick/host/tifm_ms.c index db7f3a088fb09..0d64184ca10a9 100644 --- a/drivers/memstick/host/tifm_ms.c +++ b/drivers/memstick/host/tifm_ms.c @@ -201,8 +201,7 @@ static unsigned int tifm_ms_transfer_data(struct tifm_ms *host) unsigned int p_off;
if (host->req->long_data) { - pg = nth_page(sg_page(&host->req->sg), - off >> PAGE_SHIFT); + pg = sg_page(&host->req->sg) + off / PAGE_SIZE; p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, length);
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Alex Dubov oakad@yahoo.com Cc: Ulf Hansson ulf.hansson@linaro.org Cc: Jesper Nilsson jesper.nilsson@axis.com Cc: Lars Persson lars.persson@axis.com Signed-off-by: David Hildenbrand david@redhat.com --- drivers/mmc/host/tifm_sd.c | 4 ++-- drivers/mmc/host/usdhi6rol0.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/mmc/host/tifm_sd.c b/drivers/mmc/host/tifm_sd.c index ac636efd911d3..f1ede2b39b505 100644 --- a/drivers/mmc/host/tifm_sd.c +++ b/drivers/mmc/host/tifm_sd.c @@ -191,7 +191,7 @@ static void tifm_sd_transfer_data(struct tifm_sd *host) } off = sg[host->sg_pos].offset + host->block_pos;
- pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT); + pg = sg_page(&sg[host->sg_pos]) + off / PAGE_SIZE; p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, cnt); @@ -240,7 +240,7 @@ static void tifm_sd_bounce_block(struct tifm_sd *host, struct mmc_data *r_data) } off = sg[host->sg_pos].offset + host->block_pos;
- pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT); + pg = sg_page(&sg[host->sg_pos]) + off / PAGE_SIZE; p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, cnt); diff --git a/drivers/mmc/host/usdhi6rol0.c b/drivers/mmc/host/usdhi6rol0.c index 85b49c07918b3..3bccf800339ba 100644 --- a/drivers/mmc/host/usdhi6rol0.c +++ b/drivers/mmc/host/usdhi6rol0.c @@ -323,7 +323,7 @@ static void usdhi6_blk_bounce(struct usdhi6_host *host,
host->head_pg.page = host->pg.page; host->head_pg.mapped = host->pg.mapped; - host->pg.page = nth_page(host->pg.page, 1); + host->pg.page = host->pg.page + 1; host->pg.mapped = kmap(host->pg.page);
host->blk_page = host->bounce_buf; @@ -503,7 +503,7 @@ static void usdhi6_sg_advance(struct usdhi6_host *host) /* We cannot get here after crossing a page border */
/* Next page in the same SG */ - host->pg.page = nth_page(sg_page(host->sg), host->page_idx); + host->pg.page = sg_page(host->sg) + host->page_idx; host->pg.mapped = kmap(host->pg.page); host->blk_page = host->pg.mapped;
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: "James E.J. Bottomley" James.Bottomley@HansenPartnership.com Cc: "Martin K. Petersen" martin.petersen@oracle.com Cc: Doug Gilbert dgilbert@interlog.com Signed-off-by: David Hildenbrand david@redhat.com --- drivers/scsi/scsi_lib.c | 3 +-- drivers/scsi/sg.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 0c65ecfedfbd6..f523f85828b89 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -3148,8 +3148,7 @@ void *scsi_kmap_atomic_sg(struct scatterlist *sgl, int sg_count, /* Offset starting from the beginning of first page in this sg-entry */ *offset = *offset - len_complete + sg->offset;
- /* Assumption: contiguous pages can be accessed as "page + i" */ - page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT)); + page = sg_page(sg) + *offset / PAGE_SIZE; *offset &= ~PAGE_MASK;
/* Bytes in this sg-entry from *offset to the end of the page */ diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 3c02a5f7b5f39..2c653f2b21133 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1235,8 +1235,7 @@ sg_vma_fault(struct vm_fault *vmf) len = vma->vm_end - sa; len = (len < length) ? len : length; if (offset < len) { - struct page *page = nth_page(rsv_schp->pages[k], - offset >> PAGE_SHIFT); + struct page *page = rsv_schp->pages[k] + offset / PAGE_SIZE; get_page(page); /* increment page count */ vmf->page = page; return 0; /* success */
On 8/21/25 1:06 PM, David Hildenbrand wrote:
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Usually the SCSI core and the SG I/O driver are updated separately. Anyway:
Reviewed-by: Bart Van Assche bvanassche@acm.org
On 22.08.25 20:01, Bart Van Assche wrote:
On 8/21/25 1:06 PM, David Hildenbrand wrote:
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Usually the SCSI core and the SG I/O driver are updated separately. Anyway:
Thanks, I had these as separate patches but decided to merge them per broader subsystem before sending. I can split them up again in the next version.
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Brett Creeley brett.creeley@amd.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Yishai Hadas yishaih@nvidia.com Cc: Shameer Kolothum shameerali.kolothum.thodi@huawei.com Cc: Kevin Tian kevin.tian@intel.com Cc: Alex Williamson alex.williamson@redhat.com Signed-off-by: David Hildenbrand david@redhat.com --- drivers/vfio/pci/pds/lm.c | 3 +-- drivers/vfio/pci/virtio/migrate.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c index f2673d395236a..4d70c833fa32e 100644 --- a/drivers/vfio/pci/pds/lm.c +++ b/drivers/vfio/pci/pds/lm.c @@ -151,8 +151,7 @@ static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file, lm_file->last_offset_sg = sg; lm_file->sg_last_entry += i; lm_file->last_offset = cur_offset; - return nth_page(sg_page(sg), - (offset - cur_offset) / PAGE_SIZE); + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; } cur_offset += sg->length; } diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c index ba92bb4e9af94..7dd0ac866461d 100644 --- a/drivers/vfio/pci/virtio/migrate.c +++ b/drivers/vfio/pci/virtio/migrate.c @@ -53,8 +53,7 @@ virtiovf_get_migration_page(struct virtiovf_data_buffer *buf, buf->last_offset_sg = sg; buf->sg_last_entry += i; buf->last_offset = cur_offset; - return nth_page(sg_page(sg), - (offset - cur_offset) / PAGE_SIZE); + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; } cur_offset += sg->length; }
It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 crypto/ahash.c               | 4 ++--
 crypto/scompress.c           | 8 ++++----
 include/crypto/scatterwalk.h | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/crypto/ahash.c b/crypto/ahash.c
index a227793d2c5b5..a9f757224a223 100644
--- a/crypto/ahash.c
+++ b/crypto/ahash.c
@@ -88,7 +88,7 @@ static int hash_walk_new_entry(struct crypto_hash_walk *walk)
 
 	sg = walk->sg;
 	walk->offset = sg->offset;
-	walk->pg = nth_page(sg_page(walk->sg), (walk->offset >> PAGE_SHIFT));
+	walk->pg = sg_page(walk->sg) + walk->offset / PAGE_SIZE;
 	walk->offset = offset_in_page(walk->offset);
 	walk->entrylen = sg->length;
 
@@ -226,7 +226,7 @@ int shash_ahash_digest(struct ahash_request *req, struct shash_desc *desc)
 	if (!IS_ENABLED(CONFIG_HIGHMEM))
 		return crypto_shash_digest(desc, data, nbytes, req->result);
 
-	page = nth_page(page, offset >> PAGE_SHIFT);
+	page += offset / PAGE_SIZE;
 	offset = offset_in_page(offset);
 
 	if (nbytes > (unsigned int)PAGE_SIZE - offset)
diff --git a/crypto/scompress.c b/crypto/scompress.c
index c651e7f2197a9..1a7ed8ae65b07 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -198,7 +198,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 	} else
 		return -ENOSYS;
 
-	dpage = nth_page(dpage, doff / PAGE_SIZE);
+	dpage += doff / PAGE_SIZE;
 	doff = offset_in_page(doff);
 
 	n = (dlen - 1) / PAGE_SIZE;
@@ -220,12 +220,12 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 		} else
 			break;
 
-		spage = nth_page(spage, soff / PAGE_SIZE);
+		spage = spage + soff / PAGE_SIZE;
 		soff = offset_in_page(soff);
 
 		n = (slen - 1) / PAGE_SIZE;
 		n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE;
-		if (PageHighMem(nth_page(spage, n)) &&
+		if (PageHighMem(spage + n) &&
 		    size_add(soff, slen) > PAGE_SIZE)
 			break;
 		src = kmap_local_page(spage) + soff;
@@ -270,7 +270,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 		if (dlen <= PAGE_SIZE)
 			break;
 		dlen -= PAGE_SIZE;
-		dpage = nth_page(dpage, 1);
+		dpage++;
 	}
 }
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 15ab743f68c8f..cdf8497d19d27 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -159,7 +159,7 @@ static inline void scatterwalk_map(struct scatter_walk *walk)
 	if (IS_ENABLED(CONFIG_HIGHMEM)) {
 		struct page *page;
 
-		page = nth_page(base_page, offset >> PAGE_SHIFT);
+		page = base_page + offset / PAGE_SIZE;
 		offset = offset_in_page(offset);
 		addr = kmap_local_page(page) + offset;
 	} else {
@@ -259,7 +259,7 @@ static inline void scatterwalk_done_dst(struct scatter_walk *walk,
 	end += (offset_in_page(offset) + offset_in_page(nbytes) +
 		PAGE_SIZE - 1) >> PAGE_SHIFT;
 	for (i = start; i < end; i++)
-		flush_dcache_page(nth_page(base_page, i));
+		flush_dcache_page(base_page + i);
 	scatterwalk_advance(walk, nbytes);
 }
On Thu, 21 Aug 2025 at 16:08, David Hildenbrand david@redhat.com wrote:
page = nth_page(page, offset >> PAGE_SHIFT);
page += offset / PAGE_SIZE;
Please keep the " >> PAGE_SHIFT" form.
Is "offset" unsigned? Yes it is. But I had to look at the source code to make sure, because it wasn't locally obvious from the patch. And I'd rather we keep a pattern that is "safe", in that it doesn't generate strange code if the value might be an 's64' (e.g. loff_t) on 32-bit architectures.
Because doing a 64-bit shift on x86-32 is like three cycles. Doing a 64-bit signed division by a simple constant is something like ten strange instructions even if the end result is only 32-bit.
And again - not the case *here*, but just a general "let's keep to one pattern", and the shift pattern is simply the better choice.
Linus
On 21.08.25 22:24, Linus Torvalds wrote:
On Thu, 21 Aug 2025 at 16:08, David Hildenbrand david@redhat.com wrote:
page = nth_page(page, offset >> PAGE_SHIFT);
page += offset / PAGE_SIZE;
Please keep the " >> PAGE_SHIFT" form.
No strong opinion.
I was primarily doing it to get rid of (in other cases) the parentheses.
Like in patch #29
-	/* Assumption: contiguous pages can be accessed as "page + i" */
-	page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
+	page = sg_page(sg) + *offset / PAGE_SIZE;
Is "offset" unsigned? Yes it is. But I had to look at the source code to make sure, because it wasn't locally obvious from the patch. And I'd rather we keep a pattern that is "safe", in that it doesn't generate strange code if the value might be an 's64' (e.g. loff_t) on 32-bit architectures.
Because doing a 64-bit shift on x86-32 is like three cycles. Doing a 64-bit signed division by a simple constant is something like ten strange instructions even if the end result is only 32-bit.
I would have thought that the compiler is smart enough to optimize that? PAGE_SIZE is a constant.
And again - not the case *here*, but just a general "let's keep to one pattern", and the shift pattern is simply the better choice.
It's a wild mixture, but I can keep doing what we already do in these cases.
Oh, and your reply was an invalid email and ended up in my spam-box:
From: David Hildenbrand <david@redhat.com>
but you apparently didn't use the redhat mail system, so the DKIM signing fails
dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE) header.from=redhat.com
and it gets marked as spam.
I think you may have gone through smtp.kernel.org, but then you need to use your kernel.org email address to get the DKIM right.
Linus
On 21.08.25 22:29, David Hildenbrand wrote:
On 21.08.25 22:24, Linus Torvalds wrote:
On Thu, 21 Aug 2025 at 16:08, David Hildenbrand david@redhat.com wrote:
page = nth_page(page, offset >> PAGE_SHIFT);
page += offset / PAGE_SIZE;
Please keep the " >> PAGE_SHIFT" form.
No strong opinion.
I was primarily doing it to get rid of (in other cases) the parentheses.
Like in patch #29
- /* Assumption: contiguous pages can be accessed as "page + i" */
- page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
+ page = sg_page(sg) + *offset / PAGE_SIZE;
Is "offset" unsigned? Yes it is. But I had to look at the source code to make sure, because it wasn't locally obvious from the patch. And I'd rather we keep a pattern that is "safe", in that it doesn't generate strange code if the value might be an 's64' (e.g. loff_t) on 32-bit architectures.
Because doing a 64-bit shift on x86-32 is like three cycles. Doing a 64-bit signed division by a simple constant is something like ten strange instructions even if the end result is only 32-bit.
I would have thought that the compiler is smart enough to optimize that? PAGE_SIZE is a constant.
It's late; I get your point: the compiler can't optimize it the same way if it's a signed value ...
On Thu, Aug 21, 2025 at 4:29 PM David Hildenbrand david@redhat.com wrote:
Because doing a 64-bit shift on x86-32 is like three cycles. Doing a 64-bit signed division by a simple constant is something like ten strange instructions even if the end result is only 32-bit.
I would have thought that the compiler is smart enough to optimize that? PAGE_SIZE is a constant.
Oh, the compiler optimizes things. But dividing a 64-bit signed value with a constant is still quite complicated.
It doesn't generate a 'div' instruction, but it generates something like this:
        movl    %ebx, %edx
        sarl    $31, %edx
        movl    %edx, %eax
        xorl    %edx, %edx
        andl    $4095, %eax
        addl    %ecx, %eax
        adcl    %ebx, %edx
and that's certainly a lot faster than an actual 64-bit divide would be.
An unsigned divide - or a shift - results in just
shrdl $12, %ecx, %eax
which is still not the fastest instruction (I think shrdl gets split into two uops), but it's certainly simpler and easier to read.
Linus
There is the concern that unpin_user_page_range_dirty_lock() might do some weird merging of PFN ranges -- either now or in the future -- such that the PFN range is contiguous but the page range might not be.
Let's sanity-check for that and drop the nth_page() usage.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index f017ff6d7d61a..0a669a766204b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -237,7 +237,7 @@ void folio_add_pin(struct folio *folio)
 static inline struct folio *gup_folio_range_next(struct page *start,
 		unsigned long npages, unsigned long i, unsigned int *ntails)
 {
-	struct page *next = nth_page(start, i);
+	struct page *next = start + i;
 	struct folio *folio = page_folio(next);
 	unsigned int nr = 1;
 
@@ -342,6 +342,9 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
  * "gup-pinned page range" refers to a range of pages that has had one of the
  * pin_user_pages() variants called on that page.
  *
+ * The page range must be truly contiguous: the page range corresponds
+ * to a contiguous PFN range and all pages can be iterated naturally.
+ *
  * For the page ranges defined by [page .. page+npages], make that range (or
  * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
  * page range was previously listed as clean.
@@ -359,6 +362,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 	struct folio *folio;
 	unsigned int nr;
 
+	VM_WARN_ON_ONCE(!page_range_contiguous(page, npages));
+
 	for (i = 0; i < npages; i += nr) {
 		folio = gup_folio_range_next(page, npages, i, &nr);
 		if (make_dirty && !folio_test_dirty(folio)) {
We want to get rid of nth_page(), and kfence init code is the last user.
Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP).
We could check whether the page range is contiguous using page_range_contiguous() and fail kfence init, or make kfence incompatible with these problematic kernel configs.
Let's keep it simple and simply use pfn_to_page() by iterating PFNs.
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/kfence/core.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 0ed3be100963a..793507c77f9e8 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -594,15 +594,15 @@ static void rcu_guarded_free(struct rcu_head *h)
  */
 static unsigned long kfence_init_pool(void)
 {
-	unsigned long addr;
-	struct page *pages;
+	unsigned long addr, pfn, start_pfn, end_pfn;
 	int i;
 
 	if (!arch_kfence_init_pool())
 		return (unsigned long)__kfence_pool;
 
 	addr = (unsigned long)__kfence_pool;
-	pages = virt_to_page(__kfence_pool);
+	start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool));
+	end_pfn = start_pfn + KFENCE_POOL_SIZE / PAGE_SIZE;
 
 	/*
 	 * Set up object pages: they must have PGTY_slab set to avoid freeing
@@ -612,12 +612,13 @@ static unsigned long kfence_init_pool(void)
 	 * fast-path in SLUB, and therefore need to ensure kfree() correctly
 	 * enters __slab_free() slow-path.
 	 */
-	for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
-		struct slab *slab = page_slab(nth_page(pages, i));
+	for (pfn = start_pfn; pfn != end_pfn; pfn++) {
+		struct slab *slab;
 
 		if (!i || (i % 2))
 			continue;
 
+		slab = page_slab(pfn_to_page(pfn));
 		__folio_set_slab(slab_folio(slab));
 #ifdef CONFIG_MEMCG
 		slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts |
@@ -664,11 +665,13 @@ static unsigned long kfence_init_pool(void)
 	return 0;
 
 reset_slab:
-	for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
-		struct slab *slab = page_slab(nth_page(pages, i));
+	for (pfn = start_pfn; pfn != end_pfn; pfn++) {
+		struct slab *slab;
 
 		if (!i || (i % 2))
 			continue;
+
+		slab = page_slab(pfn_to_page(pfn));
 #ifdef CONFIG_MEMCG
 		slab->obj_exts = 0;
 #endif
On 21.08.25 22:06, David Hildenbrand wrote:
We want to get rid of nth_page(), and kfence init code is the last user.
Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP).
We could check whether the page range is contiguous using page_range_contiguous() and fail kfence init, or make kfence incompatible with these problematic kernel configs.
Let's keep it simple and simply use pfn_to_page() by iterating PFNs.
Fortunately this series is RFC due to lack of detailed testing :P
Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()).
Will look into that tomorrow.
On 21.08.25 22:32, David Hildenbrand wrote:
On 21.08.25 22:06, David Hildenbrand wrote:
We want to get rid of nth_page(), and kfence init code is the last user.
Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP).
We could check whether the page range is contiguous using page_range_contiguous() and fail kfence init, or make kfence incompatible with these problematic kernel configs.
Let's keep it simple and simply use pfn_to_page() by iterating PFNs.
Fortunately this series is RFC due to lack of detailed testing :P
Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()).
Will look into that tomorrow.
Okay, easy: relying on i but not updating it. /me facepalm
Ever since commit 858c708d9efb ("block: move the bi_size update out of __bio_try_merge_page"), page_is_mergeable() no longer exists, and the logic in bvec_try_merge_page() is now a simple page pointer comparison.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/bvec.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 0a80e1f9aa201..3fc0efa0825b1 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,11 +22,8 @@ struct page;
  * @bv_len: Number of bytes in the address range.
  * @bv_offset: Start of the address range relative to the start of @bv_page.
  *
- * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
- *
- *   nth_page(@bv_page, n) == @bv_page + n
- *
- * This holds because page_is_mergeable() checks the above property.
+ * All pages within a bio_vec starting from @bv_page are contiguous and
+ * can simply be iterated (see bvec_advance()).
  */
 struct bio_vec {
 	struct page	*bv_page;
Now that all users are gone, let's remove it.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h                   | 2 --
 tools/testing/scatterlist/linux/mm.h | 1 -
 2 files changed, 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f59ad1f9fc792..3ded0db8322f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,9 +210,7 @@ extern unsigned long sysctl_admin_reserve_kbytes;
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 bool page_range_contiguous(const struct page *page, unsigned long nr_pages);
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 #else
-#define nth_page(page,n) ((page) + (n))
 static inline bool page_range_contiguous(const struct page *page,
 		unsigned long nr_pages)
 {
diff --git a/tools/testing/scatterlist/linux/mm.h b/tools/testing/scatterlist/linux/mm.h
index 5bd9e6e806254..121ae78d6e885 100644
--- a/tools/testing/scatterlist/linux/mm.h
+++ b/tools/testing/scatterlist/linux/mm.h
@@ -51,7 +51,6 @@ static inline unsigned long page_to_phys(struct page *page)
 
 #define page_to_pfn(page) ((unsigned long)(page) / PAGE_SIZE)
 #define pfn_to_page(pfn) (void *)((pfn) * PAGE_SIZE)
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
 #define __min(t1, t2, min1, min2, x, y) ({	\
 	t1 min1 = (x);				\
syzbot ci has tested the following series
[v1] mm: remove nth_page()
https://lore.kernel.org/all/20250821200701.1329277-1-david@redhat.com

* [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
* [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
* [PATCH RFC 03/35] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
* [PATCH RFC 04/35] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
* [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
* [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
* [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages()
* [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate
* [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
* [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
* [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order()
* [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
* [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
* [PATCH RFC 14/35] mm/mm/percpu-km: drop nth_page() usage within single allocation
* [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison()
* [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
* [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages
* [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
* [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio
* [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages()
* [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
* [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap()
* [PATCH RFC 23/35] scatterlist: disallow non-contigous page ranges in a single SG entry
* [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
* [PATCH RFC 25/35] drm/i915/gem: drop nth_page() usage within SG entry
* [PATCH RFC 26/35] mspro_block: drop nth_page() usage within SG entry
* [PATCH RFC 27/35] memstick: drop nth_page() usage within SG entry
* [PATCH RFC 28/35] mmc: drop nth_page() usage within SG entry
* [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry
* [PATCH RFC 30/35] vfio/pci: drop nth_page() usage within SG entry
* [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
* [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
* [PATCH RFC 33/35] kfence: drop nth_page() usage
* [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page()
* [PATCH RFC 35/35] mm: remove nth_page()
and found the following issue: general protection fault in kfence_guarded_alloc
Full report is available here: https://ci.syzbot.org/series/f6f0aea1-9616-4675-8c80-f9b59ba3211c
***
general protection fault in kfence_guarded_alloc
tree:     net-next
URL:      https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:     da114122b83149d1f1db0586b1d67947b651aa20
arch:     amd64
compiler: Debian clang version 20.1.7 (++20250616065708+6146a88f6049-1~exp1~20250616065826.132), Debian LLD 20.1.7
config:   https://ci.syzbot.org/builds/705b7862-eb10-40bd-a4cf-4820b4912466/config
smpboot: CPU0: Intel(R) Xeon(R) CPU @ 2.80GHz (family: 0x6, model: 0x55, stepping: 0x7)
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:kfence_guarded_alloc+0x643/0xc70
Code: 41 c1 e5 18 bf 00 00 00 f5 44 89 ee e8 a6 67 9c ff 45 31 f6 41 81 fd 00 00 00 f5 4c 0f 44 f3 49 8d 7e 08 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 05 e8 f1 cb ff ff 4c 8b 6c 24 18 4d 89 6e 08 49
RSP: 0000:ffffc90000047740 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0004d90080 RCX: 0000000000000000
RDX: ffff88801c2e8000 RSI: 00000000ff000000 RDI: 0000000000000008
RBP: ffffc90000047850 R08: ffffffff99b2201b R09: 1ffffffff3364403
R10: dffffc0000000000 R11: fffffbfff3364404 R12: dffffc0000000000
R13: 00000000ff000000 R14: 0000000000000000 R15: ffff88813fec7068
FS:  0000000000000000(0000) GS:ffff8880b861c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88813ffff000 CR3: 000000000df36000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __kfence_alloc+0x385/0x3b0
 __kmalloc_noprof+0x440/0x4f0
 __alloc_workqueue+0x103/0x1b70
 alloc_workqueue_noprof+0xd4/0x210
 init_mm_internals+0x17/0x140
 kernel_init_freeable+0x307/0x4b0
 kernel_init+0x1d/0x1d0
 ret_from_fork+0x3f9/0x770
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:kfence_guarded_alloc+0x643/0xc70
Code: 41 c1 e5 18 bf 00 00 00 f5 44 89 ee e8 a6 67 9c ff 45 31 f6 41 81 fd 00 00 00 f5 4c 0f 44 f3 49 8d 7e 08 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 05 e8 f1 cb ff ff 4c 8b 6c 24 18 4d 89 6e 08 49
RSP: 0000:ffffc90000047740 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0004d90080 RCX: 0000000000000000
RDX: ffff88801c2e8000 RSI: 00000000ff000000 RDI: 0000000000000008
RBP: ffffc90000047850 R08: ffffffff99b2201b R09: 1ffffffff3364403
R10: dffffc0000000000 R11: fffffbfff3364404 R12: dffffc0000000000
R13: 00000000ff000000 R14: 0000000000000000 R15: ffff88813fec7068
FS:  0000000000000000(0000) GS:ffff8880b861c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88813ffff000 CR3: 000000000df36000 CR4: 0000000000350ef0
***
If these findings have caused you to resend the series or submit a separate fix, please add the following tag to your commit message: Tested-by: syzbot@syzkaller.appspotmail.com
--- This report is generated by a bot. It may contain errors. syzbot ci engineers can be reached at syzkaller@googlegroups.com.
On Thu, Aug 21, 2025 at 10:06:26PM +0200, David Hildenbrand wrote:
As discussed recently with Linus, nth_page() is just nasty and we would like to remove it.
To recap, the reason we currently need nth_page() within a folio is because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing. Instead, nth_page() on these problematic configs always goes from page->pfn, then from (++pfn)->page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when it might be dropped.
We refer to such problematic PFN ranges as "non-contiguous pages". If we only deal with "contiguous pages", there is no need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page() within CMA allocations (again, could span memory sections), and in one corner case (kfence) when processing memblock allocations (again, could span memory sections).
I browsed the patches and it looks great to me, thanks for doing this
Jason