Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level.
This makes the feature less useful, as it isn't apparent which VMAs may have guard regions present, especially when performing actions which walk through memory regions, such as those performed by CRIU.
This series addresses this issue by introducing the VM_MAYBE_GUARD flag, which fulfils this role, and by updating the smaps logic to display an entry for VMAs which have it set.
The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present.
It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason.
To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which:
a. If set in one VMA and not in another, still permit those VMAs to be merged (if otherwise compatible).
b. When they are merged, the resultant VMA must have the flag set.
The VMA logic is updated to propagate these flags correctly.
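To illustrate the intended semantics, here is a minimal sketch (illustrative only - VM_STICKY_EXAMPLE below is a stand-in for whatever mask of sticky flags ends up being defined, and this is not the mm/vma.c implementation):

#define VM_STICKY_EXAMPLE 0x00000800UL	/* stand-in for the sticky flag mask */

/*
 * The merged VMA must carry any sticky flag that either of the two
 * pre-merge VMAs possessed.
 */
static unsigned long merged_flags(unsigned long target_flags,
				  unsigned long other_flags)
{
	return target_flags | (other_flags & VM_STICKY_EXAMPLE);
}

So if only one of two otherwise-compatible VMAs has a sticky flag set, the merge proceeds and the resultant VMA has it set.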
Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously we established an anon_vma object for file-backed mappings solely so that vma_needs_copy() would correctly propagate guard region mappings to child processes.
We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue.
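A simplified sketch of the resulting fork-time decision (illustrative only - the real logic lives in vma_needs_copy() in mm/memory.c, and VM_COPY_ON_FORK_EXAMPLE is a stand-in value):

#include <stdbool.h>

#define VM_COPY_ON_FORK_EXAMPLE 0x00000800UL	/* currently just VM_MAYBE_GUARD */

/* Decide whether fork must copy page tables rather than rely on re-faulting. */
static bool needs_page_table_copy(unsigned long src_flags, bool has_anon_vma,
				  bool pfn_or_mixed_map, bool uffd_wp)
{
	if (uffd_wp || pfn_or_mixed_map || has_anon_vma)
		return true;

	/* Guard region markers cannot be recreated by a fault, so copy them. */
	return (src_flags & VM_COPY_ON_FORK_EXAMPLE) != 0;
}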
Additionally, we add the ability for allow-listed VMA flags to be atomically writable with only mmap/VMA read locks held.
The only flag we allow so far is VM_MAYBE_GUARD, and we carefully ensure that permitting this does not introduce any races.
This allows us to maintain guard region installation as a read-locked operation and not endure the overhead of obtaining a write lock here.
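Roughly, the rule being introduced can be modelled as follows (a userspace sketch under stated assumptions - the names and the allow-list mask are stand-ins, not the kernel API):

#define MAYBE_GUARD_BIT_EXAMPLE		11UL
#define ATOMIC_SET_ALLOWED_EXAMPLE	(1UL << MAYBE_GUARD_BIT_EXAMPLE)

/*
 * Only allow-listed bits may be set with just the read lock held. An atomic
 * RMW guarantees concurrent read-locked setters cannot clobber other bits,
 * while non-atomic updaters are excluded since they must hold the write lock.
 */
static int flag_set_atomic(unsigned long *flags, unsigned long bit)
{
	unsigned long mask = 1UL << bit;

	if (!(mask & ATOMIC_SET_ALLOWED_EXAMPLE))
		return -1;	/* not allow-listed: caller must take the write lock */

	__atomic_fetch_or(flags, mask, __ATOMIC_RELAXED);
	return 0;
}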
Finally, we introduce extensive userland VMA tests to assert that the sticky VMA logic behaves correctly, as well as guard region self-tests to assert that smaps visibility is correctly implemented.
v2:
* Separated out userland VMA tests for sticky behaviour as per Suren.
* Added the concept of atomic writable VMA flags as per Pedro and Vlastimil.
* Made VM_MAYBE_GUARD an atomic writable flag so we don't have to take a VMA write lock in madvise() as per Pedro and Vlastimil.
v1: https://lore.kernel.org/all/cover.1761756437.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
  mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
  mm: add atomic VMA flags, use VM_MAYBE_GUARD as such
  mm: implement sticky, copy on fork VMA flags
  tools/testing/vma: add VMA sticky userland tests
  selftests/mm/guard-regions: add smaps visibility test

 Documentation/filesystems/proc.rst         |   1 +
 fs/proc/task_mmu.c                         |   1 +
 include/linux/mm.h                         |  58 ++++++++++
 include/trace/events/mmflags.h             |   1 +
 mm/madvise.c                               |  22 ++--
 mm/memory.c                                |   3 +
 mm/vma.c                                   |  22 ++--
 tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c       |   5 +
 tools/testing/selftests/mm/vm_util.h       |   1 +
 tools/testing/vma/vma.c                    |  89 +++++++++++++--
 tools/testing/vma/vma_internal.h           |  35 ++++++
 12 files changed, 330 insertions(+), 28 deletions(-)
-- 2.51.0
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).
Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).
We utilise 0x800 for this flag, which makes it available to 32-bit architectures also. This bit was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't been reused yet.
We also update the smaps logic and documentation to identify these VMAs.
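As a purely illustrative example of what this enables (not part of this patch or the selftests added later in the series), userspace could detect such a mapping along these lines:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Scan /proc/self/smaps for a VmFlags line carrying the new "gu" mnemonic. */
static bool any_vma_maybe_guarded(void)
{
	char line[512];
	bool found = false;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f)
		return false;

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "VmFlags:", 8) == 0 && strstr(line, " gu ")) {
			found = true;
			break;
		}
	}
	fclose(f);
	return found;
}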
Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.
We do not actually implement usage of this flag in mm/madvise.c yet as we need to allow some VMA flags to be applied atomically under mmap/VMA read lock in order to avoid the need to acquire a write lock for this purpose.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 Documentation/filesystems/proc.rst | 1 +
 fs/proc/task_mmu.c                 | 1 +
 include/linux/mm.h                 | 3 +++
 include/trace/events/mmflags.h     | 1 +
 mm/memory.c                        | 4 ++++
 tools/testing/vma/vma_internal.h   | 3 +++
 6 files changed, 13 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 0b86a8022fa1..b8a423ca590a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -591,6 +591,7 @@ encoded manner. The codes are the following:
     sl    sealed
     lf    lock on fault pages
     dp    always lazily freeable mapping
+    gu    maybe contains guard regions (if not set, definitely doesn't)
     ==    =======================================

 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a9894aefbca..a420dcf9ffbb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1147,6 +1147,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MAYSHARE)]	= "ms",
 		[ilog2(VM_GROWSDOWN)]	= "gd",
 		[ilog2(VM_PFNMAP)]	= "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e5ca5287e21..2a5516bff75a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -271,6 +271,8 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif

+#define VM_MAYBE_GUARD_BIT	11
+
 /*
  * vm_flags in vm_area_struct, see mm_types.h.
  * When changing, update also include/trace/events/mmflags.h
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_UFFD_MISSING	0
 #endif /* CONFIG_MMU */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
+#define VM_MAYBE_GUARD	BIT(VM_MAYBE_GUARD_BIT)	/* The VMA maybe contains guard regions. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */

 #define VM_LOCKED	0x00002000
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index aa441f593e9a..a6e5a44c9b42 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
 	{VM_UFFD_MISSING,	"uffd_missing"	}, \
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	) \
 	{VM_PFNMAP,	"pfnmap"	}, \
+	{VM_MAYBE_GUARD,	"maybe_guard"	}, \
 	{VM_UFFD_WP,	"uffd_wp"	}, \
 	{VM_LOCKED,	"locked"	}, \
 	{VM_IO,	"io"	}, \
diff --git a/mm/memory.c b/mm/memory.c
index 046579a6ec2f..334732ab6733 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1480,6 +1480,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;

+	/* Guard regions have modified page tables that require copying. */
+	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+		return true;
+
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly. Fork
 	 * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index c68d382dac81..ddf58a5e1add 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -46,6 +46,8 @@ extern unsigned long dac_mmap_min_addr;

 #define MMF_HAS_MDWE	28

+#define VM_MAYBE_GUARD_BIT	11
+
 #define VM_NONE		0x00000000
 #define VM_READ		0x00000001
 #define VM_WRITE	0x00000002
@@ -56,6 +58,7 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_MAYEXEC	0x00000040
 #define VM_GROWSDOWN	0x00000100
 #define VM_PFNMAP	0x00000400
+#define VM_MAYBE_GUARD	BIT(VM_MAYBE_GUARD_BIT)	/* The VMA maybe contains guard regions. */
 #define VM_LOCKED	0x00002000
 #define VM_IO		0x00004000
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
On 11/6/25 11:46, Lorenzo Stoakes wrote:
Reviewed-by: Vlastimil Babka vbabka@suse.cz
On Thu, Nov 06, 2025 at 12:12:10PM +0100, Vlastimil Babka wrote:
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Thanks
On Thu, Nov 06, 2025 at 10:46:12AM +0000, Lorenzo Stoakes wrote:
Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 3 +++ include/trace/events/mmflags.h | 1 + mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 3 +++ 6 files changed, 13 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping
+    gu    maybe contains guard regions (if not set, definitely doesn't)
     ==    =======================================
The nittiest of nits: =============================================================
Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 8a9894aefbca..a420dcf9ffbb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1147,6 +1147,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e5ca5287e21..2a5516bff75a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -271,6 +271,8 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif

+#define VM_MAYBE_GUARD_BIT	11
/*
- vm_flags in vm_area_struct, see mm_types.h.
- When changing, update also include/trace/events/mmflags.h
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD BIT(VM_MAYBE_GUARD_BIT) /* The VMA maybe contains guard regions. */
Don't we also need an adjustment on the rust side for this BIT()? Like we for f04aad36a07c ("mm/ksm: fix flag-dropping behavior in ksm_madvise").
In any case: Reviewed-by: Pedro Falcato pfalcato@suse.de
+cc Alice for rust stuff
On Thu, Nov 06, 2025 at 02:27:56PM +0000, Pedro Falcato wrote:
On Thu, Nov 06, 2025 at 10:46:12AM +0000, Lorenzo Stoakes wrote:
Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 3 +++ include/trace/events/mmflags.h | 1 + mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 3 +++ 6 files changed, 13 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping
+    gu    maybe contains guard regions (if not set, definitely doesn't)
     ==    =======================================
The nittiest of nits: =============================================================
Sigh :) OK will fix.
Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 8a9894aefbca..a420dcf9ffbb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1147,6 +1147,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e5ca5287e21..2a5516bff75a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -271,6 +271,8 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif
+#define VM_MAYBE_GUARD_BIT 11
/*
- vm_flags in vm_area_struct, see mm_types.h.
- When changing, update also include/trace/events/mmflags.h
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD BIT(VM_MAYBE_GUARD_BIT) /* The VMA maybe contains guard regions. */
Don't we also need an adjustment on the rust side for this BIT()? Like we for f04aad36a07c ("mm/ksm: fix flag-dropping behavior in ksm_madvise").
That's a bit unhelpful if rust can't cope with extremely basic assignments like that and we just have to know to add helpers :/
We do BIT() stuff for e.g. VM_HIGH_ARCH_n, VM_UFFD_MINOR_BIT, VM_ALLOW_ANY_UNCACHED_BIT, VM_DROPPABLE_BIT and VM_SEALED_BIT too and no such helpers there, So not sure if this is required?
Alice - why is it these 'non-trivial' defines were fine but VM_MERGEABLE was problematic? That seems strange.
I see [0], so let me build rust here and see if it moans, if it moans I'll add it.
[0]:https://lore.kernel.org/oe-kbuild-all/CANiq72kOhRdGtQe2UVYmDLdbw6VNkiMtdFzkQ...
In any case: Reviewed-by: Pedro Falcato pfalcato@suse.de
Thanks
-- Pedro
On Thu, Nov 06, 2025 at 02:54:33PM +0000, Lorenzo Stoakes wrote:
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD BIT(VM_MAYBE_GUARD_BIT) /* The VMA maybe contains guard regions. */
Don't we also need an adjustment on the rust side for this BIT()? Like we for f04aad36a07c ("mm/ksm: fix flag-dropping behavior in ksm_madvise").
That's a bit unhelpful if rust can't cope with extremely basic assignments like that and we just have to know to add helpers :/
We do BIT() stuff for e.g. VM_HIGH_ARCH_n, VM_UFFD_MINOR_BIT, VM_ALLOW_ANY_UNCACHED_BIT, VM_DROPPABLE_BIT and VM_SEALED_BIT too and no such helpers there, So not sure if this is required?
Alice - why is it these 'non-trivial' defines were fine but VM_MERGEABLE was problematic? That seems strange.
I see [0], so let me build rust here and see if it moans, if it moans I'll add it.
I built with CONFIG_RUST=y and everything compiles ok so seems rust is fine with it?
Strange that we need it for some things but not others though?
On Thu, Nov 06, 2025 at 02:54:33PM +0000, Lorenzo Stoakes wrote:
+cc Alice for rust stuff
On Thu, Nov 06, 2025 at 02:27:56PM +0000, Pedro Falcato wrote:
On Thu, Nov 06, 2025 at 10:46:12AM +0000, Lorenzo Stoakes wrote:
+#define VM_MAYBE_GUARD_BIT 11
/*
- vm_flags in vm_area_struct, see mm_types.h.
- When changing, update also include/trace/events/mmflags.h
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD BIT(VM_MAYBE_GUARD_BIT) /* The VMA maybe contains guard regions. */
Don't we also need an adjustment on the rust side for this BIT()? Like we for f04aad36a07c ("mm/ksm: fix flag-dropping behavior in ksm_madvise").
That's a bit unhelpful if rust can't cope with extremely basic assignments like that and we just have to know to add helpers :/
We do BIT() stuff for e.g. VM_HIGH_ARCH_n, VM_UFFD_MINOR_BIT, VM_ALLOW_ANY_UNCACHED_BIT, VM_DROPPABLE_BIT and VM_SEALED_BIT too and no such helpers there, So not sure if this is required?
Alice - why is it these 'non-trivial' defines were fine but VM_MERGEABLE was problematic? That seems strange.
I see [0], so let me build rust here and see if it moans, if it moans I'll add it.
When you use #define to declare a constant whose right-hand-side contains a function-like macro such as BIT(), bindgen does not define a Rust version of that constant. However, VM_MAYBE_GUARD is not referenced in Rust anywhere, so that isn't a problem.
It was a problem with VM_MERGEABLE because rust/kernel/mm/virt.rs references it.
Note that it's only the combination of #define and function-like macro that triggers this condition. If the constant is defined using another mechanism such as enum {}, then bindgen will generate the constant no matter how complex the right-hand-side is. The problem is that bindgen can't tell whether a #define is just a constant or not.
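To make the distinction concrete (an illustrative snippet - these example names are invented, not taken from the kernel tree):

#include <linux/bits.h>

/*
 * bindgen does NOT emit a Rust constant for this form: the right-hand side
 * uses a function-like macro, so bindgen cannot tell it is a plain constant.
 */
#define VM_EXAMPLE_DEFINE	BIT(11)

/*
 * bindgen DOES emit a Rust constant for this form, however complex the
 * expression, because the C compiler evaluates the enumerator itself.
 */
enum {
	VM_EXAMPLE_ENUM = 1UL << 11,
};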
Alice
On Fri, Nov 07, 2025 at 09:13:00AM +0000, Alice Ryhl wrote:
On Thu, Nov 06, 2025 at 02:54:33PM +0000, Lorenzo Stoakes wrote:
+cc Alice for rust stuff
On Thu, Nov 06, 2025 at 02:27:56PM +0000, Pedro Falcato wrote:
On Thu, Nov 06, 2025 at 10:46:12AM +0000, Lorenzo Stoakes wrote:
+#define VM_MAYBE_GUARD_BIT 11
/*
- vm_flags in vm_area_struct, see mm_types.h.
- When changing, update also include/trace/events/mmflags.h
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD BIT(VM_MAYBE_GUARD_BIT) /* The VMA maybe contains guard regions. */
Don't we also need an adjustment on the rust side for this BIT()? Like we for f04aad36a07c ("mm/ksm: fix flag-dropping behavior in ksm_madvise").
That's a bit unhelpful if rust can't cope with extremely basic assignments like that and we just have to know to add helpers :/
We do BIT() stuff for e.g. VM_HIGH_ARCH_n, VM_UFFD_MINOR_BIT, VM_ALLOW_ANY_UNCACHED_BIT, VM_DROPPABLE_BIT and VM_SEALED_BIT too and no such helpers there, So not sure if this is required?
Alice - why is it these 'non-trivial' defines were fine but VM_MERGEABLE was problematic? That seems strange.
I see [0], so let me build rust here and see if it moans, if it moans I'll add it.
When you use #define to declare a constant whose right-hand-side contains a function-like macro such as BIT(), bindgen does not define a Rust version of that constant. However, VM_MAYBE_GUARD is not referenced in Rust anywhere, so that isn't a problem.
It was a problem with VM_MERGEABLE because rust/kernel/mm/virt.rs references it.
Note that it's only the combination of #define and function-like macro that triggers this condition. If the constant is defined using another mechanism such as enum {}, then bindgen will generate the constant no matter how complex the right-hand-side is. The problem is that bindgen can't tell whether a #define is just a constant or not.
Alice
Thanks, I guess we can update as we go as rust needs. Or I can do a big update as part of my VMA flag series respin?
On Fri, Nov 07, 2025 at 09:44:22AM +0000, Lorenzo Stoakes wrote:
On Fri, Nov 07, 2025 at 09:13:00AM +0000, Alice Ryhl wrote:
Thanks, I guess we can update as we go as rust needs. Or I can do a big update as part of my VMA flag series respin?
Whenever you think is a good time works for me.
I think it would be nice to move those constants so they use enum {} instead of #define at some point.
Alice
On Fri, Nov 07, 2025 at 12:12:43PM +0000, Alice Ryhl wrote:
Thanks, I guess we can update as we go as rust needs. Or I can do a big update as part of my VMA flag series respin?
Whenever you think is a good time works for me.
I think it would be nice to move those constants so they use enum {} instead of #define at some point.
Yeah I will do as part of my VMA series :) which actually is a neater solution to this in general (and can drop the existing binding helpers then actually).
Alice
Thanks, Lorenzo
This patch adds the ability to atomically set VMA flags with only the mmap read/VMA read lock held.
As this could be hugely problematic for VMA flags in general given that all other accesses are non-atomic and serialised by the mmap/VMA locks, we implement this with a strict allow-list - that is, only designated flags are allowed to do this.
We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap read lock upon guard region installation.
The places where this flag is currently used and where it matters are:
* VMA merge - performed under mmap/VMA write lock, therefore excluding racing writes.
* /proc/$pid/smaps - can race the write, however this isn't meaningful as the flag write is performed at the point of the guard region being established, and thus an smaps reader can't reasonably expect to avoid races. Due to atomicity, a reader will observe either the flag being set or not. Therefore consistency will be maintained.
In all other cases the flag being set is irrelevant and atomicity guarantees other flags will be read correctly.
We additionally update madvise_guard_install() to ensure that anon_vma_prepare() is set for anonymous VMAs to maintain consistency with the assumption that any anonymous VMA with page tables will have an anon_vma set, and any with an anon_vma unset will not have page tables established.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 include/linux/mm.h | 23 +++++++++++++++++++++++
 mm/madvise.c       | 22 ++++++++++++++--------
 2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a5516bff75a..2ea65c646212 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask represents all the VMA flag bits used by mlock */
 #define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)

+/* These flags can be updated atomically via VMA/mmap read lock. */
+#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
+
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
 # define VM_ARCH_CLEAR VM_NONE
@@ -860,6 +863,26 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
 	__vm_flags_mod(vma, set, clear);
 }

+/*
+ * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
+ * valid flags are allowed to do this.
+ */
+static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
+				       int bit)
+{
+	const vm_flags_t mask = BIT(bit);
+
+	/* mmap read lock/VMA read lock must be held. */
+	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
+		vma_assert_locked(vma);
+
+	/* Only specific flags are permitted */
+	if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
+		return;
+
+	set_bit(bit, &vma->__vm_flags);
+}
+
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
 {
 	vma->vm_ops = NULL;
diff --git a/mm/madvise.c b/mm/madvise.c
index 67bdfcb315b3..de918b107cfc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1139,15 +1139,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		return -EINVAL;

 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
-	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * Set atomically under read lock. All pertinent readers will need to
+	 * acquire an mmap/VMA write lock to read it. All remaining readers may
+	 * or may not see the flag set, but we don't care.
+	 */
+	vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
+
+	/*
+	 * If anonymous and we are establishing page tables the VMA ought to
+	 * have an anon_vma associated with it.
 	 */
-	err = anon_vma_prepare(vma);
-	if (err)
-		return err;
+	if (vma_is_anonymous(vma)) {
+		err = anon_vma_prepare(vma);
+		if (err)
+			return err;
+	}

 	/*
 	 * Optimistically try to install the guard marker pages first. If any
On 11/6/25 11:46, Lorenzo Stoakes wrote:
This patch adds the ability to atomically set VMA flags with only the mmap read/VMA read lock held.
As this could be hugely problematic for VMA flags in general given that all other accesses are non-atomic and serialised by the mmap/VMA locks, we implement this with a strict allow-list - that is, only designated flags are allowed to do this.
We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap read lock upon guard region installation.
The places where this flag is used currently and matter are:
VMA merge - performed under mmap/VMA write lock, therefore excluding racing writes.
/proc/$pid/smaps - can race the write, however this isn't meaningful as the flag write is performed at the point of the guard region being established, and thus an smaps reader can't reasonably expect to avoid races. Due to atomicity, a reader will observe either the flag being set or not. Therefore consistency will be maintained.
In all other cases the flag being set is irrelevant and atomicity guarantees other flags will be read correctly.
Could we maybe also spell out that we rely on the read mmap/VMA lock to exclude with writers that have write lock and then use non-atomic updates to update completely different flags than VM_MAYBE_GUARD? Those non-atomic updates could cause RMW races when only our side uses an atomic update, but the trick is that the read lock excludes with the write lock.
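Perhaps something along these lines (a userspace model of that invariant, not kernel code - a pthread rwlock stands in for the mmap lock and the names are invented):

#include <pthread.h>

static pthread_rwlock_t mmap_lock_model = PTHREAD_RWLOCK_INITIALIZER;
static unsigned long vm_flags_model;

/* Write-locked updaters keep using plain, non-atomic RMW on the flags word. */
static void set_other_flags_write_locked(unsigned long set)
{
	pthread_rwlock_wrlock(&mmap_lock_model);
	vm_flags_model |= set;	/* safe: no read-locked setter can run now */
	pthread_rwlock_unlock(&mmap_lock_model);
}

/* The read-locked path only ever sets the allow-listed bit, atomically. */
static void set_maybe_guard_read_locked(void)
{
	pthread_rwlock_rdlock(&mmap_lock_model);
	/* Atomic RMW: concurrent read-locked setters cannot clobber each other,
	 * and non-atomic write-locked updates are excluded by the lock itself. */
	__atomic_fetch_or(&vm_flags_model, 1UL << 11, __ATOMIC_RELAXED);
	pthread_rwlock_unlock(&mmap_lock_model);
}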
We additionally update madvise_guard_install() to ensure that anon_vma_prepare() is set for anonymous VMAs to maintain consistency with the assumption that any anonymous VMA with page tables will have an anon_vma set, and any with an anon_vma unset will not have page tables established.
Could we more obviously say that we did anon_vma_prepare() unconditionally before this patch to trigger the page table copying in fork, but it's not needed anymore because fork now checks also VM_MAYBE_GUARD that we're setting here. Maybe it would be even more obvious to move that vma_needs_copy() hunk from previous patch to this one, but doesn't matter that much.
Also we could mention that this patch alone will prevent merging of VMAs in some situations, but that's addressed next. I don't think it's such a bisect hazard to need reordering or combining changes, just mention perhaps.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Otherwise LGTM.
Reviewed-by: Vlastimil Babka vbabka@suse.cz
On Thu, Nov 06, 2025 at 12:31:29PM +0100, Vlastimil Babka wrote:
On 11/6/25 11:46, Lorenzo Stoakes wrote:
Could we maybe also spell out that we rely on the read mmap/VMA lock to exclude with writers that have write lock and then use non-atomic updates to update completely different flags than VM_MAYBE_GUARD? Those non-atomic updates could cause RMW races when only our side uses an atomic update, but the trick is that the read lock excludes with the write lock.
I thought this was implicit, I guess I can spell that out.
We additionally update madvise_guard_install() to ensure that anon_vma_prepare() is set for anonymous VMAs to maintain consistency with the assumption that any anonymous VMA with page tables will have an anon_vma set, and any with an anon_vma unset will not have page tables established.
Could we more obviously say that we did anon_vma_prepare() unconditionally before this patch to trigger the page table copying in fork, but it's not needed anymore because fork now checks also VM_MAYBE_GUARD that we're setting here. Maybe it would be even more obvious to move that vma_needs_copy() hunk from previous patch to this one, but doesn't matter that much.
I thought that was covered between the comment, the previous patch and this but I can spell it out also.
Also we could mention that this patch alone will prevent merging of VMAs in some situations, but that's addressed next. I don't think it's such a bisect hazard to need reordering or combining changes, just mention perhaps.
A little pedantic but sure :)
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Otherwise LGTM.
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Thanks!
On Thu, Nov 06, 2025 at 10:46:13AM +0000, Lorenzo Stoakes wrote:
This patch adds the ability to atomically set VMA flags with only the mmap read/VMA read lock held.
As this could be hugely problematic for VMA flags in general given that all other accesses are non-atomic and serialised by the mmap/VMA locks, we implement this with a strict allow-list - that is, only designated flags are allowed to do this.
We make VM_MAYBE_GUARD one of these flags, and then set it under the mmap read lock upon guard region installation.
The places where this flag is used currently and matter are:
VMA merge - performed under mmap/VMA write lock, therefore excluding racing writes.
/proc/$pid/smaps - can race the write, however this isn't meaningful as the flag write is performed at the point of the guard region being established, and thus an smaps reader can't reasonably expect to avoid races. Due to atomicity, a reader will observe either the flag being set or not. Therefore consistency will be maintained.
In all other cases the flag being set is irrelevant and atomicity guarantees other flags will be read correctly.
Probably important to write down that the only reason why this doesn't make KCSAN have a small stroke is that we are only changing one bit. i.e we can only have one bit of atomic flags before annotating every reader.
(Source: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kern...)
We additionally update madvise_guard_install() to ensure that anon_vma_prepare() is set for anonymous VMAs to maintain consistency with the assumption that any anonymous VMA with page tables will have an anon_vma set, and any with an anon_vma unset will not have page tables established.
Isn't that what we already had? Or do you mean "*only* set for anonymous VMAs"?
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
With the nits below and above addressed: Reviewed-by: Pedro Falcato pfalcato@suse.de
include/linux/mm.h | 23 +++++++++++++++++++++++ mm/madvise.c | 22 ++++++++++++++-------- 2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2a5516bff75a..2ea65c646212 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp); /* This mask represents all the VMA flag bits used by mlock */ #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT) +/* These flags can be updated atomically via VMA/mmap read lock. */ +#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
/* Arch-specific flags to clear when updating VM flags on protection change */ #ifndef VM_ARCH_CLEAR # define VM_ARCH_CLEAR VM_NONE @@ -860,6 +863,26 @@ static inline void vm_flags_mod(struct vm_area_struct *vma, __vm_flags_mod(vma, set, clear); } +/*
- Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
- valid flags are allowed to do this.
- */
+static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
int bit)+{
- const vm_flags_t mask = BIT(bit);
- /* mmap read lock/VMA read lock must be held. */
- if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
vma_assert_locked(vma);- /* Only specific flags are permitted */
- if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
return;
VM_WARN_ON_ONCE?
On Thu, Nov 06, 2025 at 02:45:06PM +0000, Pedro Falcato wrote:
On Thu, Nov 06, 2025 at 10:46:13AM +0000, Lorenzo Stoakes wrote:
Probably important to write down that the only reason why this doesn't make KCSAN have a small stroke is that we are only changing one bit. i.e we can only have one bit of atomic flags before annotating every reader.
(Source: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kern...)
That seems a bit specific and technical though? I guess since Vlasta is asking for maximum commit message pedantry here the more the merrier...
We additionally update madvise_guard_install() to ensure that anon_vma_prepare() is set for anonymous VMAs to maintain consistency with the assumption that any anonymous VMA with page tables will have an anon_vma set, and any with an anon_vma unset will not have page tables established.
Isn't that what we already had? Or do you mean "*only* set for anonymous VMAs"?
Yes... I'm going to expand on this explanation as per Vlasta to make it extremely clear anyway.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
With the nits below and above addressed: Reviewed-by: Pedro Falcato pfalcato@suse.de
Thanks, though I disagree with nit below.
include/linux/mm.h | 23 +++++++++++++++++++++++ mm/madvise.c | 22 ++++++++++++++-------- 2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2a5516bff75a..2ea65c646212 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp); /* This mask represents all the VMA flag bits used by mlock */ #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
+/* These flags can be updated atomically via VMA/mmap read lock. */ +#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
/* Arch-specific flags to clear when updating VM flags on protection change */ #ifndef VM_ARCH_CLEAR # define VM_ARCH_CLEAR VM_NONE @@ -860,6 +863,26 @@ static inline void vm_flags_mod(struct vm_area_struct *vma, __vm_flags_mod(vma, set, clear); }
+/*
- Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
- valid flags are allowed to do this.
- */
+static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
int bit)+{
- const vm_flags_t mask = BIT(bit);
- /* mmap read lock/VMA read lock must be held. */
- if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
vma_assert_locked(vma);- /* Only specific flags are permitted */
- if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
return;VM_WARN_ON_ONCE?
No, this was on puurpose - I don't want drivers (incl. out of tree) abusing this so I think this should be runtime and explicitly clear. See Suren's comment on last revision of series.
Obviously we should never be giving drivers naked vma pointers where this matters (and actually not sure exactly where it would), my mmap_prepare series is working to mitigate this though it's in a situation where the locking doesn't matter.
Also you can't use VM_WARN_ON_ONCE() that way, for some reason we don't have it return a value, go figure.
-- Pedro
Thanks, Lorenzo
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
* The destination VMA has VM_UFFD_WP set
* The mapping is a PFN or mixed map
* The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also ensure that a VMA merge still succeeds should one VMA have the flag set and another not.
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
As a result of this change, VMAs with guard regions will no longer have their merge behaviour impacted and can be freely merged with other VMAs which do not have VM_MAYBE_GUARD set.
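For example, the merge comparison now reduces to something like the following (a simplified sketch - the *_EXAMPLE masks are illustrative stand-ins, and the real check lives in is_mergeable_vma()):

#include <stdbool.h>

#define VM_SOFTDIRTY_EXAMPLE	0x08000000UL
#define VM_STICKY_EXAMPLE	0x00000800UL
#define VM_IGNORE_MERGE_EXAMPLE	(VM_SOFTDIRTY_EXAMPLE | VM_STICKY_EXAMPLE)

/* Two VMAs may merge if their flags differ only in the ignored bits. */
static bool flags_mergeable(unsigned long a, unsigned long b)
{
	return ((a ^ b) & ~VM_IGNORE_MERGE_EXAMPLE) == 0;
}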
We also update the VMA userland tests to account for the changes.
Note that VM_MAYBE_GUARD being set atomically remains correct as vma_needs_copy() is invoked with the mmap and VMA write locks held, excluding any race with madvise_guard_install().
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 include/linux/mm.h               | 32 ++++++++++++++++++++++++++++++++
 mm/memory.c                      |  3 +--
 mm/vma.c                         | 22 ++++++++++++----------
 tools/testing/vma/vma_internal.h | 32 ++++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ea65c646212..4d80eaf4ef3b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -527,6 +527,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
+/* Flags which should result in page tables being copied on fork. */
+#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
+ * possesses it but the other does not, the merged VMA should nonetheless have
+ * applied to it:
+ *
+ * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
+ *                   metadata which should be unconditionally propagated upon
+ *                   fork. When merging two VMAs, we encapsulate this range in
+ *                   the merged VMA, so the flag should be 'sticky' as a result.
+ */
+#define VM_STICKY VM_COPY_ON_FORK
+
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
+ *
+ * VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
+ *             which should propagate to all VMAs, but the other does not,
+ *             the merge should still proceed with the merge logic applying
+ *             sticky flags to the final VMA.
+ */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff --git a/mm/memory.c b/mm/memory.c
index 334732ab6733..7582a88f5332 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1480,8 +1480,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;
-	/* Guard regions have momdified page tables that require copying. */
-	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+	if (src_vma->vm_flags & VM_COPY_ON_FORK)
 		return true;
/* diff --git a/mm/vma.c b/mm/vma.c index 0c5e391fe2e2..6cb082bc5e29 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
if (!mpol_equal(vmg->policy, vma_policy(vma))) return false; - /* - * VM_SOFTDIRTY should not prevent from VMA merging, if we - * match the flags but dirty bit -- the caller should mark - * merged VMA as dirty. If dirty bit won't be excluded from - * comparison, we increase pressure on the memory system forcing - * the kernel to generate new VMAs when old one could be - * extended instead. - */ - if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY) + if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE) return false; if (vma->vm_file != vmg->file) return false; @@ -808,6 +800,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) { + vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY; struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *next; @@ -900,11 +893,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next; + sticky_flags |= (next->vm_flags & VM_STICKY); }
if (merge_left) { vma_start_write(prev); vmg->target = prev; + sticky_flags |= (prev->vm_flags & VM_STICKY); }
if (merge_both) { @@ -974,6 +969,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;
+ vm_flags_set(vmg->target, sticky_flags); khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target; @@ -1124,6 +1120,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next; + vm_flags_t sticky_flags; + + sticky_flags = vmg->vm_flags & VM_STICKY; + sticky_flags |= target->vm_flags & VM_STICKY;
VM_WARN_ON_VMG(!target, vmg);
@@ -1133,6 +1133,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;
+ sticky_flags |= next->vm_flags & VM_STICKY; remove_next = true; /* This should already have been checked by this point. */ VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg); @@ -1159,6 +1160,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;
+ vm_flags_set(target, sticky_flags); return 0;
nomem: @@ -1902,7 +1904,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file && - !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) && + !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index ddf58a5e1add..984307a64ee9 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -119,6 +119,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD + +/* + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA + * possesses it but the other does not, the merged VMA should nonetheless have + * applied to it: + * + * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains + * metadata which should be unconditionally propagated upon + * fork. When merging two VMAs, we encapsulate this range in + * the merged VMA, so the flag should be 'sticky' as a result. + */ +#define VM_STICKY VM_COPY_ON_FORK + +/* + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one + * of these flags and the other not does not preclude a merge. + * + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but + * dirty bit -- the caller should mark merged VMA as dirty. If + * dirty bit won't be excluded from comparison, we increase + * pressure on the memory system forcing the kernel to generate + * new VMAs when old one could be extended instead. + * + * VM_STICKY - If one VMA has flags which must be 'sticky', that is ones + * which should propagate to all VMAs, but the other does not, + * the merge should still proceed with the merge logic applying + * sticky flags to the final VMA. + */ +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY) + #define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL
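As a userspace illustration of the semantics this preserves, the following standalone demo (a sketch only - it assumes a kernel with MADV_GUARD_INSTALL and defines the constant locally in case the libc headers predate it) shows a guard page still faulting in the child after fork(). The observable behaviour is the same before and after this series; the patch changes how the kernel decides to copy the page tables, not the outcome.

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102	/* uapi value; assumed if headers lack it */
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status;
	pid_t pid;

	if (p == MAP_FAILED || madvise(p + page, page, MADV_GUARD_INSTALL)) {
		perror("setup");
		return 1;
	}

	pid = fork();
	if (pid == 0) {
		/* The guard page should still be present in the child... */
		volatile char c = p[page];	/* expected to SIGSEGV */

		(void)c;
		_exit(0);			/* only reached if it is not */
	}

	waitpid(pid, &status, 0);
	printf("child %s\n",
	       WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV ?
	       "faulted on the guard page (propagated across fork)" :
	       "did NOT fault on the guard page");
	return 0;
}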
On 11/6/25 11:46, Lorenzo Stoakes wrote:
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
- The destination VMA has VM_UFFD_WP set
- The mapping is a PFN or mixed map
- The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.
We also update the VMA userland tests to account for the changes.
Note that VM_MAYBE_GUARD being set atomically remains correct as vma_needs_copy() is invoked with the mmap and VMA write locks held, excluding any race with madvise_guard_install().
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++ mm/memory.c | 3 +-- mm/vma.c | 22 ++++++++++++---------- tools/testing/vma/vma_internal.h | 32 ++++++++++++++++++++++++++++++++ 4 files changed, 77 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ea65c646212..4d80eaf4ef3b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -527,6 +527,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR) +/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
+ * possesses it but the other does not, the merged VMA should nonetheless have
+ * applied to it:
+ *
+ * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
+ *                   metadata which should be unconditionally propagated upon
+ *                   fork. When merging two VMAs, we encapsulate this range in
+ *                   the merged VMA, so the flag should be 'sticky' as a result.
+ */
+#define VM_STICKY VM_COPY_ON_FORK
TBH I don't see why there should be always an implication that copying on fork implies stickiness in merging. Yeah, VM_MAYBE_GUARD is both, but in general, is there any underlying property that makes this a rule?
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
So I wonder if VM_SOFTDIRTY should actually also be sticky and not just VM_IGNORE_MERGE. The way I understand the flag suggests it should. Right now AFAICS it's rather undefined whether the result of a vma merge has the flag - it depends on which of the two VMAs stays and which is removed by the merge. "the caller should mark merged VMA as dirty" in the comment you're moving here seems not really happening, or I'm missing it. __mmap_complete() and do_brk_flags() do it, so any new areas are marked, but on pure merge of two vma's due to e.g. mprotect() this is really nondeterministic? AFAICT the sticky flag behavior would work perfectly for VM_SOFTDIRTY.
+ *
+ * VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
+ *             which should propagate to all VMAs, but the other does not,
+ *             the merge should still proceed with the merge logic applying
+ *             sticky flags to the final VMA.
+ */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
On Thu, Nov 06, 2025 at 02:46:38PM +0100, Vlastimil Babka wrote:
On 11/6/25 11:46, Lorenzo Stoakes wrote:
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
- The destination VMA has VM_UFFD_WP set
- The mapping is a PFN or mixed map
- The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.
We also update the VMA userland tests to account for the changes.
Note that VM_MAYBE_GUARD being set atomically remains correct as vma_needs_copy() is invoked with the mmap and VMA write locks held, excluding any race with madvise_guard_install().
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++ mm/memory.c | 3 +-- mm/vma.c | 22 ++++++++++++---------- tools/testing/vma/vma_internal.h | 32 ++++++++++++++++++++++++++++++++ 4 files changed, 77 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ea65c646212..4d80eaf4ef3b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -527,6 +527,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
+ * possesses it but the other does not, the merged VMA should nonetheless have
+ * applied to it:
+ *
+ * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
+ *                   metadata which should be unconditionally propagated upon
+ *                   fork. When merging two VMAs, we encapsulate this range in
+ *                   the merged VMA, so the flag should be 'sticky' as a result.
+ */
+#define VM_STICKY VM_COPY_ON_FORK
TBH I don't see why there should be always an implication that copying on fork implies stickiness in merging. Yeah, VM_MAYBE_GUARD is both, but in general, is there any underlying property that makes this a rule?
Why do you copy on fork? It's because the page tables contain data that won't be reconstructed on fault.
If that is the case, that applies to any VMA which is merged, and also - since you can't be sure precisely which page tables contain the data we need to propagate - on split too.
This is why copy on fork implies sticky IMO.
I can update the commit message to make this clear if this makes sense?
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
Note that I'm literally just moving the comment from is_mergeable_vma():
-	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
-	 * match the flags but dirty bit -- the caller should mark
-	 * merged VMA as dirty. If dirty bit won't be excluded from
-	 * comparison, we increase pressure on the memory system forcing
-	 * the kernel to generate new VMAs when old one could be
-	 * extended instead.
(OK I see you realised that below :P)
So I wonder if VM_SOFTDIRTY should be actually also sticky and not just VM_IGNORE_MERGE. The way I understand the flag suggests it should. Right now AFAICS its rather undefined if the result of vma merge has the flag - depending on which of the two VMA's stays and which is removed by the merge. "the caller should mark merged VMA as dirty" in the comment you're moving here seems not really happening or I'm missing it. __mmap_complete()
No it's not happening, but I can't be blamed for existing incorrect comments :)
and do_brk_flags() do it, so any new areas are marked, but on pure merge of two vma's due to e.g. mprotect() this is really nondeterministic? AFAICT the sticky flag behavior would work perfectly for VM_SOFTDIRTY.
Maybe we inadvertently changed this somehow or maybe it was just wrong, but we're not doing this on merge in general afaict.
I think you're right that we should make this sticky, but I'd rather deal with that in a follow-up series/patch as this is out of scope here.
Equally so I'd rather fix the comment in a follow up too for the same reason.
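For reference, the follow-up being discussed here would presumably be as small as the following sketch (not part of this series, just the possible shape under the assumption that VM_SOFTDIRTY joins the sticky set):

/* Possible follow-up only: make soft-dirty sticky too, so a merged VMA
 * deterministically keeps the flag if either side had it.
 */
#define VM_STICKY	(VM_COPY_ON_FORK | VM_SOFTDIRTY)
#define VM_IGNORE_MERGE	VM_STICKY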
On 11/6/25 15:18, Lorenzo Stoakes wrote:
On Thu, Nov 06, 2025 at 02:46:38PM +0100, Vlastimil Babka wrote:
On 11/6/25 11:46, Lorenzo Stoakes wrote:
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ea65c646212..4d80eaf4ef3b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -527,6 +527,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
+ * possesses it but the other does not, the merged VMA should nonetheless have
+ * applied to it:
+ *
+ * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
+ *                   metadata which should be unconditionally propagated upon
+ *                   fork. When merging two VMAs, we encapsulate this range in
+ *                   the merged VMA, so the flag should be 'sticky' as a result.
+ */
+#define VM_STICKY VM_COPY_ON_FORK
TBH I don't see why there should be always an implication that copying on fork implies stickiness in merging. Yeah, VM_MAYBE_GUARD is both, but in general, is there any underlying property that makes this a rule?
Why do you copy on fork? It's because the page tables contain data that won't be reconstructed on fault.
If that is the case, that applies to any VMA which is merged, and also - since you can't be sure precisely which page tables contain the data we need to propagate - on split too.
This is why copy on fork implies sticky IMO.
Hmm I guess that makes some sense.
I can update the commit message to make this clear if this makes sense?
It would help, thanks. Let's see if future will surprise us with some flag where this won't be true :)
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.

Note that I'm literally just moving the comment from is_mergeable_vma():

	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
	 * match the flags but dirty bit -- the caller should mark
	 * merged VMA as dirty. If dirty bit won't be excluded from
	 * comparison, we increase pressure on the memory system forcing
	 * the kernel to generate new VMAs when old one could be
	 * extended instead.

(OK I see you realised that below :P)
So I wonder if VM_SOFTDIRTY should be actually also sticky and not just VM_IGNORE_MERGE. The way I understand the flag suggests it should. Right now AFAICS its rather undefined if the result of vma merge has the flag - depending on which of the two VMA's stays and which is removed by the merge. "the caller should mark merged VMA as dirty" in the comment you're moving here seems not really happening or I'm missing it. __mmap_complete()
No it's not happening, but I can't be blamed for existing incorrect comments :)
and do_brk_flags() do it, so any new areas are marked, but on pure merge of two vma's due to e.g. mprotect() this is really nondeterministic? AFAICT the sticky flag behavior would work perfectly for VM_SOFTDIRTY.
Maybe we inadvertently changed this somehow or maybe it was just wrong, but we're not doing this on merge in general afaict.
Yeah, wouldn't surprise me if we subtly changed it during some refactoring and it's not causing obvious enough issues to be noticed easily.
I think you're right that we should make this sticky, but I'd rather deal with that in a follow-up series/patch as this is out of scope here.
Equally so I'd rather fix the comment in a follow up too for the same reason.
Sure it's just something I noticed and seems like a good fit for the new concept.
On Thu, Nov 06, 2025 at 10:46:14AM +0000, Lorenzo Stoakes wrote:
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
- The destination VMA has VM_UFFD_WP set
- The mapping is a PFN or mixed map
- The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.
Perhaps we should separate this patch into two? It looks like we're doing two things at once for no great reason. But it's a bit of a sticky situation...
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.
We also update the VMA userland tests to account for the changes.
Note that VM_MAYBE_GUARD being set atomically remains correct as vma_needs_copy() is invoked with the mmap and VMA write locks held, excluding any race with madvise_guard_install().
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Overall the patch LGTM.
Feel free to add: Reviewed-by: Pedro Falcato pfalcato@suse.de
and maybe print it out on a sticker.
Modify the existing 'merge new' and 'merge existing' userland VMA tests to assert that sticky VMA flags behave as expected.
We do so by generating every possible permutation of the VMAs being manipulated being sticky/not sticky and asserting that VMA flags with this property are retained upon merge.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++++++++++----- 1 file changed, 79 insertions(+), 10 deletions(-)
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
+#define IS_SET(_val, _flags) ((_val & _flags) == _flags) + static struct task_struct __current;
struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }
-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;
+ if (is_sticky) + vm_flags |= VM_STICKY; + /* * 0123456789abc * AA B CC */ vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); ASSERT_NE(vma_a, NULL); + if (a_is_sticky) + vm_flags_set(vma_a, VM_STICKY); /* We give each VMA a single avc so we can test anon_vma duplication. */ INIT_LIST_HEAD(&vma_a->anon_vma_chain); list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);
vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma_b, NULL); + if (b_is_sticky) + vm_flags_set(vma_b, VM_STICKY); INIT_LIST_HEAD(&vma_b->anon_vma_chain); list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);
vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags); ASSERT_NE(vma_c, NULL); + if (c_is_sticky) + vm_flags_set(vma_c, VM_STICKY); INIT_LIST_HEAD(&vma_c->anon_vma_chain); list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky || a_is_sticky || b_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge to PREVIOUS VMA. @@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky || a_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge to NEXT VMA. @@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky) /* D uses is_sticky. */ + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge BOTH sides. @@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (is_sticky || a_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge to NEXT VMA. @@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (is_sticky || c_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge BOTH sides. @@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1); + if (is_sticky || a_is_sticky || c_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Final state. @@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }
+static bool test_merge_new(void) +{ + int i, j, k, l; + + /* Generate every possible permutation of sticky flags. */ + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for (k = 0; k < 2; k++) + for (l = 0; l < 2; l++) + ASSERT_TRUE(__test_merge_new(i, j, k, l)); + + return true; +} + static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }
-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t prev_flags = vm_flags; + vm_flags_t next_flags = vm_flags; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next; @@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};
+ if (prev_is_sticky) + prev_flags |= VM_STICKY; + if (middle_is_sticky) + vm_flags |= VM_STICKY; + if (next_is_sticky) + next_flags |= VM_STICKY; + /* * Merge right case - partial span. * @@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.middle = vma; @@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2); + if (middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma); vmg.middle = vma; @@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1); + if (middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ @@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (prev_is_sticky || middle_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); @@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1); + if (prev_is_sticky || middle_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; @@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1); + if (prev_is_sticky || middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
/* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */
- vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);
vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags); vmg.prev = vma; @@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }
+static bool test_merge_existing(void) +{ + int i, j, k; + + /* Generate every possible permutation of sticky flags. */ + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for (k = 0; k < 2; k++) + ASSERT_TRUE(__test_merge_existing(i, j, k)); + + return true; +} + static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).
Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.
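For readers without the selftest harness, the merge-sanity probe described above can be approximated with plain /proc/self/maps parsing - a rough standalone sketch using an anonymous mapping (which should always merge back; the real test instead maps the fixture's file-backed fd):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 10 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char line[256];
	int merged = 0;
	FILE *f;

	if (p == MAP_FAILED)
		return 1;
	/* Punch a hole in the middle, then map the hole again. */
	munmap(p + 5 * page, page);
	if (mmap(p + 5 * page, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		return 1;

	/* A sane setup shows the original 10-page span as one region again. */
	f = fopen("/proc/self/maps", "r");
	while (f && fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
		    start == (unsigned long)p && end == start + 10 * page)
			merged = 1;
	}
	if (f)
		fclose(f);
	printf("merged back into a single VMA: %s\n", merged ? "yes" : "no");
	return 0;
}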
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 126 insertions(+)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 8dd81c0a4a5a..a9be11e03a6a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self, case ANON_BACKED: flags |= MAP_PRIVATE | MAP_ANON; fd = -1; + offset = 0; break; case SHMEM_BACKED: case LOCAL_FILE_BACKED: @@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr) return true; }
+/* + * Some file systems have issues with merging due to changing merge-sensitive + * parameters in the .mmap callback, and prior to .mmap_prepare being + * implemented everywhere this will now result in an unexpected failure to + * merge (e.g. - overlayfs). + * + * Perform a simple test to see if the local file system suffers from this, if + * it does then we can skip test logic that assumes local file system merging is + * sane. + */ +static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self, + const FIXTURE_VARIANT(guard_regions) * variant) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr2; + struct procmap_fd procmap; + + if (variant->backing != LOCAL_FILE_BACKED) + return true; + + /* Map 10 pages. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + if (ptr == MAP_FAILED) + return false; + /* Unmap the middle. */ + munmap(&ptr[5 * page_size], page_size); + + /* Map again. */ + ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED, 5 * page_size); + + if (ptr2 == MAP_FAILED) + return false; + + /* Now make sure they all merged. */ + if (open_self_procmap(&procmap) != 0) + return false; + if (!find_vma_procmap(&procmap, ptr)) + return false; + if (procmap.query.vma_start != (unsigned long)ptr) + return false; + if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size) + return false; + close_procmap(&procmap); + + return true; +} + FIXTURE_SETUP(guard_regions) { self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); @@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); }
+TEST_F(guard_regions, smaps) +{ + const unsigned long page_size = self->page_size; + struct procmap_fd procmap; + char *ptr, *ptr2; + int i; + + /* Map a region. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* We shouldn't yet see a guard flag. */ + ASSERT_FALSE(check_vmflag_guard(ptr)); + + /* Install a single guard region. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0); + + /* Now we should see a guard flag. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + + /* + * Removing the guard region should not change things because we simply + * cannot accurately track whether a given VMA has had all of its guard + * regions removed. + */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0); + ASSERT_TRUE(check_vmflag_guard(ptr)); + + /* Install guard regions throughout. */ + for (i = 0; i < 10; i++) { + ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0); + /* We should always see the guard region flag. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + } + + /* Split into two VMAs. */ + ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0); + + /* Both VMAs should have the guard flag set. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size])); + + /* + * If the local file system is unable to merge VMAs due to having + * unusual characteristics, there is no point in asserting merge + * behaviour. + */ + if (!local_fs_has_sane_mmap(self, variant)) { + TH_LOG("local filesystem does not support sane merging skipping merge test"); + return; + } + + /* Map a fresh VMA between the two split VMAs. */ + ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Check the procmap to ensure that this VMA merged with the adjacent + * two. The guard region flag is 'sticky' so should not preclude + * merging. + */ + ASSERT_EQ(open_self_procmap(&procmap), 0); + ASSERT_TRUE(find_vma_procmap(&procmap, ptr)); + ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size); + ASSERT_EQ(close_procmap(&procmap), 0); + /* And, of course, this VMA should have the guard flag set. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index e33cda301dad..605cb58ea5c3 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr) return check_vmflag(addr, "pf"); }
+bool check_vmflag_guard(void *addr) +{ + return check_vmflag(addr, "gu"); +} + bool softdirty_supported(void) { char *addr; diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 26c30fdc0241..a8abdf414d46 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); bool check_vmflag_pfnmap(void *addr); +bool check_vmflag_guard(void *addr); int open_procmap(pid_t pid, struct procmap_fd *procmap_out); int query_procmap(struct procmap_fd *procmap); bool find_vma_procmap(struct procmap_fd *procmap, void *address);
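Outside the harness, the new smaps visibility can also be checked by hand - a minimal sketch that installs a guard page and prints the VmFlags line of the containing VMA from /proc/self/smaps, looking for the new "gu" mnemonic (MADV_GUARD_INSTALL is defined locally in case the libc headers predate it; on a kernel without this series the flag simply will not appear):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102	/* uapi value; assumed if headers lack it */
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char line[512];
	int in_vma = 0;
	FILE *f;

	if (p == MAP_FAILED || madvise(p, page, MADV_GUARD_INSTALL)) {
		perror("setup");
		return 1;
	}

	f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		/* Track which smaps entry we are currently inside. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			in_vma = start <= (unsigned long)p &&
				 (unsigned long)p < end;
		else if (in_vma && !strncmp(line, "VmFlags:", 8))
			printf("guard flag %s: %s",
			       strstr(line, " gu") ? "present" : "absent", line);
	}
	if (f)
		fclose(f);
	return 0;
}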