[PATCH 0/3] introduce VM_MAYBE_GUARD and make it sticky

List overview All Threads
Download

newer

older

/Re,

[bpf-next] selftests/bpf: refactor...

Lorenzo Stoakes

29 Oct 2025 29 Oct '25

4:50 p.m.

Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level.

This makes the feature less useful, as it isn't entirely apparent which VMAs may have these entries present, especially when performing actions which walk through memory regions such as those performed by CRIU.

This series addresses this issue by introducing the VM_MAYBE_GUARD flag which fulfils this role, updating the smaps logic to display an entry for these.

The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present.

It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason.

To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which:

a. if set in one VMA and not in another still permit those VMAs to be merged (if otherwise compatible).

b. When they are merged, the resultant VMA must have the flag set.

The VMA logic is updated to propagate these flags correctly.

Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously these established an anon_vma object for file-backed mappings solely to have vma_needs_copy() correctly propagate guard region mappings to child processes.

We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue.

Finally we introduce extensive VMA userland tests to assert that the sticky VMA logic behaves correctly as well as guard region self tests to assert that smaps visibility is correctly implemented.

Lorenzo Stoakes (3): mm: introduce VM_MAYBE_GUARD and make visible for guard regions mm: implement sticky, copy on fork VMA flags selftests/mm/guard-regions: add smaps visibility test

-- 2.51.0

Show replies by date

Lorenzo Stoakes

29 Oct 29 Oct

4:50 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.

For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by settings vma->anon_vma to force page table copying.

The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping + gu maybe contains guard regions (if not set, definitely doesn't) == =======================================

Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index fc35a0543f01..db16ed91c269 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf", + [ilog2(VM_MAYBE_GUARD)] = "gu", [ilog2(VM_LOCKED)] = "lo", [ilog2(VM_IO)] = "io", [ilog2(VM_SEQ_READ)] = "sr", diff --git a/include/linux/mm.h b/include/linux/mm.h index aada935c4950..f963afa1b9de 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD 0x00000800 /* The VMA maybe contains guard regions. */ #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */

#define VM_LOCKED 0x00002000 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index aa441f593e9a..a6e5a44c9b42 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3) {VM_UFFD_MISSING, "uffd_missing" }, \ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \ + {VM_MAYBE_GUARD, "maybe_guard" }, \ {VM_UFFD_WP, "uffd_wp" }, \ {VM_LOCKED, "locked" }, \ {VM_IO, "io" }, \ diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..216ae6ed344e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) return -EINVAL;

/* - * If we install guard markers, then the range is no longer - * empty from a page table perspective and therefore it's - * appropriate to have an anon_vma. + * It would be confusing for anonymous mappings to have page table + * entries but no anon_vma established, so ensure that it is. + */ + if (vma_is_anonymous(vma)) + anon_vma_prepare(vma); + + /* + * Indicate that the VMA may contain guard regions, making it visible to + * the user that a VMA may contain these, narrowing down the range which + * must be scanned in order to detect them. * - * This ensures that on fork, we copy page tables correctly. + * This additionally causes page tables to be copied on fork regardless + * of whether the VMA is anonymous or not, correctly preserving the + * guard region page table entries. */ - err = anon_vma_prepare(vma); - if (err) - return err; + vm_flags_set(vma, VM_MAYBE_GUARD);

/* * Optimistically try to install the guard marker pages first. If any @@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: - case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: return MADVISE_MMAP_READ_LOCK; case MADV_DONTNEED: diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;

+ /* Guard regions have momdified page tables that require copying. */ + if (src_vma->vm_flags & VM_MAYBE_GUARD) + return true; + /* * Don't copy ptes where a page fault will fill them correctly. Fork * becomes much lighter when there are big shared or private readonly diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index d873667704e8..e40c93edc5a7 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr; #define VM_MAYEXEC 0x00000040 #define VM_GROWSDOWN 0x00000100 #define VM_PFNMAP 0x00000400 +#define VM_MAYBE_GUARD 0x00000800 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 #define VM_SEQ_READ 0x00008000 /* App will access data sequentially */

-- 2.51.0

Randy Dunlap

7:50 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:

...

diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;

/* Guard regions have momdified page tables that require copying. */

modified

...

if (src_vma->vm_flags & VM_MAYBE_GUARD)
return true;
/*

Don't copy ptes where a page fault will fill them correctly. Fork

becomes much lighter when there are big shared or private readonly

-- ~Randy

Lorenzo Stoakes

30 Oct 30 Oct

8:13 a.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Wed, Oct 29, 2025 at 12:50:18PM -0700, Randy Dunlap wrote:

...

On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:

...
diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;

/* Guard regions have momdified page tables that require copying. */
                     modified

I appear to have entered some kind of typo nirvana in this series it seems :)

Thanks, will fix!

...

...
if (src_vma->vm_flags & VM_MAYBE_GUARD)
return true;
/*

Don't copy ptes where a page fault will fill them correctly. Fork

becomes much lighter when there are big shared or private readonly
-- ~Randy

Cheers, Lorenzo

Suren Baghdasaryan

1:05 a.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:

...

Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would better represent the meaning.

...

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

s/bee/been but I'm not even sure the above paragraph has to be included in the changelog. It's a technical detail IMHO.

...

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

I guess this is because it is modifying vm_flags now?

...

We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.

For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by settings vma->anon_vma to force page table copying.

The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com

Overall, makes sense to me and I think we could use it.

Reviewed-by: Suren Baghdasaryan surenb@google.com

It would be nice to have a way for userspace to reset this flag if it confirms that the VMA does not really have any guards (using say PAGEMAP_SCAN) but I think such an API can be abused.

...

Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 1 + include/trace/events/mmflags.h | 1 + mm/madvise.c | 22 ++++++++++++++-------- mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 1 + 7 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping

gu maybe contains guard regions (if not set, definitely doesn't) == =======================================

Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index fc35a0543f01..db16ed91c269 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf",
          [ilog2(VM_MAYBE_GUARD)] = "gu",
          [ilog2(VM_LOCKED)]      = "lo",
          [ilog2(VM_IO)]          = "io",
          [ilog2(VM_SEQ_READ)]    = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h index aada935c4950..f963afa1b9de 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD 0x00000800 /* The VMA maybe contains guard regions. */ #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */

#define VM_LOCKED 0x00002000 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index aa441f593e9a..a6e5a44c9b42 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3) {VM_UFFD_MISSING, "uffd_missing" }, \ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \
  {VM_MAYBE_GUARD,                "maybe_guard"   },              \
  {VM_UFFD_WP,                    "uffd_wp"       },              \
  {VM_LOCKED,                     "locked"        },              \
  {VM_IO,                         "io"            },              \
diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..216ae6ed344e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) return -EINVAL;
    /*
   * If we install guard markers, then the range is no longer
   * empty from a page table perspective and therefore it's
   * appropriate to have an anon_vma.
   * It would be confusing for anonymous mappings to have page table
   * entries but no anon_vma established, so ensure that it is.
   */
  if (vma_is_anonymous(vma))
          anon_vma_prepare(vma);
  /*
   * Indicate that the VMA may contain guard regions, making it visible to
   * the user that a VMA may contain these, narrowing down the range which
   * must be scanned in order to detect them.
   *
   * This ensures that on fork, we copy page tables correctly.
   * This additionally causes page tables to be copied on fork regardless
   * of whether the VMA is anonymous or not, correctly preserving the
   * guard region page table entries.
   */
  err = anon_vma_prepare(vma);
  if (err)
          return err;
  vm_flags_set(vma, VM_MAYBE_GUARD);

  /*
   * Optimistically try to install the guard marker pages first. If any
@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE:
  case MADV_GUARD_INSTALL:
  case MADV_GUARD_REMOVE:
          return MADVISE_MMAP_READ_LOCK;
  case MADV_DONTNEED:
diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
  /* Guard regions have momdified page tables that require copying. */
  if (src_vma->vm_flags & VM_MAYBE_GUARD)
          return true;
  /*
   * Don't copy ptes where a page fault will fill them correctly.  Fork
   * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index d873667704e8..e40c93edc5a7 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr; #define VM_MAYEXEC 0x00000040 #define VM_GROWSDOWN 0x00000100 #define VM_PFNMAP 0x00000400 +#define VM_MAYBE_GUARD 0x00000800 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000

#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */

2.51.0

Lorenzo Stoakes

8:22 a.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Wed, Oct 29, 2025 at 06:05:54PM -0700, Suren Baghdasaryan wrote:

...

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would better represent the meaning.

We all suck at naming :) it's the hardest bit! :P

Hm I don't love that, bit overwrought, I do think 'maybe guard' is a better shorthand for this flag name.

I am open to other suggestions but I think the original wins on succinctness here!

...

...
We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

s/bee/been

Yeah this series appears to be a bonanza of typos... not sure why :) will fix.

...

but I'm not even sure the above paragraph has to be included in the changelog. It's a technical detail IMHO.

Well, I think it's actually important to highlight that we have a VMA flag free and why. I know it's bordering on extraneous, but I don't think there's any harm in mentioning it.

Otherwise people might wonder 'oh is this flag used elsewhere somehow' etc.

...

...
The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

I guess this is because it is modifying vm_flags now?

Yes

...

...
We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.

For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by settings vma->anon_vma to force page table copying.

The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com

Overall, makes sense to me and I think we could use it.

Reviewed-by: Suren Baghdasaryan surenb@google.com

Thanks!

...

It would be nice to have a way for userspace to reset this flag if it confirms that the VMA does not really have any guards (using say PAGEMAP_SCAN) but I think such an API can be abused.

Yeah, I'd rather not for that reason.

...

...
Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 1 + include/trace/events/mmflags.h | 1 + mm/madvise.c | 22 ++++++++++++++-------- mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 1 + 7 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping

gu maybe contains guard regions (if not set, definitely doesn't) == =======================================

Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index fc35a0543f01..db16ed91c269 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf",
          [ilog2(VM_MAYBE_GUARD)] = "gu",
          [ilog2(VM_LOCKED)]      = "lo",
          [ilog2(VM_IO)]          = "io",
          [ilog2(VM_SEQ_READ)]    = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h index aada935c4950..f963afa1b9de 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD 0x00000800 /* The VMA maybe contains guard regions. */ #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */

#define VM_LOCKED 0x00002000 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index aa441f593e9a..a6e5a44c9b42 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3) {VM_UFFD_MISSING, "uffd_missing" }, \ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \
  {VM_MAYBE_GUARD,                "maybe_guard"   },              \
  {VM_UFFD_WP,                    "uffd_wp"       },              \
  {VM_LOCKED,                     "locked"        },              \
  {VM_IO,                         "io"            },              \
diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..216ae6ed344e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) return -EINVAL;
    /*
   * If we install guard markers, then the range is no longer
   * empty from a page table perspective and therefore it's
   * appropriate to have an anon_vma.
   * It would be confusing for anonymous mappings to have page table
   * entries but no anon_vma established, so ensure that it is.
   */
  if (vma_is_anonymous(vma))
          anon_vma_prepare(vma);
  /*
   * Indicate that the VMA may contain guard regions, making it visible to
   * the user that a VMA may contain these, narrowing down the range which
   * must be scanned in order to detect them.
   *
   * This ensures that on fork, we copy page tables correctly.
   * This additionally causes page tables to be copied on fork regardless
   * of whether the VMA is anonymous or not, correctly preserving the
   * guard region page table entries.
   */
  err = anon_vma_prepare(vma);
  if (err)
          return err;
  vm_flags_set(vma, VM_MAYBE_GUARD);

  /*
   * Optimistically try to install the guard marker pages first. If any
@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE:
  case MADV_GUARD_INSTALL:
  case MADV_GUARD_REMOVE:
          return MADVISE_MMAP_READ_LOCK;
  case MADV_DONTNEED:
diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
  /* Guard regions have momdified page tables that require copying. */
  if (src_vma->vm_flags & VM_MAYBE_GUARD)
          return true;
  /*
   * Don't copy ptes where a page fault will fill them correctly.  Fork
   * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index d873667704e8..e40c93edc5a7 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr; #define VM_MAYEXEC 0x00000040 #define VM_GROWSDOWN 0x00000100 #define VM_PFNMAP 0x00000400 +#define VM_MAYBE_GUARD 0x00000800 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000

#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */

2.51.0

Pedro Falcato

4:16 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...

Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.

-- Pedro

Lorenzo Stoakes

4:23 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...

On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

I also don't think it's at all necessary, see below.

...

define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in question, and only over the span of time you are setting the gaurd region.

When allocating thread stacks you'll be mapping things into memory which... take the write lock. malloc() if it goes to the kernel will also take the write lock.

So I think you're overly worried about an operation that a. isn't going to be something that happens all that often, b. when it does, it's at a time when you'd be taking write locks anyway and c. won't contend important stuff like page faults for any VMA other than the one having the the guard region installed.

...

-- Pedro

Thanks, Lorenzo

Pedro Falcato

4:31 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...

On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

...

I also don't think it's at all necessary, see below.

...
define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in question, and only over the span of time you are setting the gaurd region.

Don't we always need to take the mmap write lock when grabbing a VMA write lock as well?

...

When allocating thread stacks you'll be mapping things into memory which... take the write lock. malloc() if it goes to the kernel will also take the write lock.

But yes, good point, you're already serializing anyway. I don't think this is a big deal.

...

So I think you're overly worried about an operation that a. isn't going to be something that happens all that often, b. when it does, it's at a time when you'd be taking write locks anyway and c. won't contend important stuff like page faults for any VMA other than the one having the the guard region installed.

Yep, thanks.

-- Pedro

Lorenzo Stoakes

4:43 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...

On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

...

...
I also don't think it's at all necessary, see below.

...
define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in question, and only over the span of time you are setting the gaurd region.

Don't we always need to take the mmap write lock when grabbing a VMA write lock as well?

Yup. But at the same time you're doing the kind of operations that'd use this you'd already be taking the lock anyway.

You don't hold it for long and you won't be doing this any more often than you'd be doing other write operations, which you're also not going to be holding up faults on other VMAs either (they can access other VMAs despite mmap write lock being held), so I don't think there's ay issue here.

...

...
When allocating thread stacks you'll be mapping things into memory which... take the write lock. malloc() if it goes to the kernel will also take the write lock.

But yes, good point, you're already serializing anyway. I don't think this is a big deal.

Indeed

...

...
So I think you're overly worried about an operation that a. isn't going to be something that happens all that often, b. when it does, it's at a time when you'd be taking write locks anyway and c. won't contend important stuff like page faults for any VMA other than the one having the the guard region installed.

Yep, thanks.

No problemo, you can get yourself into sticky situations with lock contention but I think this is not one! :)

...

-- Pedro

Cheers, Lorenzo

Vlastimil Babka

6:31 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On 10/30/25 17:43, Lorenzo Stoakes wrote:

...

On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

...

...
...
I also don't think it's at all necessary, see below.

...
define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in question, and only over the span of time you are setting the gaurd region.

Don't we always need to take the mmap write lock when grabbing a VMA write lock as well?

Yup. But at the same time you're doing the kind of operations that'd use this you'd already be taking the lock anyway.

You don't hold it for long and you won't be doing this any more often than you'd be doing other write operations, which you're also not going to be holding up faults on other VMAs either (they can access other VMAs despite mmap write lock being held), so I don't think there's ay issue here.

...
...
When allocating thread stacks you'll be mapping things into memory which... take the write lock. malloc() if it goes to the kernel will also take the write lock.

But yes, good point, you're already serializing anyway. I don't think this is a big deal.

Indeed

Besides thread stacks, I'm thinking of the userspace allocators usecase (jemalloc etc) where guard regions were supposed to allow a cheap use-after-free protection by replacing their existing MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL. Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard regions moves from (worse but still reasonable) MADVISE_MMAP_READ_LOCK to the heavy MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy price for that usecase :(

...

...
...
So I think you're overly worried about an operation that a. isn't going to be something that happens all that often, b. when it does, it's at a time when you'd be taking write locks anyway and c. won't contend important stuff like page faults for any VMA other than the one having the the guard region installed.

Yep, thanks.

No problemo, you can get yourself into sticky situations with lock contention but I think this is not one! :)

...
-- Pedro

Cheers, Lorenzo

Vlastimil Babka

6:47 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On 10/30/25 19:31, Vlastimil Babka wrote:

...

On 10/30/25 17:43, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for write on the old mm, so if we kept taking mmap sem for read, then vma lock for read in addition (which should be cheap enough, also we'd only need it in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps that would cover all non-bening races?

Lorenzo Stoakes

7:47 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:

...

On 10/30/25 19:31, Vlastimil Babka wrote:

...
On 10/30/25 17:43, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote: > Currently, if a user needs to determine if guard regions are present in a > range, they have to scan all VMAs (or have knowledge of which ones might > have guard regions). > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to > pagemap") and the related commit a516403787e0 ("fs/proc: extend the > PAGEMAP_SCAN ioctl to report guard regions"), users can use either > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this > operation at a virtual address level. > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level > that guard regions exist in ranges. > > This patch remedies the situation by establishing a new VMA flag, > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is > uncertain because we cannot reasonably determine whether a > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and > additionally VMAs may change across merge/split). > > We utilise 0x800 for this flag which makes it available to 32-bit > architectures also, a flag that was previously used by VM_DENYWRITE, which > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't > bee reused yet. > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write > lock (and also VMA write lock) whereas previously it did not, but this > seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for write on the old mm, so if we kept taking mmap sem for read, then vma lock for read in addition (which should be cheap enough, also we'd only need it in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps that would cover all non-bening races?

We take VMA write lock in dup_mmap() on each mpnt (old VMA).

We take the VMA write lock (vma_start_write()) for each mpnt.

We then vm_area_dup() the mpnt to the new VMA before calling:

copy_page_range() -> vma_needs_copy()

Which is where the check is done.

So we are holding the VMA write lock, so a VMA read lock should suffice no?

For belts + braces we could atomically read the flag in vma_needs_copy(), though note it's intended VM_COPY_ON_FORK could have more than one flag.

We could drop that for now and be explicit.

Vlastimil Babka

9:48 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On 10/30/25 20:47, Lorenzo Stoakes wrote:

...

On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:

...
...
Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for write on the old mm, so if we kept taking mmap sem for read, then vma lock for read in addition (which should be cheap enough, also we'd only need it in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps that would cover all non-bening races?

We take VMA write lock in dup_mmap() on each mpnt (old VMA).

Ah yes I thought it was the new one.

...

We take the VMA write lock (vma_start_write()) for each mpnt.

We then vm_area_dup() the mpnt to the new VMA before calling:

copy_page_range() -> vma_needs_copy()

Which is where the check is done.

So we are holding the VMA write lock, so a VMA read lock should suffice no?

Yeah, even better!

...

For belts + braces we could atomically read the flag in vma_needs_copy(), though note it's intended VM_COPY_ON_FORK could have more than one flag.

We could drop that for now and be explicit.

Great!

Suren Baghdasaryan

31 Oct 31 Oct

11:12 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 2:48 PM Vlastimil Babka vbabka@suse.cz wrote:

...

On 10/30/25 20:47, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:

...
...
Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for write on the old mm, so if we kept taking mmap sem for read, then vma lock for read in addition (which should be cheap enough, also we'd only need it in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps that would cover all non-bening races?

We take VMA write lock in dup_mmap() on each mpnt (old VMA).

Ah yes I thought it was the new one.

...
We take the VMA write lock (vma_start_write()) for each mpnt.

We then vm_area_dup() the mpnt to the new VMA before calling:

copy_page_range() -> vma_needs_copy()

Which is where the check is done.

So we are holding the VMA write lock, so a VMA read lock should suffice no?

Yeah, even better!

...
For belts + braces we could atomically read the flag in vma_needs_copy(), though note it's intended VM_COPY_ON_FORK could have more than one flag.

We could drop that for now and be explicit.

Great!

Overall, I think it should be possible to set this flag atomically under VMA read-lock. However, if you introduce new vm_flags manipulation functions, please make sure they can't be used for other vm_flags. In Android I've seen several "interesting" attempts to update vm_flags under a read-lock (specifically in the page-fault path) and had to explain why that's a bad idea.

Lorenzo Stoakes

3 Nov 3 Nov

9:34 a.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Fri, Oct 31, 2025 at 04:12:51PM -0700, Suren Baghdasaryan wrote:

...

Overall, I think it should be possible to set this flag atomically under VMA read-lock. However, if you introduce new vm_flags manipulation functions, please make sure they can't be used for other vm_flags. In Android I've seen several "interesting" attempts to update vm_flags under a read-lock (specifically in the page-fault path) and had to explain why that's a bad idea.

Yeah agreed, so the idea would be to absolutely ring-fence any flag we do this with entirely. Probably a VM_WARN_ON_ONCE() for anybody trying it with other flags so bots can catch anybody being naughty.

That kind of 'interesting' stuff is another reason I prefer to really limit what drivers can do btw ;)

Will have a respin with this idea applied relatively soon hopefully.

Cheers, Lorenzo

Lorenzo Stoakes

5 Nov 5 Nov

7:48 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:

...

On 10/30/25 19:31, Vlastimil Babka wrote:

...
On 10/30/25 17:43, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote: > Currently, if a user needs to determine if guard regions are present in a > range, they have to scan all VMAs (or have knowledge of which ones might > have guard regions). > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to > pagemap") and the related commit a516403787e0 ("fs/proc: extend the > PAGEMAP_SCAN ioctl to report guard regions"), users can use either > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this > operation at a virtual address level. > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level > that guard regions exist in ranges. > > This patch remedies the situation by establishing a new VMA flag, > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is > uncertain because we cannot reasonably determine whether a > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and > additionally VMAs may change across merge/split). > > We utilise 0x800 for this flag which makes it available to 32-bit > architectures also, a flag that was previously used by VM_DENYWRITE, which > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't > bee reused yet. > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write > lock (and also VMA write lock) whereas previously it did not, but this > seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for write on the old mm, so if we kept taking mmap sem for read, then vma lock for read in addition (which should be cheap enough, also we'd only need it in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps that would cover all non-bening races?

So thinking about this again, taking mmap read lock will exclude any VMA write locks (which must hold mmap write lock), so no need to additionally take VMA read lock.

Also as mentioned later in thread, the invocation of vma_needs_copy() is performed under VMA write lock (and this mmap write lock) so no need to read atomically there either.

As per Pedro, we can treat other races as benign.

Merge/Split etc. are done under VMA write lock so there's no race that matters there.

The only other place we even care about this flag in is for /proc/$pid/smaps, but there it'd be benign as you'd happen not to observe the flag being set up at the point at which a concurrent guard region install is happening - something that you could race with _anyway_.

As Pedro says, remaining cases where you read flags would be benign, as the readers should never observe a meaningful torn read being a bitmap and given this flag only matters on fork and smaps.

So I think just having something that sets atomically with an allow list is fine, but making that very strict so literally just this flag is allowed (currently).

Vlastimil Babka

6 Nov 6 Nov

7:34 a.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On 11/5/25 20:48, Lorenzo Stoakes wrote:

...

On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:

...
On 10/30/25 19:31, Vlastimil Babka wrote:

...
On 10/30/25 17:43, Lorenzo Stoakes wrote:

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for write on the old mm, so if we kept taking mmap sem for read, then vma lock for read in addition (which should be cheap enough, also we'd only need it in case VM_MAYBE_GUARD is not yet set), and set the flag atomicaly, perhaps that would cover all non-bening races?

So thinking about this again, taking mmap read lock will exclude any VMA write locks (which must hold mmap write lock), so no need to additionally take VMA read lock.

Right. Of course it would be nice if we could later also replace the mmap read lock with vma read lock. Hopefully should be feasible because MADV_DONTNEED can do it and guard ranges perform similar kind of operations (with more complex page table handling necessary, IIRC but that should hopefully be still fine with vma read lock).

It's obviously not in scope of the series here, maybe just to keep in mind if that next step will be compatible with anything done here.

...

Also as mentioned later in thread, the invocation of vma_needs_copy() is performed under VMA write lock (and this mmap write lock) so no need to read atomically there either.

As per Pedro, we can treat other races as benign.

Merge/Split etc. are done under VMA write lock so there's no race that matters there.

The only other place we even care about this flag in is for /proc/$pid/smaps, but there it'd be benign as you'd happen not to observe the flag being set up at the point at which a concurrent guard region install is happening - something that you could race with _anyway_.

Right.

...

As Pedro says, remaining cases where you read flags would be benign, as the readers should never observe a meaningful torn read being a bitmap and given this flag only matters on fork and smaps.

So I think just having something that sets atomically with an allow list is fine, but making that very strict so literally just this flag is allowed (currently).

Ack.

Lorenzo Stoakes

30 Oct 30 Oct

7:16 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:

...

On 10/30/25 17:43, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:

...
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).

We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma

I mean, I just spoke about why I didn't think introducing an entirely new (afaik) one-sided atomic VMA flag write, so maybe deal with that first before proposing something new...

We probably can use VMA read lock (actually introduced _after_ I introduced guard regions I believe) if we do not need the write lock.

But I need to hear a compelling non-hand waving reason for introducing an entirely new mechanism like that which is going to violate an assumption we have about VMA flags everywhere...

...

write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for the oldmm) and it probably would't make sense to start doing it. Maybe we

I mean, the entire point is fork behaviour, so that makes this a non-starter...

...

could think of something to deal with this special case...

If we implemented that I guess we'd have to atomically read the VMA flag, but a race between the two would be very problematic.

Note that we'd also have to have a new mechanism for writing a VMA flag (this is entirely novel now) because vm_flags_set() itself _literally_ has vma_start_write() in it.

We'd also have to audit any and all cases where VMA flags are accessed to ensure that this new novel method does not cause any issues.

We're also inviting others to do this kind of thing (e.g. drivers).

Seems dangerous in itself.

In any case, since you're now asking for something entirely novel, you really have to provide a compelling argument as to why we can't write lock.

We can't allow races there or we break forking.

...

...
...
...
I also don't think it's at all necessary, see below.

...
define what "may have a guard page"), but it sounds much better than introducing lock contention. I don't think it is reasonable to add a write lock to a feature that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in question, and only over the span of time you are setting the gaurd region.

Don't we always need to take the mmap write lock when grabbing a VMA write lock as well?

Yup. But at the same time you're doing the kind of operations that'd use this you'd already be taking the lock anyway.

You don't hold it for long and you won't be doing this any more often than you'd be doing other write operations, which you're also not going to be holding up faults on other VMAs either (they can access other VMAs despite mmap write lock being held), so I don't think there's ay issue here.

...
...
When allocating thread stacks you'll be mapping things into memory which... take the write lock. malloc() if it goes to the kernel will also take the write lock.

But yes, good point, you're already serializing anyway. I don't think this is a big deal.

Indeed

Besides thread stacks, I'm thinking of the userspace allocators usecase

Thread stacks require mmap/VMA write lock to establish no?

...

(jemalloc etc) where guard regions were supposed to allow a cheap use-after-free protection by replacing their existing MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL.

First I'm hearing of this as far as I can recall... maybe worth mentioning at the time?

...

Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard regions moves from (worse but still reasonable) MADVISE_MMAP_READ_LOCK to the heavy

I think you're hand waving a lot here, do you have data to back anything up here?

Be good to get some timings with the different locks and some indication that the contention is meaningful or that this installation is in a hot-path.

Are these users not in any way manipulating VMAs (seems odd for a malloc but anyway)? Because then you're already having to take the write lock to do so.

process_madvise() can be used to hold the lock over the whole operation more efficiently for multiple guard region installations.

...

MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy price for that usecase :(

Based on what? Provide some data or argument to back this up please. Since you're essentially asking me to either throw away my series or implement a risky novel means of manipulating VMA flags I think you're going to have to do better than hand waving.

Thanks.

Lorenzo Stoakes

7:37 p.m.

New subject: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions

On Thu, Oct 30, 2025 at 07:16:56PM +0000, Lorenzo Stoakes wrote:

...

On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:

...
On 10/30/25 17:43, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:

...
On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:

...
On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:

...
On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote: > Currently, if a user needs to determine if guard regions are present in a > range, they have to scan all VMAs (or have knowledge of which ones might > have guard regions). > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to > pagemap") and the related commit a516403787e0 ("fs/proc: extend the > PAGEMAP_SCAN ioctl to report guard regions"), users can use either > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this > operation at a virtual address level. > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level > that guard regions exist in ranges. > > This patch remedies the situation by establishing a new VMA flag, > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is > uncertain because we cannot reasonably determine whether a > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and > additionally VMAs may change across merge/split). > > We utilise 0x800 for this flag which makes it available to 32-bit > architectures also, a flag that was previously used by VM_DENYWRITE, which > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't > bee reused yet. > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write > lock (and also VMA write lock) whereas previously it did not, but this > seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any workable way.

FWIW I think you could work it as an atomic flag and treat those races as benign (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag setting? I think the places that could race with us to cause RMW use vma

I mean, I just spoke about why I didn't think introducing an entirely new (afaik) one-sided atomic VMA flag write, so maybe deal with that first before proposing something new...

On the other hand, it's going to be difficult to get compelling data either way as will always be workload dependent etc.

So since you and Pedro both bring this up, and it'd be a pity to establish more stringent locking requirements here, let me look into an atomic write situation.

We'll need to tread carefully here but if we can achieve that it would obviously be entirely preferable to requiring write lock.

I'll dig into it some :)

BTW I do think a VMA read lock is entirely possible here as-is, so we should try to shift to that if we can make atomic VMA flag write here work.

Thanks, Lorenzo

Lorenzo Stoakes

29 Oct 29 Oct

4:50 p.m.

New subject: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags

It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:

* The destination VMA has VM_UFFD_WP set * The mapping is a PFN or mixed map * The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)

Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.

We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.

Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.

In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.

Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.

This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.

As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.

We update the VMA userland tests to account for the changes and, furthermore, in order to assert that the functionality is workingly correctly, update the new VMA and existing VMA merging logic to consider every permutation of the flag being set/not set in all VMAs being considered for merge.

As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- include/linux/mm.h | 32 ++++++++++++ mm/memory.c | 3 +- mm/vma.c | 22 ++++---- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++---- tools/testing/vma/vma_internal.h | 32 ++++++++++++ 5 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h index f963afa1b9de..a8811ba57150 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)

+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD + +/* + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA + * possesses it but the other does not, the merged VMA should nonetheless have + * applied to it: + * + * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains + * metadata which should be unconditionally propagated upon + * fork. When merging two VMAs, we encapsulate this range in + * the merged VMA, so the flag should be 'sticky' as a result. + */ +#define VM_STICKY VM_COPY_ON_FORK + +/* + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one + * of these flags and the other not does not preclude a merge. + * + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but + * dirty bit -- the caller should mark merged VMA as dirty. If + * dirty bit won't be excluded from comparison, we increase + * pressure on the memory system forcing the kernel to generate + * new VMAs when old one could be extended instead. + * + * VM_STICKY - If one VMA has flags which most be 'sticky', that is ones + * which should propagate to all VMAs, but the other does not, + * the merge should still proceed with the merge logic applying + * sticky flags to the final VMA. + */ +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY) + /* * mapping from the currently active vm_flags protection bits (the * low four bits) to a page protection mask.. diff --git a/mm/memory.c b/mm/memory.c index a2c79ee43d68..9528133e5147 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;

- /* Guard regions have momdified page tables that require copying. */ - if (src_vma->vm_flags & VM_MAYBE_GUARD) + if (src_vma->vm_flags & VM_COPY_ON_FORK) return true;

/* diff --git a/mm/vma.c b/mm/vma.c index 919d1fc63a52..50a6909c4be3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex

if (!mpol_equal(vmg->policy, vma_policy(vma))) return false; - /* - * VM_SOFTDIRTY should not prevent from VMA merging, if we - * match the flags but dirty bit -- the caller should mark - * merged VMA as dirty. If dirty bit won't be excluded from - * comparison, we increase pressure on the memory system forcing - * the kernel to generate new VMAs when old one could be - * extended instead. - */ - if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY) + if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE) return false; if (vma->vm_file != vmg->file) return false; @@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) { + vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY; struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *next; @@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next; + sticky_flags |= (next->vm_flags & VM_STICKY); }

if (merge_left) { vma_start_write(prev); vmg->target = prev; + sticky_flags |= (prev->vm_flags & VM_STICKY); }

if (merge_both) { @@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;

+ vm_flags_set(vmg->target, sticky_flags); khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target; @@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next; + vm_flags_t sticky_flags; + + sticky_flags = vmg->vm_flags & VM_STICKY; + sticky_flags |= target->vm_flags & VM_STICKY;

VM_WARN_ON_VMG(!target, vmg);

@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;

+ sticky_flags |= next->vm_flags & VM_STICKY; remove_next = true; /* This should already have been checked by this point. */ VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg); @@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;

+ vm_flags_set(target, sticky_flags); return 0;

nomem: @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file && - !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) && + !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); }

diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))

+#define IS_SET(_val, _flags) ((_val & _flags) == _flags) + static struct task_struct __current;

struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }

-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;

+ if (is_sticky) + vm_flags |= VM_STICKY; + /* * 0123456789abc * AA B CC */ vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); ASSERT_NE(vma_a, NULL); + if (a_is_sticky) + vm_flags_set(vma_a, VM_STICKY); /* We give each VMA a single avc so we can test anon_vma duplication. */ INIT_LIST_HEAD(&vma_a->anon_vma_chain); list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);

vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma_b, NULL); + if (b_is_sticky) + vm_flags_set(vma_b, VM_STICKY); INIT_LIST_HEAD(&vma_b->anon_vma_chain); list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);

vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags); ASSERT_NE(vma_c, NULL); + if (c_is_sticky) + vm_flags_set(vma_c, VM_STICKY); INIT_LIST_HEAD(&vma_c->anon_vma_chain); list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);

@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky || a_is_sticky || b_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

/* * Merge to PREVIOUS VMA. @@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky || a_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

/* * Merge to NEXT VMA. @@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky) /* D uses is_sticky. */ + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

/* * Merge BOTH sides. @@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (is_sticky || a_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

/* * Merge to NEXT VMA. @@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (is_sticky || c_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

/* * Merge BOTH sides. @@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1); + if (is_sticky || a_is_sticky || c_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

/* * Final state. @@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }

+static bool test_merge_new(void) +{ + int i, j, k, l; + + /* Generate every possible permutation of sticky flags. */ + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for (k = 0; k < 2; k++) + for (l = 0; l < 2; l++) + ASSERT_TRUE(__test_merge_new(i, j, k, l)); + + return true; +} + static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }

-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t prev_flags = vm_flags; + vm_flags_t next_flags = vm_flags; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next; @@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};

+ if (prev_is_sticky) + prev_flags |= VM_STICKY; + if (middle_is_sticky) + vm_flags |= VM_STICKY; + if (next_is_sticky) + next_flags |= VM_STICKY; + /* * Merge right case - partial span. * @@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.middle = vma; @@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2); + if (middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));

/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma); vmg.middle = vma; @@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1); + if (middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));

/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ @@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (prev_is_sticky || middle_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); @@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1); + if (prev_is_sticky || middle_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; @@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1); + if (prev_is_sticky || middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

/* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */

- vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);

vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags); vmg.prev = vma; @@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }

+static bool test_merge_existing(void) +{ + int i, j, k; + + /* Generate every possible permutation of sticky flags. */ + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for (k = 0; k < 2; k++) + ASSERT_TRUE(__test_merge_existing(i, j, k)); + + return true; +} + static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index e40c93edc5a7..3d9cb3a9411a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif

+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD + +/* + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA + * possesses it but the other does not, the merged VMA should nonetheless have + * applied to it: + * + * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains + * metadata which should be unconditionally propagated upon + * fork. When merging two VMAs, we encapsulate this range in + * the merged VMA, so the flag should be 'sticky' as a result. + */ +#define VM_STICKY VM_COPY_ON_FORK + +/* + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one + * of these flags and the other not does not preclude a merge. + * + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but + * dirty bit -- the caller should mark merged VMA as dirty. If + * dirty bit won't be excluded from comparison, we increase + * pressure on the memory system forcing the kernel to generate + * new VMAs when old one could be extended instead. + * + * VM_STICKY - If one VMA has flags which must be 'sticky', that is ones + * which should propagate to all VMAs, but the other does not, + * the merge should still proceed with the merge logic applying + * sticky flags to the final VMA. + */ +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY) + #define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL

-- 2.51.0

Suren Baghdasaryan

30 Oct 30 Oct

4:35 a.m.

New subject: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:

...

It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:

The destination VMA has VM_UFFD_WP set

The mapping is a PFN or mixed map

The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)

Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.

We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.

Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.

In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.

"successfully succeed" sounds weird. Just "succeed"?

...

Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.

This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.

As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.

We update the VMA userland tests to account for the changes and, furthermore, in order to assert that the functionality is workingly

s/workingly/working

...

correctly, update the new VMA and existing VMA merging logic to consider every permutation of the flag being set/not set in all VMAs being considered for merge.

As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com

include/linux/mm.h | 32 ++++++++++++ mm/memory.c | 3 +- mm/vma.c | 22 ++++---- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++---- tools/testing/vma/vma_internal.h | 32 ++++++++++++ 5 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h index f963afa1b9de..a8811ba57150 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)

+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD

+/*
Flags which should be 'sticky' on merge - that is, flags which, when one VMA

possesses it but the other does not, the merged VMA should nonetheless have

applied to it:

VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
              metadata which should be unconditionally propagated upon
              fork. When merging two VMAs, we encapsulate this range in
              the merged VMA, so the flag should be 'sticky' as a result.

It's probably worth noting that after a split, we do not remove "sticky" flags even if the VMA acquired them as a result of a previous merge.

...

*/

+#define VM_STICKY VM_COPY_ON_FORK

+/*
VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one

of these flags and the other not does not preclude a merge.

VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
           dirty bit -- the caller should mark merged VMA as dirty. If
           dirty bit won't be excluded from comparison, we increase
           pressure on the memory system forcing the kernel to generate
           new VMAs when old one could be extended instead.
VM_STICKY - If one VMA has flags which most be 'sticky', that is ones

s/most/must ?

...

           which should propagate to all VMAs, but the other does not,
           the merge should still proceed with the merge logic applying
           sticky flags to the final VMA.
*/
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)

/*

mapping from the currently active vm_flags protection bits (the

low four bits) to a page protection mask..

diff --git a/mm/memory.c b/mm/memory.c index a2c79ee43d68..9528133e5147 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
  /* Guard regions have momdified page tables that require copying. */
  if (src_vma->vm_flags & VM_MAYBE_GUARD)
  if (src_vma->vm_flags & VM_COPY_ON_FORK)
          return true;

  /*
diff --git a/mm/vma.c b/mm/vma.c index 919d1fc63a52..50a6909c4be3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
    if (!mpol_equal(vmg->policy, vma_policy(vma)))
            return false;
  /*
   * VM_SOFTDIRTY should not prevent from VMA merging, if we
   * match the flags but dirty bit -- the caller should mark
   * merged VMA as dirty. If dirty bit won't be excluded from
   * comparison, we increase pressure on the memory system forcing
   * the kernel to generate new VMAs when old one could be
   * extended instead.
   */
  if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
  if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE)
          return false;
  if (vma->vm_file != vmg->file)
          return false;
@@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) {
  vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY;
  struct vm_area_struct *middle = vmg->middle;
  struct vm_area_struct *prev = vmg->prev;
  struct vm_area_struct *next;
@@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next;
          sticky_flags |= (next->vm_flags & VM_STICKY);
  }

  if (merge_left) {
          vma_start_write(prev);
          vmg->target = prev;
          sticky_flags |= (prev->vm_flags & VM_STICKY);
  }

  if (merge_both) {
@@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;
  vm_flags_set(vmg->target, sticky_flags);
  khugepaged_enter_vma(vmg->target, vmg->vm_flags);
  vmg->state = VMA_MERGE_SUCCESS;
  return vmg->target;
@@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next;
  vm_flags_t sticky_flags;
  sticky_flags = vmg->vm_flags & VM_STICKY;
  sticky_flags |= target->vm_flags & VM_STICKY;

  VM_WARN_ON_VMG(!target, vmg);
@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;
          sticky_flags |= next->vm_flags & VM_STICKY;
          remove_next = true;
          /* This should already have been checked by this point. */
          VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
@@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;
  vm_flags_set(target, sticky_flags);
  return 0;
nomem: @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file &&
          !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
          !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) &&
          b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
}

diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c

I prefer tests in a separate patch, but that might just be me. Feel free to ignore.

...

@@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))

+#define IS_SET(_val, _flags) ((_val & _flags) == _flags)

static struct task_struct __current;

struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }

-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;
  if (is_sticky)
          vm_flags |= VM_STICKY;
  /*
   * 0123456789abc
   * AA B       CC
   */
  vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags);
  ASSERT_NE(vma_a, NULL);
  if (a_is_sticky)
          vm_flags_set(vma_a, VM_STICKY);
  /* We give each VMA a single avc so we can test anon_vma duplication. */
  INIT_LIST_HEAD(&vma_a->anon_vma_chain);
  list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);

  vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags);
  ASSERT_NE(vma_b, NULL);
  if (b_is_sticky)
          vm_flags_set(vma_b, VM_STICKY);
  INIT_LIST_HEAD(&vma_b->anon_vma_chain);
  list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);

  vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags);
  ASSERT_NE(vma_c, NULL);
  if (c_is_sticky)
          vm_flags_set(vma_c, VM_STICKY);
  INIT_LIST_HEAD(&vma_c->anon_vma_chain);
  list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
  if (is_sticky || a_is_sticky || b_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge to PREVIOUS VMA.
@@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
  if (is_sticky || a_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge to NEXT VMA.
@@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
  if (is_sticky) /* D uses is_sticky. */
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge BOTH sides.
@@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
  if (is_sticky || a_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge to NEXT VMA.
@@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
  if (is_sticky || c_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge BOTH sides.
@@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1);
  if (is_sticky || a_is_sticky || c_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Final state.
@@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }

+static bool test_merge_new(void) +{
  int i, j, k, l;
  /* Generate every possible permutation of sticky flags. */
  for (i = 0; i < 2; i++)
          for (j = 0; j < 2; j++)
                  for (k = 0; k < 2; k++)
                          for (l = 0; l < 2; l++)
                                  ASSERT_TRUE(__test_merge_new(i, j, k, l));
  return true;
+}

static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }

-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
  vm_flags_t prev_flags = vm_flags;
  vm_flags_t next_flags = vm_flags;
  struct mm_struct mm = {};
  VMA_ITERATOR(vmi, &mm, 0);
  struct vm_area_struct *vma, *vma_prev, *vma_next;
@@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};
  if (prev_is_sticky)
          prev_flags |= VM_STICKY;
  if (middle_is_sticky)
          vm_flags |= VM_STICKY;
  if (next_is_sticky)
          next_flags |= VM_STICKY;
  /*
   * Merge right case - partial span.
   *
@@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
  vma_next->vm_ops = &vm_ops; /* This should have no impact. */
  vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma);
  vmg.middle = vma;
@@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2);
  if (middle_is_sticky || next_is_sticky)
          ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));

  /* Clear down and reset. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
  vma_next->vm_ops = &vm_ops; /* This should have no impact. */
  vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma);
  vmg.middle = vma;
@@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1);
  if (middle_is_sticky || next_is_sticky)
          ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));

  /* Clear down and reset. We should have deleted vma. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
  vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
  vma->vm_ops = &vm_ops; /* This should have no impact. */
@@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
  if (prev_is_sticky || middle_is_sticky)
          ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

  /* Clear down and reset. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
  vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
  vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
@@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
  if (prev_is_sticky || middle_is_sticky)
          ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

  /* Clear down and reset. We should have deleted vma. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
  vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags);
  vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
  vmg.prev = vma_prev;
  vmg.middle = vma;
@@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
  if (prev_is_sticky || middle_is_sticky || next_is_sticky)
          ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

  /* Clear down and reset. We should have deleted prev and next. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);

  vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags);
  vmg.prev = vma;
@@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }

+static bool test_merge_existing(void) +{
  int i, j, k;
  /* Generate every possible permutation of sticky flags. */
  for (i = 0; i < 2; i++)
          for (j = 0; j < 2; j++)
                  for (k = 0; k < 2; k++)
                          ASSERT_TRUE(__test_merge_existing(i, j, k));
  return true;
+}

static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index e40c93edc5a7..3d9cb3a9411a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif

+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD

+/*
Flags which should be 'sticky' on merge - that is, flags which, when one VMA

possesses it but the other does not, the merged VMA should nonetheless have

applied to it:

VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
              metadata which should be unconditionally propagated upon
              fork. When merging two VMAs, we encapsulate this range in
              the merged VMA, so the flag should be 'sticky' as a result.
*/
+#define VM_STICKY VM_COPY_ON_FORK

+/*
VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one

of these flags and the other not does not preclude a merge.

VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
           dirty bit -- the caller should mark merged VMA as dirty. If
           dirty bit won't be excluded from comparison, we increase
           pressure on the memory system forcing the kernel to generate
           new VMAs when old one could be extended instead.
VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
           which should propagate to all VMAs, but the other does not,
           the merge should still proceed with the merge logic applying
           sticky flags to the final VMA.
*/
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)

#define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL

-- 2.51.0

Lorenzo Stoakes

8:25 a.m.

New subject: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags

On Wed, Oct 29, 2025 at 09:35:25PM -0700, Suren Baghdasaryan wrote:

...

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:

...
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:

The destination VMA has VM_UFFD_WP set

The mapping is a PFN or mixed map

The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)

Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.

We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.

Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.

In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.

"successfully succeed" sounds weird. Just "succeed"?

Yeah... typo bonanza this series :) will fix.

...

...
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.

This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.

As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.

We update the VMA userland tests to account for the changes and, furthermore, in order to assert that the functionality is workingly

s/workingly/working

Haha good lord. Will fix also!

...

...
correctly, update the new VMA and existing VMA merging logic to consider every permutation of the flag being set/not set in all VMAs being considered for merge.

As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com

include/linux/mm.h | 32 ++++++++++++ mm/memory.c | 3 +- mm/vma.c | 22 ++++---- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++---- tools/testing/vma/vma_internal.h | 32 ++++++++++++ 5 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h index f963afa1b9de..a8811ba57150 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)

+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD

+/*
Flags which should be 'sticky' on merge - that is, flags which, when one VMA

possesses it but the other does not, the merged VMA should nonetheless have

applied to it:

VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
              metadata which should be unconditionally propagated upon
              fork. When merging two VMAs, we encapsulate this range in
              the merged VMA, so the flag should be 'sticky' as a result.
It's probably worth noting that after a split, we do not remove "sticky" flags even if the VMA acquired them as a result of a previous merge.

Hm I thought this was implied. Will update to be clear however!

...

...
*/

+#define VM_STICKY VM_COPY_ON_FORK

+/*
VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one

of these flags and the other not does not preclude a merge.

VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
           dirty bit -- the caller should mark merged VMA as dirty. If
           dirty bit won't be excluded from comparison, we increase
           pressure on the memory system forcing the kernel to generate
           new VMAs when old one could be extended instead.
VM_STICKY - If one VMA has flags which most be 'sticky', that is ones
s/most/must ?

I most learn to not typo so much :)

Yes you're right, will fix! :P

...

...
           which should propagate to all VMAs, but the other does not,
           the merge should still proceed with the merge logic applying
           sticky flags to the final VMA.
*/
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)

/*

mapping from the currently active vm_flags protection bits (the

low four bits) to a page protection mask..

diff --git a/mm/memory.c b/mm/memory.c index a2c79ee43d68..9528133e5147 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
  /* Guard regions have momdified page tables that require copying. */
  if (src_vma->vm_flags & VM_MAYBE_GUARD)
  if (src_vma->vm_flags & VM_COPY_ON_FORK)
          return true;

  /*
diff --git a/mm/vma.c b/mm/vma.c index 919d1fc63a52..50a6909c4be3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
    if (!mpol_equal(vmg->policy, vma_policy(vma)))
            return false;
  /*
   * VM_SOFTDIRTY should not prevent from VMA merging, if we
   * match the flags but dirty bit -- the caller should mark
   * merged VMA as dirty. If dirty bit won't be excluded from
   * comparison, we increase pressure on the memory system forcing
   * the kernel to generate new VMAs when old one could be
   * extended instead.
   */
  if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
  if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE)
          return false;
  if (vma->vm_file != vmg->file)
          return false;
@@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) {
  vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY;
  struct vm_area_struct *middle = vmg->middle;
  struct vm_area_struct *prev = vmg->prev;
  struct vm_area_struct *next;
@@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next;
          sticky_flags |= (next->vm_flags & VM_STICKY);
  }

  if (merge_left) {
          vma_start_write(prev);
          vmg->target = prev;
          sticky_flags |= (prev->vm_flags & VM_STICKY);
  }

  if (merge_both) {
@@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;
  vm_flags_set(vmg->target, sticky_flags);
  khugepaged_enter_vma(vmg->target, vmg->vm_flags);
  vmg->state = VMA_MERGE_SUCCESS;
  return vmg->target;
@@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next;
  vm_flags_t sticky_flags;
  sticky_flags = vmg->vm_flags & VM_STICKY;
  sticky_flags |= target->vm_flags & VM_STICKY;

  VM_WARN_ON_VMG(!target, vmg);
@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;
          sticky_flags |= next->vm_flags & VM_STICKY;
          remove_next = true;
          /* This should already have been checked by this point. */
          VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
@@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;
  vm_flags_set(target, sticky_flags);
  return 0;
nomem: @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file &&
          !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
          !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) &&
          b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
}

diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c
I prefer tests in a separate patch, but that might just be me. Feel free to ignore.

Yeah can split it out! I do tend to do that actually, not sure why I deviated from that here.

...

...
@@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))

+#define IS_SET(_val, _flags) ((_val & _flags) == _flags)

static struct task_struct __current;

struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }

-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;
  if (is_sticky)
          vm_flags |= VM_STICKY;
  /*
   * 0123456789abc
   * AA B       CC
   */
  vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags);
  ASSERT_NE(vma_a, NULL);
  if (a_is_sticky)
          vm_flags_set(vma_a, VM_STICKY);
  /* We give each VMA a single avc so we can test anon_vma duplication. */
  INIT_LIST_HEAD(&vma_a->anon_vma_chain);
  list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);

  vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags);
  ASSERT_NE(vma_b, NULL);
  if (b_is_sticky)
          vm_flags_set(vma_b, VM_STICKY);
  INIT_LIST_HEAD(&vma_b->anon_vma_chain);
  list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);

  vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags);
  ASSERT_NE(vma_c, NULL);
  if (c_is_sticky)
          vm_flags_set(vma_c, VM_STICKY);
  INIT_LIST_HEAD(&vma_c->anon_vma_chain);
  list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
  if (is_sticky || a_is_sticky || b_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge to PREVIOUS VMA.
@@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
  if (is_sticky || a_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge to NEXT VMA.
@@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
  if (is_sticky) /* D uses is_sticky. */
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge BOTH sides.
@@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
  if (is_sticky || a_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge to NEXT VMA.
@@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
  if (is_sticky || c_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Merge BOTH sides.
@@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1);
  if (is_sticky || a_is_sticky || c_is_sticky)
          ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));

  /*
   * Final state.
@@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }

+static bool test_merge_new(void) +{
  int i, j, k, l;
  /* Generate every possible permutation of sticky flags. */
  for (i = 0; i < 2; i++)
          for (j = 0; j < 2; j++)
                  for (k = 0; k < 2; k++)
                          for (l = 0; l < 2; l++)
                                  ASSERT_TRUE(__test_merge_new(i, j, k, l));
  return true;
+}

static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }

-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
  vm_flags_t prev_flags = vm_flags;
  vm_flags_t next_flags = vm_flags;
  struct mm_struct mm = {};
  VMA_ITERATOR(vmi, &mm, 0);
  struct vm_area_struct *vma, *vma_prev, *vma_next;
@@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};
  if (prev_is_sticky)
          prev_flags |= VM_STICKY;
  if (middle_is_sticky)
          vm_flags |= VM_STICKY;
  if (next_is_sticky)
          next_flags |= VM_STICKY;
  /*
   * Merge right case - partial span.
   *
@@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
  vma_next->vm_ops = &vm_ops; /* This should have no impact. */
  vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma);
  vmg.middle = vma;
@@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2);
  if (middle_is_sticky || next_is_sticky)
          ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));

  /* Clear down and reset. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
  vma_next->vm_ops = &vm_ops; /* This should have no impact. */
  vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma);
  vmg.middle = vma;
@@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1);
  if (middle_is_sticky || next_is_sticky)
          ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));

  /* Clear down and reset. We should have deleted vma. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
  vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
  vma->vm_ops = &vm_ops; /* This should have no impact. */
@@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
  if (prev_is_sticky || middle_is_sticky)
          ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

  /* Clear down and reset. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
  vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
  vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
@@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
  if (prev_is_sticky || middle_is_sticky)
          ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

  /* Clear down and reset. We should have deleted vma. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
  vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags);
  vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
  vmg.prev = vma_prev;
  vmg.middle = vma;
@@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
  if (prev_is_sticky || middle_is_sticky || next_is_sticky)
          ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));

  /* Clear down and reset. We should have deleted prev and next. */
  ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
  vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
  vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
  vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);

  vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags);
  vmg.prev = vma;
@@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }

+static bool test_merge_existing(void) +{
  int i, j, k;
  /* Generate every possible permutation of sticky flags. */
  for (i = 0; i < 2; i++)
          for (j = 0; j < 2; j++)
                  for (k = 0; k < 2; k++)
                          ASSERT_TRUE(__test_merge_existing(i, j, k));
  return true;
+}

static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index e40c93edc5a7..3d9cb3a9411a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif

+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD

+/*
Flags which should be 'sticky' on merge - that is, flags which, when one VMA

possesses it but the other does not, the merged VMA should nonetheless have

applied to it:

VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
              metadata which should be unconditionally propagated upon
              fork. When merging two VMAs, we encapsulate this range in
              the merged VMA, so the flag should be 'sticky' as a result.
*/
+#define VM_STICKY VM_COPY_ON_FORK

+/*
VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one

of these flags and the other not does not preclude a merge.

VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
           dirty bit -- the caller should mark merged VMA as dirty. If
           dirty bit won't be excluded from comparison, we increase
           pressure on the memory system forcing the kernel to generate
           new VMAs when old one could be extended instead.
VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
           which should propagate to all VMAs, but the other does not,
           the merge should still proceed with the merge logic applying
           sticky flags to the final VMA.
*/
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)

#define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL

-- 2.51.0

Thanks for review!

Pedro Falcato

4:25 p.m.

New subject: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags

On Wed, Oct 29, 2025 at 04:50:32PM +0000, Lorenzo Stoakes wrote:

...

It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:

The destination VMA has VM_UFFD_WP set

The mapping is a PFN or mixed map

The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)

Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.

We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.

Do we want this to be sticky though? If you're looking for more granularity with this flag, the best option might be to stop merges from happening there. If not, I can imagine a VMA that merges with other VMAs far past the original guarded range, and thus you get no granularity (possibly, not even useful).

If you're _not_ looking for granularity, then maybe using a per-mm flag for guard ranges or some other solution would be superior?

The rest of the patch (superficially) looks good to me, though.

-- Pedro

Lorenzo Stoakes

4:34 p.m.

New subject: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags

On Thu, Oct 30, 2025 at 04:25:54PM +0000, Pedro Falcato wrote:

...

On Wed, Oct 29, 2025 at 04:50:32PM +0000, Lorenzo Stoakes wrote:

...
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:

The destination VMA has VM_UFFD_WP set

The mapping is a PFN or mixed map

The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)

Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.

We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.

Do we want this to be sticky though? If you're looking for more granularity

Yes?

...

with this flag, the best option might be to stop merges from happening there.

No?

That'd entirely break VMA merging for any VMA you are installing guard regions into.

That'd be a regression, as the property of guard regions belongs to the page tables which can propagate across split/merge.

Also, a key purpose of this flag is to be able to correctly propagate page tables on fork for file-backed VMAs.

Without this we have to install an anon_vma in file-backed VMAs (what we do now), which has all the same drawbacks.

...

If not, I can imagine a VMA that merges with other VMAs far past the original guarded range, and thus you get no granularity (possibly, not even useful).

Err? What? It gets you VMA granularity. You are always going to do better than 'anywhere in the entire mm'. Of course you can imagine scenarios where one VMA somehow dominates everything, or guard regions are removed etc. but in most cases you're not going to encounter that.

Also again, the _more important_ purpose here is correct behaviour on fork.

...

If you're _not_ looking for granularity, then maybe using a per-mm flag for guard ranges or some other solution would be superior?

I'm not sure what solution you think would be superior that wouldn't involve significant overhead in having to look up guard regions on split/merge.

This is a property that belongs to the page tables that we're relating to VMAs which may or may not contain page tables which have this property.

mm granularity would be utterly pointless and leave us with the same anon_vma hack.

...

The rest of the patch (superficially) looks good to me, though.

Well there's that at least :)

...

-- Pedro

Thanks, Lorenzo

Lorenzo Stoakes

29 Oct 29 Oct

4:50 p.m.

New subject: [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test

Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).

Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 126 insertions(+)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 8dd81c0a4a5a..a9be11e03a6a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self, case ANON_BACKED: flags |= MAP_PRIVATE | MAP_ANON; fd = -1; + offset = 0; break; case SHMEM_BACKED: case LOCAL_FILE_BACKED: @@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr) return true; }

+/* + * Some file systems have issues with merging due to changing merge-sensitive + * parameters in the .mmap callback, and prior to .mmap_prepare being + * implemented everywhere this will now result in an unexpected failure to + * merge (e.g. - overlayfs). + * + * Perform a simple test to see if the local file system suffers from this, if + * it does then we can skip test logic that assumes local file system merging is + * sane. + */ +static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self, + const FIXTURE_VARIANT(guard_regions) * variant) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr2; + struct procmap_fd procmap; + + if (variant->backing != LOCAL_FILE_BACKED) + return true; + + /* Map 10 pages. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + if (ptr == MAP_FAILED) + return false; + /* Unmap the middle. */ + munmap(&ptr[5 * page_size], page_size); + + /* Map again. */ + ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED, 5 * page_size); + + if (ptr2 == MAP_FAILED) + return false; + + /* Now make sure they all merged. */ + if (open_self_procmap(&procmap) != 0) + return false; + if (!find_vma_procmap(&procmap, ptr)) + return false; + if (procmap.query.vma_start != (unsigned long)ptr) + return false; + if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size) + return false; + close_procmap(&procmap); + + return true; +} + FIXTURE_SETUP(guard_regions) { self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); @@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); }

+TEST_F(guard_regions, smaps) +{ + const unsigned long page_size = self->page_size; + struct procmap_fd procmap; + char *ptr, *ptr2; + int i; + + /* Map a region. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* We shouldn't yet see a guard flag. */ + ASSERT_FALSE(check_vmflag_guard(ptr)); + + /* Install a single guard region. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0); + + /* Now we should see a guard flag. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + + /* + * Removing the guard region should not change things because we simply + * cannot accurately track whether a given VMA has had all of its guard + * regions removed. + */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0); + ASSERT_TRUE(check_vmflag_guard(ptr)); + + /* Install guard regions throughout. */ + for (i = 0; i < 10; i++) { + ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0); + /* We should always see the guard region flag. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + } + + /* Split into two VMAs. */ + ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0); + + /* Both VMAs should have the guard flag set. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size])); + + /* + * If the local file system is unable to merge VMAs due to having + * unusual characteristics, there is no point in asserting merge + * behaviour. + */ + if (!local_fs_has_sane_mmap(self, variant)) { + TH_LOG("local filesystem does not support sane merging skipping merge test"); + return; + } + + /* Map a fresh VMA between the two split VMAs. */ + ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Check the procmap to ensure that this VMA merged with the adjacent + * two. The guard region flag is 'sticky' so should not preclude + * merging. + */ + ASSERT_EQ(open_self_procmap(&procmap), 0); + ASSERT_TRUE(find_vma_procmap(&procmap, ptr)); + ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size); + ASSERT_EQ(close_procmap(&procmap), 0); + /* And, of course, this VMA should have the guard flag set. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index e33cda301dad..605cb58ea5c3 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr) return check_vmflag(addr, "pf"); }

+bool check_vmflag_guard(void *addr) +{ + return check_vmflag(addr, "gu"); +} + bool softdirty_supported(void) { char *addr; diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 26c30fdc0241..a8abdf414d46 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); bool check_vmflag_pfnmap(void *addr); +bool check_vmflag_guard(void *addr); int open_procmap(pid_t pid, struct procmap_fd *procmap_out); int query_procmap(struct procmap_fd *procmap); bool find_vma_procmap(struct procmap_fd *procmap, void *address);

-- 2.51.0

Suren Baghdasaryan

30 Oct 30 Oct

4:40 a.m.

New subject: [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:

...

Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).

Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com

Reviewed-by: Suren Baghdasaryan surenb@google.com

...

tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 126 insertions(+)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 8dd81c0a4a5a..a9be11e03a6a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self, case ANON_BACKED: flags |= MAP_PRIVATE | MAP_ANON; fd = -1;
          offset = 0;
          break;
  case SHMEM_BACKED:
  case LOCAL_FILE_BACKED:
@@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr) return true; }

+/*

Some file systems have issues with merging due to changing merge-sensitive

parameters in the .mmap callback, and prior to .mmap_prepare being

implemented everywhere this will now result in an unexpected failure to

merge (e.g. - overlayfs).

Perform a simple test to see if the local file system suffers from this, if

it does then we can skip test logic that assumes local file system merging is

sane.

*/

+static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self,
                             const FIXTURE_VARIANT(guard_regions) * variant)
+{
  const unsigned long page_size = self->page_size;
  char *ptr, *ptr2;
  struct procmap_fd procmap;
  if (variant->backing != LOCAL_FILE_BACKED)
          return true;
  /* Map 10 pages. */
  ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
  if (ptr == MAP_FAILED)
          return false;
  /* Unmap the middle. */
  munmap(&ptr[5 * page_size], page_size);
  /* Map again. */
  ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
               MAP_FIXED, 5 * page_size);
  if (ptr2 == MAP_FAILED)
          return false;
  /* Now make sure they all merged. */
  if (open_self_procmap(&procmap) != 0)
          return false;
  if (!find_vma_procmap(&procmap, ptr))
          return false;
  if (procmap.query.vma_start != (unsigned long)ptr)
          return false;
  if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size)
          return false;
  close_procmap(&procmap);
  return true;
+}

FIXTURE_SETUP(guard_regions) { self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); @@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); }

+TEST_F(guard_regions, smaps) +{
  const unsigned long page_size = self->page_size;
  struct procmap_fd procmap;
  char *ptr, *ptr2;
  int i;
  /* Map a region. */
  ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
  ASSERT_NE(ptr, MAP_FAILED);
  /* We shouldn't yet see a guard flag. */
  ASSERT_FALSE(check_vmflag_guard(ptr));
  /* Install a single guard region. */
  ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
  /* Now we should see a guard flag. */
  ASSERT_TRUE(check_vmflag_guard(ptr));
  /*
   * Removing the guard region should not change things because we simply
   * cannot accurately track whether a given VMA has had all of its guard
   * regions removed.
   */
  ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0);
  ASSERT_TRUE(check_vmflag_guard(ptr));
  /* Install guard regions throughout. */
  for (i = 0; i < 10; i++) {
          ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0);
          /* We should always see the guard region flag. */
          ASSERT_TRUE(check_vmflag_guard(ptr));
  }
  /* Split into two VMAs. */
  ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0);
  /* Both VMAs should have the guard flag set. */
  ASSERT_TRUE(check_vmflag_guard(ptr));
  ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size]));
  /*
   * If the local file system is unable to merge VMAs due to having
   * unusual characteristics, there is no point in asserting merge
   * behaviour.
   */
  if (!local_fs_has_sane_mmap(self, variant)) {
          TH_LOG("local filesystem does not support sane merging skipping merge test");
          return;
  }
  /* Map a fresh VMA between the two split VMAs. */
  ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size,
               PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size);
  ASSERT_NE(ptr2, MAP_FAILED);
  /*
   * Check the procmap to ensure that this VMA merged with the adjacent
   * two. The guard region flag is 'sticky' so should not preclude
   * merging.
   */
  ASSERT_EQ(open_self_procmap(&procmap), 0);
  ASSERT_TRUE(find_vma_procmap(&procmap, ptr));
  ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr);
  ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size);
  ASSERT_EQ(close_procmap(&procmap), 0);
  /* And, of course, this VMA should have the guard flag set. */
  ASSERT_TRUE(check_vmflag_guard(ptr));
+}

TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index e33cda301dad..605cb58ea5c3 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr) return check_vmflag(addr, "pf"); }

+bool check_vmflag_guard(void *addr) +{
  return check_vmflag(addr, "gu");
+}

bool softdirty_supported(void) { char *addr; diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 26c30fdc0241..a8abdf414d46 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); bool check_vmflag_pfnmap(void *addr); +bool check_vmflag_guard(void *addr); int open_procmap(pid_t pid, struct procmap_fd *procmap_out); int query_procmap(struct procmap_fd *procmap); bool find_vma_procmap(struct procmap_fd *procmap, void *address); -- 2.51.0

Lorenzo Stoakes

8:25 a.m.

New subject: [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test

On Wed, Oct 29, 2025 at 09:40:19PM -0700, Suren Baghdasaryan wrote:

...

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:

...
Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).

Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com

Reviewed-by: Suren Baghdasaryan surenb@google.com

Thanks!

days inactive

days old

linux-kselftest-mirror@lists.linaro.org

27 comments

participants

tags (0)

participants (5)

Lorenzo Stoakes
Pedro Falcato
Randy Dunlap
Suren Baghdasaryan
Vlastimil Babka