Hi all,
This is v2 of the VMA count patch series I previously posted at:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
I've split it into multiple patches to address the feedback.
The main changes in v2 are:
- Use a capacity-based check for the VMA count limit, per Lorenzo (a rough
  sketch of the idea follows below).
- Rename map_count to vma_count, per David.
- Add assertions for exceeding the limit, per Pedro.
- Add tests for max_vma_count, per Liam.
- Emit a trace event on failures due to insufficient capacity, for
  observability.
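
For reference, a rough sketch of the capacity-based idea (hypothetical
only; the real helper is added by "mm: introduce vma_count_remaining()"
and may differ):

	static inline int vma_count_remaining(const struct mm_struct *mm)
	{
		return max(sysctl_max_map_count - mm->map_count, 0);
	}

	/* callers then check remaining capacity rather than the raw limit: */
	if (!vma_count_remaining(mm))
		return -ENOMEM;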
Tested on x86_64 and arm64:
- Build test:
  - allyesconfig for the rename
- Selftests:
    cd tools/testing/selftests/mm && \
    make && \
    ./run_vmtests.sh -t max_vma_count
  (With trace_max_vma_count_exceeded enabled; a rough sketch of the event
  follows this list)
- vma tests:
    cd tools/testing/vma && \
    make && \
    ./vma
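
The max_vma_count_exceeded event is new in this series; purely for
illustration, a hypothetical sketch of what such an event could look like
(the real definition lands in the last patch and may differ):

	TRACE_EVENT(max_vma_count_exceeded,

		TP_PROTO(struct mm_struct *mm),

		TP_ARGS(mm),

		TP_STRUCT__entry(
			__field(int, map_count)
		),

		TP_fast_assign(
			__entry->map_count = mm->map_count;
		),

		TP_printk("map_count=%d", __entry->map_count)
	);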
Thanks, Kalesh
Kalesh Singh (7):
  mm: fix off-by-one error in VMA count limit checks
  mm/selftests: add max_vma_count tests
  mm: introduce vma_count_remaining()
  mm: rename mm_struct::map_count to vma_count
  mm: harden vma_count against direct modification
  mm: add assertion for VMA count limit
  mm/tracing: introduce max_vma_count_exceeded trace event
 fs/binfmt_elf.c                               |   2 +-
 fs/coredump.c                                 |   2 +-
 include/linux/mm.h                            |  35 +-
 include/linux/mm_types.h                      |   5 +-
 include/trace/events/vma.h                    |  32 +
 kernel/fork.c                                 |   2 +-
 mm/debug.c                                    |   2 +-
 mm/internal.h                                 |   1 +
 mm/mmap.c                                     |  28 +-
 mm/mremap.c                                   |  13 +-
 mm/nommu.c                                    |   8 +-
 mm/util.c                                     |   1 -
 mm/vma.c                                      |  88 ++-
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/max_vma_count_tests.c        | 709 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 tools/testing/vma/vma.c                       |  32 +-
 tools/testing/vma/vma_internal.h              |  44 +-
 18 files changed, 949 insertions(+), 61 deletions(-)
 create mode 100644 include/trace/events/vma.h
 create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
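
For example, with the default sysctl_max_map_count of 65530 and
mm->map_count already at 65530:

  old check: 65530 >  65530  -> false, the mapping is allowed and
             map_count grows to 65531, one above the limit
  new check: 65530 >= 65530  -> true, -ENOMEM is returned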
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Fix mmap check, per Pedro
 mm/mmap.c | 2 +-
 mm/vma.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7306253cc3b5..e5370e7fcd8f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
 	/*
diff --git a/mm/vma.c b/mm/vma.c
index 3b12c7579831..033a388bc4b1 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
lol.
x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
661
we really blew it that time!
Andrew Morton akpm@linux-foundation.org writes:
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
lol.
x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
661
we really blew it that time!
A few years back I made a list of the most-fixed commits in the kernel history; unsurprisingly that one (I call it the "original sin") came out on top:
https://lwn.net/Articles/914632/
Perhaps the time has come to update that analysis.
jon
On Tue, 16 Sep 2025 08:20:14 -0600 Jonathan Corbet corbet@lwn.net wrote:
Andrew Morton akpm@linux-foundation.org writes:
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
lol.
x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
661
we really blew it that time!
A few years back I made a list of the most-fixed commits in the kernel history; unsurprisingly that one (I call it the "original sin") came out on top:
https://lwn.net/Articles/914632/
Perhaps the time has come to update that analysis.
Hah, thanks, that's a fun article.
It's a shame it's so recent. I do recall back in maybe 2002 a fresh young developer popped up with a massive rework of the IDE drivers. I promptly took this into my tree because I was having a hate on the IDE drivers at the time. omfg, what a disaster - this was the buggiest bunch of junk conceivable and it went away after a few weeks.
I'm glad to say that the developer (email handle at the time was "haveblue") has become better since then.
On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Reviewed-by: Pedro Falcato pfalcato@suse.de
Looks good, thanks!
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh kaleshsingh@google.com wrote:
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Acked-by: SeongJae Park sj@kernel.org
Thanks, SJ
[...]
On 15.09.25 18:36, Kalesh Singh wrote:
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Reviewed-by: David Hildenbrand david@redhat.com
On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Reviewed-by: Pedro Falcato pfalcato@suse.de
On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Nice spot :)
And also 'doh!'
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
LGTM, so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Changes in v2:
- Fix mmap check, per Pedro
mm/mmap.c | 2 +- mm/vma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index 7306253cc3b5..e5370e7fcd8f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW;
/* Too many mappings? */
- if (mm->map_count > sysctl_max_map_count)
if (mm->map_count >= sysctl_max_map_count) return -ENOMEM;
/*
diff --git a/mm/vma.c b/mm/vma.c index 3b12c7579831..033a388bc4b1 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM;
- if (mm->map_count > sysctl_max_map_count)
if (mm->map_count >= sysctl_max_map_count) return -ENOMEM;
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
-- 2.51.0.384.g4c02a37b29-goog
Add a new selftest to verify that the max VMA count limit is correctly enforced.
This test suite checks that various VMA operations (mmap, mprotect, munmap, mremap) succeed or fail as expected when the number of VMAs is close to the sysctl_max_map_count limit.
The test works by first creating a large number of VMAs to bring the process close to the limit, and then performing various operations that may or may not create new VMAs. The test then verifies that the operations that would exceed the limit fail, and that the operations that do not exceed the limit succeed.
NOTE: munmap is special: it is allowed to temporarily exceed the limit by
one when a split is needed, since the count drops back to the limit once
the unmap succeeds.
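
For illustration, a rough userspace sketch of the 3-way split case (not
part of the test itself):

	#include <sys/mman.h>
	#include <unistd.h>

	void example(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		/* one 3-page VMA */
		char *area = mmap(NULL, 3 * page, PROT_NONE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/*
		 * Punching a hole in the middle leaves two VMAs, so the
		 * count briefly needs one extra slot while the split is
		 * performed.
		 */
		munmap(area + page, page);
	}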
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Add tests, per Liam (note that the do_brk_flags() path is not easily
  tested from userspace, so it's not included here). Exceeding the limit
  there should be uncommon.
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/max_vma_count_tests.c        | 709 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 3 files changed, 715 insertions(+)
 create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index d13b3cef2a2b..00a4b04eab06 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress TEST_GEN_FILES += uffd-unit-tests TEST_GEN_FILES += uffd-wp-mremap +TEST_GEN_FILES += max_vma_count_tests TEST_GEN_FILES += split_huge_page_test TEST_GEN_FILES += ksm_tests TEST_GEN_FILES += ksm_functional_tests diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c new file mode 100644 index 000000000000..c8401c03425c --- /dev/null +++ b/tools/testing/selftests/mm/max_vma_count_tests.c @@ -0,0 +1,709 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright 2025 Google LLC + */ +#define _GNU_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <unistd.h> +#include <errno.h> +#include <stdbool.h> +#include <linux/prctl.h> /* Definition of PR_* constants */ +#include <sys/prctl.h> + +#include "../kselftest.h" + +static int get_max_vma_count(void); +static bool set_max_vma_count(int val); +static int get_current_vma_count(void); +static bool is_current_vma_count(const char *msg, int expected); +static bool is_test_area_mapped(const char *msg); +static void print_surrounding_maps(const char *msg); + +/* Globals initialized in test_suite_setup() */ +static int MAX_VMA_COUNT; +static int ORIGINAL_MAX_VMA_COUNT; +static int PAGE_SIZE; +static int GUARD_SIZE; +static int TEST_AREA_SIZE; +static int EXTRA_MAP_SIZE; + +static int MAX_VMA_COUNT; + +static int NR_EXTRA_MAPS; + +static char *TEST_AREA; +static char *EXTRA_MAPS; + +#define DEFAULT_MAX_MAP_COUNT 65530 +#define TEST_AREA_NR_PAGES 3 +/* 1 before test area + 1 after test area + 1 after extra mappings */ +#define NR_GUARDS 3 +#define TEST_AREA_PROT (PROT_NONE) +#define EXTRA_MAP_PROT (PROT_NONE) + +/** + * test_suite_setup - Set up the VMA layout for VMA count testing. + * + * Sets up the following VMA layout: + * + * +----- base_addr + * | + * V + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+ + * | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap | + * | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) | + * | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | | + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+ + * ^ ^ ^ ^ ^ + * | | | | | + * +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+ + * (PAGE_SIZE) | | reach MAX_VMA_COUNT + * | | + * +--- TEST_AREA_SIZE ---+ + * | (3 * PAGE_SIZE) | + * ^ + * | + * +-- TEST_AREA starts here + * + * Populates TEST_AREA and other globals required for the tests. + * If successful, the current VMA count will be MAX_VMA_COUNT - 1. + * + * Return: true on success, false on failure. 
+ */ +static bool test_suite_setup(void) +{ + int initial_vma_count; + size_t reservation_size; + void *base_addr = NULL; + char *ptr = NULL; + + ksft_print_msg("Setting up vma_max_count test suite...\n"); + + /* Initialize globals */ + PAGE_SIZE = sysconf(_SC_PAGESIZE); + TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE; + GUARD_SIZE = PAGE_SIZE; + EXTRA_MAP_SIZE = PAGE_SIZE; + MAX_VMA_COUNT = get_max_vma_count(); + + MAX_VMA_COUNT = get_max_vma_count(); + if (MAX_VMA_COUNT < 0) { + ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n"); + return false; + } + + /* + * If the current limit is higher than the kernel default, + * we attempt to lower it to the default to ensure the test + * can run with a reliably known boundary. + */ + ORIGINAL_MAX_VMA_COUNT = 0; + + if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) { + ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT; + + ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n", + MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT); + + if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) { + ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n", + DEFAULT_MAX_MAP_COUNT); + ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n"); + + MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT; + + return false; + } + + /* Update MAX_VMA_COUNT for the test run */ + MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT; + } + + initial_vma_count = get_current_vma_count(); + if (initial_vma_count < 0) { + ksft_print_msg("Failed to read /proc/self/maps\n"); + return false; + } + + /* + * Calculate how many extra mappings we need to create to reach + * MAX_VMA_COUNT - 1 (excluding test area). + */ + NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count; + + if (NR_EXTRA_MAPS < 1) { + ksft_print_msg("Not enough available maps to run test\n"); + ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n", + MAX_VMA_COUNT, initial_vma_count); + return false; + } + + /* + * Reserve space for: + * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2) + * - The test area itself (TEST_AREA_NR_PAGES) + * - The guard pages (NR_GUARDS) + */ + reservation_size = ((NR_EXTRA_MAPS * 2) + + TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE; + + base_addr = mmap(NULL, reservation_size, PROT_NONE, + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); + if (base_addr == MAP_FAILED) { + ksft_print_msg("Failed tommap initial reservation\n"); + return false; + } + + if (munmap(base_addr, reservation_size) == -1) { + ksft_print_msg("Failed to munmap initial reservation\n"); + return false; + } + + /* Get the addr of the test area */ + TEST_AREA = (char *)base_addr + GUARD_SIZE; + + /* + * Get the addr of the region for extra mappings: + * test area + 1 guard. 
+ */ + EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE; + + /* Create single-page mappings separated by unmapped pages */ + ptr = EXTRA_MAPS; + for (int i = 0; i < NR_EXTRA_MAPS; ++i) { + if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT, + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, + -1, 0) == MAP_FAILED) { + perror("mmap in fill loop"); + ksft_print_msg("Failed on mapping #%d of %d\n", i + 1, + NR_EXTRA_MAPS); + return false; + } + + /* Advance pointer by 2 to leave a gap */ + ptr += (2 * EXTRA_MAP_SIZE); + } + + if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1)) + return false; + + ksft_print_msg("vma_max_count test suite setup done.\n"); + + return true; +} + +static void test_suite_teardown(void) +{ + if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) { + if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT)) + ksft_print_msg("Failed to restore max_map_count to %d\n", + ORIGINAL_MAX_VMA_COUNT); + } +} + +/* --- Test Helper Functions --- */ +static bool mmap_anon(void) +{ + void *addr = mmap(NULL, PAGE_SIZE, PROT_READ, + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); + + /* + * Handle cleanup here as the runner doesn't track where this, + *mapping is located. + */ + if (addr != MAP_FAILED) + munmap(addr, PAGE_SIZE); + + return addr != MAP_FAILED; +} + +static inline bool __mprotect(char *addr, int size) +{ + int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC); + + return mprotect(addr, size, new_prot) == 0; +} + +static bool mprotect_nosplit(void) +{ + return __mprotect(TEST_AREA, TEST_AREA_SIZE); +} + +static bool mprotect_2way_split(void) +{ + return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE); +} + +static bool mprotect_3way_split(void) +{ + return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE); +} + +static inline bool __munmap(char *addr, int size) +{ + return munmap(addr, size) == 0; +} + +static bool munmap_nosplit(void) +{ + return __munmap(TEST_AREA, TEST_AREA_SIZE); +} + +static bool munmap_2way_split(void) +{ + return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE); +} + +static bool munmap_3way_split(void) +{ + return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE); +} + +/* mremap accounts for the worst case to fail early */ +static const int MREMAP_REQUIRED_VMA_SLOTS = 6; + +static bool mremap_dontunmap(void) +{ + void *new_addr; + + /* + * Using MREMAP_DONTUNMAP will create a new mapping without + * removing the old one, consuming one VMA slot. + */ + new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE, + MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL); + + if (new_addr != MAP_FAILED) + munmap(new_addr, TEST_AREA_SIZE); + + return new_addr != MAP_FAILED; +} + +struct test { + const char *name; + bool (*test)(void); + /* How many VMA slots below the limit this test needs to start? 
*/ + int vma_slots_needed; + bool expect_success; +}; + +/* --- Test Cases --- */ +struct test tests[] = { + { + .name = "mmap_at_1_below_vma_count_limit", + .test = mmap_anon, + .vma_slots_needed = 1, + .expect_success = true, + }, + { + .name = "mmap_at_vma_count_limit", + .test = mmap_anon, + .vma_slots_needed = 0, + .expect_success = false, + }, + { + .name = "mprotect_nosplit_at_1_below_vma_count_limit", + .test = mprotect_nosplit, + .vma_slots_needed = 1, + .expect_success = true, + }, + { + .name = "mprotect_nosplit_at_vma_count_limit", + .test = mprotect_nosplit, + .vma_slots_needed = 0, + .expect_success = true, + }, + { + .name = "mprotect_2way_split_at_1_below_vma_count_limit", + .test = mprotect_2way_split, + .vma_slots_needed = 1, + .expect_success = true, + }, + { + .name = "mprotect_2way_split_at_vma_count_limit", + .test = mprotect_2way_split, + .vma_slots_needed = 0, + .expect_success = false, + }, + { + .name = "mprotect_3way_split_at_2_below_vma_count_limit", + .test = mprotect_3way_split, + .vma_slots_needed = 2, + .expect_success = true, + }, + { + .name = "mprotect_3way_split_at_1_below_vma_count_limit", + .test = mprotect_3way_split, + .vma_slots_needed = 1, + .expect_success = false, + }, + { + .name = "mprotect_3way_split_at_vma_count_limit", + .test = mprotect_3way_split, + .vma_slots_needed = 0, + .expect_success = false, + }, + { + .name = "munmap_nosplit_at_1_below_vma_count_limit", + .test = munmap_nosplit, + .vma_slots_needed = 1, + .expect_success = true, + }, + { + .name = "munmap_nosplit_at_vma_count_limit", + .test = munmap_nosplit, + .vma_slots_needed = 0, + .expect_success = true, + }, + { + .name = "munmap_2way_split_at_1_below_vma_count_limit", + .test = munmap_2way_split, + .vma_slots_needed = 1, + .expect_success = true, + }, + { + .name = "munmap_2way_split_at_vma_count_limit", + .test = munmap_2way_split, + .vma_slots_needed = 0, + .expect_success = true, + }, + { + .name = "munmap_3way_split_at_2_below_vma_count_limit", + .test = munmap_3way_split, + .vma_slots_needed = 2, + .expect_success = true, + }, + { + .name = "munmap_3way_split_at_1_below_vma_count_limit", + .test = munmap_3way_split, + .vma_slots_needed = 1, + .expect_success = true, + }, + { + .name = "munmap_3way_split_at_vma_count_limit", + .test = munmap_3way_split, + .vma_slots_needed = 0, + .expect_success = false, + }, + { + .name = "mremap_dontunmap_at_required_vma_count_capcity", + .test = mremap_dontunmap, + .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS, + .expect_success = true, + }, + { + .name = "mremap_dontunmap_at_1_below_required_vma_count_capacity", + .test = mremap_dontunmap, + .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1, + .expect_success = false, + }, +}; + +/* --- Test Runner --- */ +int main(int argc, char **argv) +{ + int num_tests = ARRAY_SIZE(tests); + int failed_tests = 0; + + ksft_set_plan(num_tests); + + if (!test_suite_setup() != 0) { + if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) + ksft_exit_skip("max_map_count too high and cannot be lowered\n" + "Please rerun as root.\n"); + else + ksft_exit_fail_msg("Test suite setup failed. Aborting.\n"); + + } + + for (int i = 0; i < num_tests; i++) { + int maps_to_unmap = tests[i].vma_slots_needed; + const char *name = tests[i].name; + bool test_passed; + + errno = 0; + + /* 1. 
Setup: TEST_AREA mapping */ + if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT, + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) + == MAP_FAILED) { + ksft_test_result_fail( + "%s: Test setup failed to map TEST_AREA\n", + name); + maps_to_unmap = 0; + goto fail; + } + + /* Label TEST_AREA to ease debugging */ + if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA, + TEST_AREA_SIZE, "TEST_AREA")) { + ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n", + name); + ksft_print_msg( + "Continuing without named TEST_AREA mapping\n"); + } + + /* 2. Setup: Adjust VMA count based on test requirements */ + if (maps_to_unmap > NR_EXTRA_MAPS) { + ksft_test_result_fail( + "%s: Test setup failed: Invalid VMA slots required %d\n", + name, tests[i].vma_slots_needed); + maps_to_unmap = 0; + goto fail; + } + + /* Unmap extra mappings, accounting for the 1-page gap */ + for (int j = 0; j < maps_to_unmap; j++) + munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE), + EXTRA_MAP_SIZE); + + /* + * 3. Verify the preconditions. + * + * Sometimes there isn't an easy way to determine the cause + * of the test failure. + * e.g. an mprotect ENOMEM may be due to trying to protect + * unmapped area or due to hitting MAX_VMA_COUNT limit. + * + * We verify the preconditions of the test to ensure any + * expected failures are from the expected cause and not + * coincidental. + */ + if (!is_current_vma_count(name, + MAX_VMA_COUNT - tests[i].vma_slots_needed)) + goto fail; + + if (!is_test_area_mapped(name)) + goto fail; + + /* 4. Run the test */ + test_passed = (tests[i].test() == tests[i].expect_success); + if (test_passed) { + ksft_test_result_pass("%s\n", name); + } else { +fail: + failed_tests++; + ksft_test_result_fail( + "%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n", + name, get_current_vma_count(), MAX_VMA_COUNT, + errno, strerror(errno)); + print_surrounding_maps(name); + } + + /* 5. Teardown: Unmap TEST_AREA. */ + munmap(TEST_AREA, TEST_AREA_SIZE); + + /* 6. 
Teardown: Restore extra mappings to test suite baseline */ + for (int j = 0; j < maps_to_unmap; j++) { + /* Remap extra mappings, accounting for the gap */ + mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE), + EXTRA_MAP_SIZE, EXTRA_MAP_PROT, + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, + -1, 0); + } + } + + test_suite_teardown(); + + if (failed_tests > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} + +/* --- Utilities --- */ + +static int get_max_vma_count(void) +{ + int max_count; + FILE *f; + + f = fopen("/proc/sys/vm/max_map_count", "r"); + if (!f) + return -1; + + if (fscanf(f, "%d", &max_count) != 1) + max_count = -1; + + + fclose(f); + + return max_count; +} + +static bool set_max_vma_count(int val) +{ + FILE *f; + bool success = false; + + f = fopen("/proc/sys/vm/max_map_count", "w"); + if (!f) + return false; + + if (fprintf(f, "%d", val) > 0) + success = true; + + fclose(f); + return success; +} + +static int get_current_vma_count(void) +{ + char line[1024]; + int count = 0; + FILE *f; + + f = fopen("/proc/self/maps", "r"); + if (!f) + return -1; + + while (fgets(line, sizeof(line), f)) { + if (!strstr(line, "[vsyscall]")) + count++; + } + + fclose(f); + + return count; +} + +static bool is_current_vma_count(const char *msg, int expected) +{ + int current = get_current_vma_count(); + + if (current == expected) + return true; + + ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current, + expected); + return false; +} + +static bool is_test_area_mapped(const char *msg) +{ + unsigned long search_start = (unsigned long)TEST_AREA; + unsigned long search_end = search_start + TEST_AREA_SIZE; + bool found = false; + char line[1024]; + FILE *f; + + f = fopen("/proc/self/maps", "r"); + if (!f) { + ksft_print_msg("failed to open /proc/self/maps\n"); + return false; + } + + while (fgets(line, sizeof(line), f)) { + unsigned long start, end; + + if (sscanf(line, "%lx-%lx", &start, &end) != 2) + continue; + + /* Check for an exact match of the range */ + if (start == search_start && end == search_end) { + found = true; + break; + } else if (start > search_end) { + /* + *Since maps are sorted, if we've passed the end, we + * can stop searching. + */ + break; + } + } + + fclose(f); + + if (found) + return true; + + /* Not found */ + ksft_print_msg( + "%s: TEST_AREA is not mapped as a single contiguous block.\n", + msg); + print_surrounding_maps(msg); + + return false; +} + +static void print_surrounding_maps(const char *msg) +{ + unsigned long search_start = (unsigned long)TEST_AREA; + unsigned long search_end = search_start + TEST_AREA_SIZE; + unsigned long start; + unsigned long end; + char line[1024] = {}; + int line_idx = 0; + int first_match_idx = -1; + int last_match_idx = -1; + FILE *f; + + f = fopen("/proc/self/maps", "r"); + if (!f) + return; + + if (msg) + ksft_print_msg("%s\n", msg); + + ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n", + TEST_AREA); + + /* First pass: Read all lines and find the range of matching entries */ + fseek(f, 0, SEEK_SET); /* Rewind file */ + while (fgets(line, sizeof(line), f)) { + if (sscanf(line, "%lx-%lx", &start, &end) != 2) { + line_idx++; + continue; + } + + /* Check for any overlap */ + if (start < search_end && end > search_start) { + if (first_match_idx == -1) + first_match_idx = line_idx; + last_match_idx = line_idx; + } else if (start > search_end) { + /* + * Since maps are sorted, if we've passed the end, we + * can stop searching. 
+ */ + break; + } + + line_idx++; + } + + if (first_match_idx == -1) { + ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n", + TEST_AREA); + } else { + /* Second pass: Print the relevant lines */ + fseek(f, 0, SEEK_SET); /* Rewind file */ + line_idx = 0; + while (fgets(line, sizeof(line), f)) { + /* Print 2 lines before the first match */ + if (line_idx >= first_match_idx - 2 && + line_idx < first_match_idx) + ksft_print_msg(" %s", line); + + /* Print all matching TEST_AREA entries */ + if (line_idx >= first_match_idx && + line_idx <= last_match_idx) + ksft_print_msg(">> %s", line); + + /* Print 2 lines after the last match */ + if (line_idx > last_match_idx && + line_idx <= last_match_idx + 2) + ksft_print_msg(" %s", line); + + line_idx++; + } + } + + ksft_print_msg("--------------------------------------------------\n"); + + fclose(f); +} diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 471e539d82b8..3794b50ec280 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -49,6 +49,8 @@ separated by spaces: test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options - madv_populate test memadvise(2) MADV_POPULATE_{READ,WRITE} options +- max_vma_count + tests for max vma_count - memfd_secret test memfd_secret(2) - process_mrelease @@ -417,6 +419,9 @@ fi # VADDR64 # vmalloc stability smoke test CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
+# test operations against max vma count limit
+CATEGORY="max_vma_count" run_test ./max_vma_count_tests
+
 CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
- test_suite_setup - Set up the VMA layout for VMA count testing.
- Sets up the following VMA layout:
- +----- base_addr
- |
- V
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
- | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
- | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
- | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
- ^ ^ ^ ^ ^
- | | | | |
- +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
- (PAGE_SIZE) | | reach MAX_VMA_COUNT
| |
+--- TEST_AREA_SIZE ---+
| (3 * PAGE_SIZE) |
^
|
+-- TEST_AREA starts here
Just wondering if we could find a different name than "guard page" here, to not confuse stuff with guard ptes
Will the current "guard page" be a valid vma or just a hole?
On Wed, Sep 17, 2025 at 3:58 AM 'David Hildenbrand' via kernel-team kernel-team@android.com wrote:
- test_suite_setup - Set up the VMA layout for VMA count testing.
- Sets up the following VMA layout:
- +----- base_addr
- |
- V
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
- | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
- | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
- | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
- ^ ^ ^ ^ ^
- | | | | |
- +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
- (PAGE_SIZE) | | reach MAX_VMA_COUNT
| |
+--- TEST_AREA_SIZE ---+
| (3 * PAGE_SIZE) |
^
|
+-- TEST_AREA starts here
Hi David,
Thanks for the reviews.
Just wondering if we could find a different name than "guard page" here, to not confuse stuff with guard ptes
Will the current "guard page" be a valid vma or just a hole?
It's a hole to prevent coincidental merging of adjacent VMAs. I can rename it to HOLE in the next revision.
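
Without the holes, adjacent anonymous mappings with identical flags and
protections would typically be merged into a single VMA and so would not
each count toward the limit. A rough illustration (hypothetical fixed
address, error handling omitted):

	long page = sysconf(_SC_PAGESIZE);
	char *base = (char *)0x700000000000;	/* arbitrary, illustration only */

	mmap(base, page, PROT_NONE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	mmap(base + page, page, PROT_NONE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	/* usually shows up as a single 2-page entry in /proc/self/maps */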
Thanks, Kalesh
-- Cheers
David / dhildenb
To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
On Mon, Sep 15, 2025 at 09:36:33AM -0700, Kalesh Singh wrote:
Add a new selftest to verify that the max VMA count limit is correctly enforced.
This test suite checks that various VMA operations (mmap, mprotect, munmap, mremap) succeed or fail as expected when the number of VMAs is close to the sysctl_max_map_count limit.
The test works by first creating a large number of VMAs to bring the process close to the limit, and then performing various operations that may or may not create new VMAs. The test then verifies that the operations that would exceed the limit fail, and that the operations that do not exceed the limit succeed.
NOTE: munmap is special as it's allowed to temporarily exceed the limit by one for splits as this will decrease back to the limit once the unmap succeeds.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Changes in v2:
- Add tests, per Liam (note that the do_brk_flags() path is not easily tested from userspace, so it's not included here). Exceeding the limit there should be uncommon.
tools/testing/selftests/mm/Makefile | 1 + .../selftests/mm/max_vma_count_tests.c | 709 ++++++++++++++++++
If you add a new file, update MAINTAINERS. Should put in MEMORY MAPPING section.
tools/testing/selftests/mm/run_vmtests.sh | 5 +
You're missing a .gitignore.
3 files changed, 715 insertions(+) create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index d13b3cef2a2b..00a4b04eab06 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress TEST_GEN_FILES += uffd-unit-tests TEST_GEN_FILES += uffd-wp-mremap +TEST_GEN_FILES += max_vma_count_tests TEST_GEN_FILES += split_huge_page_test TEST_GEN_FILES += ksm_tests TEST_GEN_FILES += ksm_functional_tests diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c new file mode 100644 index 000000000000..c8401c03425c --- /dev/null +++ b/tools/testing/selftests/mm/max_vma_count_tests.c @@ -0,0 +1,709 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- Copyright 2025 Google LLC
- */
+#define _GNU_SOURCE
+#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <unistd.h> +#include <errno.h> +#include <stdbool.h> +#include <linux/prctl.h> /* Definition of PR_* constants */ +#include <sys/prctl.h>
+#include "../kselftest.h"
+static int get_max_vma_count(void); +static bool set_max_vma_count(int val); +static int get_current_vma_count(void); +static bool is_current_vma_count(const char *msg, int expected); +static bool is_test_area_mapped(const char *msg); +static void print_surrounding_maps(const char *msg);
+/* Globals initialized in test_suite_setup() */ +static int MAX_VMA_COUNT; +static int ORIGINAL_MAX_VMA_COUNT; +static int PAGE_SIZE; +static int GUARD_SIZE; +static int TEST_AREA_SIZE; +static int EXTRA_MAP_SIZE;
+static int MAX_VMA_COUNT;
+static int NR_EXTRA_MAPS;
+static char *TEST_AREA; +static char *EXTRA_MAPS;
+#define DEFAULT_MAX_MAP_COUNT 65530 +#define TEST_AREA_NR_PAGES 3 +/* 1 before test area + 1 after test area + 1 after extra mappings */ +#define NR_GUARDS 3 +#define TEST_AREA_PROT (PROT_NONE) +#define EXTRA_MAP_PROT (PROT_NONE)
+/**
- test_suite_setup - Set up the VMA layout for VMA count testing.
- Sets up the following VMA layout:
- +----- base_addr
- |
- V
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
- | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
- | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
- | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
I have no idea what a 'guard page' is here. You're not using MADV_GUARD_INSTALL so it's presumably an arbitrary PROT_NONE page or something?
- ^ ^ ^ ^ ^
- | | | | |
- +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
- (PAGE_SIZE) | | reach MAX_VMA_COUNT
| |
+--- TEST_AREA_SIZE ---+
| (3 * PAGE_SIZE) |
^
|
+-- TEST_AREA starts here
Can we make this vertical, it's obv >80 chars and it's kind of hard to read.
Also isn't this hugely convoluted? Do we really need to set up such a complicated set of VMAs for every test?
- Populates TEST_AREA and other globals required for the tests.
- If successful, the current VMA count will be MAX_VMA_COUNT - 1.
- Return: true on success, false on failure.
- */
+static bool test_suite_setup(void) +{
- int initial_vma_count;
- size_t reservation_size;
- void *base_addr = NULL;
- char *ptr = NULL;
- ksft_print_msg("Setting up vma_max_count test suite...\n");
- /* Initialize globals */
- PAGE_SIZE = sysconf(_SC_PAGESIZE);
- TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
- GUARD_SIZE = PAGE_SIZE;
- EXTRA_MAP_SIZE = PAGE_SIZE;
- MAX_VMA_COUNT = get_max_vma_count();
Dude we are not COBOL programmers :) What is it with these capitalised variable names?
This is really horrible, please don't do that. And why on earth are you using globals like this.
(Again) use the self test harness please :)
- MAX_VMA_COUNT = get_max_vma_count();
- if (MAX_VMA_COUNT < 0) {
ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
return false;
- }
- /*
* If the current limit is higher than the kernel default,
* we attempt to lower it to the default to ensure the test
* can run with a reliably known boundary.
*/
- ORIGINAL_MAX_VMA_COUNT = 0;
- if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n",
DEFAULT_MAX_MAP_COUNT);
Probably partly a product of you not using the harness, but you're violating the max chars per line all over the place.
checkpatch.pl kinda has a fit with this file, can we just try and keep it to ~80 chars please?
ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
return false;
}
/* Update MAX_VMA_COUNT for the test run */
MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
- }
- initial_vma_count = get_current_vma_count();
- if (initial_vma_count < 0) {
ksft_print_msg("Failed to read /proc/self/maps\n");
return false;
- }
- /*
* Calculate how many extra mappings we need to create to reach
* MAX_VMA_COUNT - 1 (excluding test area).
*/
- NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
- if (NR_EXTRA_MAPS < 1) {
ksft_print_msg("Not enough available maps to run test\n");
ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
MAX_VMA_COUNT, initial_vma_count);
return false;
- }
- /*
* Reserve space for:
* - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
* - The test area itself (TEST_AREA_NR_PAGES)
* - The guard pages (NR_GUARDS)
*/
- reservation_size = ((NR_EXTRA_MAPS * 2) +
TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
- base_addr = mmap(NULL, reservation_size, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
- if (base_addr == MAP_FAILED) {
ksft_print_msg("Failed tommap initial reservation\n");
return false;
- }
- if (munmap(base_addr, reservation_size) == -1) {
ksft_print_msg("Failed to munmap initial reservation\n");
return false;
- }
- /* Get the addr of the test area */
- TEST_AREA = (char *)base_addr + GUARD_SIZE;
- /*
* Get the addr of the region for extra mappings:
* test area + 1 guard.
*/
- EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
- /* Create single-page mappings separated by unmapped pages */
- ptr = EXTRA_MAPS;
- for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
-1, 0) == MAP_FAILED) {
perror("mmap in fill loop");
ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
NR_EXTRA_MAPS);
return false;
}
/* Advance pointer by 2 to leave a gap */
ptr += (2 * EXTRA_MAP_SIZE);
- }
- if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
return false;
- ksft_print_msg("vma_max_count test suite setup done.\n");
- return true;
+}
I mean it'll be fixed by the test harness, but even in test code let's please not write MASSIVE functions that do a million things. It's not really readable.
+static void test_suite_teardown(void) +{
- if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
ksft_print_msg("Failed to restore max_map_count to %d\n",
ORIGINAL_MAX_VMA_COUNT);
- }
+}
+/* --- Test Helper Functions --- */ +static bool mmap_anon(void) +{
- void *addr = mmap(NULL, PAGE_SIZE, PROT_READ,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
- /*
* Handle cleanup here as the runner doesn't track where this,
*mapping is located.
*/
- if (addr != MAP_FAILED)
munmap(addr, PAGE_SIZE);
- return addr != MAP_FAILED;
+}
+static inline bool __mprotect(char *addr, int size) +{
- int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
- return mprotect(addr, size, new_prot) == 0;
+}
+static bool mprotect_nosplit(void) +{
- return __mprotect(TEST_AREA, TEST_AREA_SIZE);
+}
+static bool mprotect_2way_split(void) +{
- return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+static bool mprotect_3way_split(void) +{
- return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+static inline bool __munmap(char *addr, int size) +{
- return munmap(addr, size) == 0;
+}
+static bool munmap_nosplit(void) +{
- return __munmap(TEST_AREA, TEST_AREA_SIZE);
+}
+static bool munmap_2way_split(void) +{
- return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+static bool munmap_3way_split(void) +{
- return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+/* mremap accounts for the worst case to fail early */ +static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
+static bool mremap_dontunmap(void) +{
- void *new_addr;
- /*
* Using MREMAP_DONTUNMAP will create a new mapping without
* removing the old one, consuming one VMA slot.
*/
- new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
- if (new_addr != MAP_FAILED)
munmap(new_addr, TEST_AREA_SIZE);
- return new_addr != MAP_FAILED;
+}
+struct test {
- const char *name;
- bool (*test)(void);
/* How many VMA slots below the limit this test needs to start? */
- int vma_slots_needed;
- bool expect_success;
+};
+/* --- Test Cases --- */ +struct test tests[] = {
- {
.name = "mmap_at_1_below_vma_count_limit",
.test = mmap_anon,
.vma_slots_needed = 1,
.expect_success = true,
- },
- {
.name = "mmap_at_vma_count_limit",
.test = mmap_anon,
.vma_slots_needed = 0,
.expect_success = false,
- },
- {
.name = "mprotect_nosplit_at_1_below_vma_count_limit",
.test = mprotect_nosplit,
.vma_slots_needed = 1,
.expect_success = true,
- },
- {
.name = "mprotect_nosplit_at_vma_count_limit",
.test = mprotect_nosplit,
.vma_slots_needed = 0,
.expect_success = true,
- },
- {
.name = "mprotect_2way_split_at_1_below_vma_count_limit",
.test = mprotect_2way_split,
.vma_slots_needed = 1,
.expect_success = true,
- },
- {
.name = "mprotect_2way_split_at_vma_count_limit",
.test = mprotect_2way_split,
.vma_slots_needed = 0,
.expect_success = false,
- },
- {
.name = "mprotect_3way_split_at_2_below_vma_count_limit",
.test = mprotect_3way_split,
.vma_slots_needed = 2,
.expect_success = true,
- },
- {
.name = "mprotect_3way_split_at_1_below_vma_count_limit",
.test = mprotect_3way_split,
.vma_slots_needed = 1,
.expect_success = false,
- },
- {
.name = "mprotect_3way_split_at_vma_count_limit",
.test = mprotect_3way_split,
.vma_slots_needed = 0,
.expect_success = false,
- },
- {
.name = "munmap_nosplit_at_1_below_vma_count_limit",
.test = munmap_nosplit,
.vma_slots_needed = 1,
.expect_success = true,
- },
- {
.name = "munmap_nosplit_at_vma_count_limit",
.test = munmap_nosplit,
.vma_slots_needed = 0,
.expect_success = true,
- },
- {
.name = "munmap_2way_split_at_1_below_vma_count_limit",
.test = munmap_2way_split,
.vma_slots_needed = 1,
.expect_success = true,
- },
- {
.name = "munmap_2way_split_at_vma_count_limit",
.test = munmap_2way_split,
.vma_slots_needed = 0,
.expect_success = true,
- },
- {
.name = "munmap_3way_split_at_2_below_vma_count_limit",
.test = munmap_3way_split,
.vma_slots_needed = 2,
.expect_success = true,
- },
- {
.name = "munmap_3way_split_at_1_below_vma_count_limit",
.test = munmap_3way_split,
.vma_slots_needed = 1,
.expect_success = true,
- },
- {
.name = "munmap_3way_split_at_vma_count_limit",
.test = munmap_3way_split,
.vma_slots_needed = 0,
.expect_success = false,
- },
- {
.name = "mremap_dontunmap_at_required_vma_count_capcity",
.test = mremap_dontunmap,
.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
.expect_success = true,
- },
- {
.name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
.test = mremap_dontunmap,
.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
.expect_success = false,
- },
This is horrible. We don't need to do it this way.
Use the kselftest_harness please. See the guard_regions.c test for an example of how to use it.
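
For reference, a minimal sketch of the harness style (the fixture layout
and helper reuse here are illustrative only, not the required conversion):

	#include "../kselftest_harness.h"

	FIXTURE(max_vma_count)
	{
		int max_vma_count;	/* read from /proc in setup */
	};

	FIXTURE_SETUP(max_vma_count)
	{
		/* reuse the patch's helper; abort the fixture if unreadable */
		self->max_vma_count = get_max_vma_count();
		ASSERT_GT(self->max_vma_count, 0);
	}

	FIXTURE_TEARDOWN(max_vma_count)
	{
		/* restore /proc/sys/vm/max_map_count here if it was lowered */
	}

	TEST_F(max_vma_count, mmap_at_limit_fails)
	{
		/* fill up to the limit, then expect mmap() to fail with ENOMEM */
	}

	TEST_HARNESS_MAIN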
+};
+/* --- Test Runner --- */ +int main(int argc, char **argv) +{
- int num_tests = ARRAY_SIZE(tests);
- int failed_tests = 0;
- ksft_set_plan(num_tests);
- if (!test_suite_setup() != 0) {
if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
ksft_exit_skip("max_map_count too high and cannot be lowered\n"
"Please rerun as root.\n");
else
ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
- }
- for (int i = 0; i < num_tests; i++) {
int maps_to_unmap = tests[i].vma_slots_needed;
const char *name = tests[i].name;
bool test_passed;
errno = 0;
/* 1. Setup: TEST_AREA mapping */
if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
== MAP_FAILED) {
ksft_test_result_fail(
"%s: Test setup failed to map TEST_AREA\n",
name);
maps_to_unmap = 0;
goto fail;
}
/* Label TEST_AREA to ease debugging */
if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
TEST_AREA_SIZE, "TEST_AREA")) {
ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
name);
ksft_print_msg(
"Continuing without named TEST_AREA mapping\n");
}
/* 2. Setup: Adjust VMA count based on test requirements */
if (maps_to_unmap > NR_EXTRA_MAPS) {
ksft_test_result_fail(
"%s: Test setup failed: Invalid VMA slots required %d\n",
name, tests[i].vma_slots_needed);
maps_to_unmap = 0;
goto fail;
}
/* Unmap extra mappings, accounting for the 1-page gap */
for (int j = 0; j < maps_to_unmap; j++)
munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
EXTRA_MAP_SIZE);
/*
* 3. Verify the preconditions.
*
* Sometimes there isn't an easy way to determine the cause
* of the test failure.
* e.g. an mprotect ENOMEM may be due to trying to protect
* unmapped area or due to hitting MAX_VMA_COUNT limit.
*
* We verify the preconditions of the test to ensure any
* expected failures are from the expected cause and not
* coincidental.
*/
if (!is_current_vma_count(name,
MAX_VMA_COUNT - tests[i].vma_slots_needed))
goto fail;
if (!is_test_area_mapped(name))
goto fail;
/* 4. Run the test */
test_passed = (tests[i].test() == tests[i].expect_success);
if (test_passed) {
ksft_test_result_pass("%s\n", name);
} else {
+fail:
failed_tests++;
ksft_test_result_fail(
"%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
name, get_current_vma_count(), MAX_VMA_COUNT,
errno, strerror(errno));
print_surrounding_maps(name);
}
/* 5. Teardown: Unmap TEST_AREA. */
munmap(TEST_AREA, TEST_AREA_SIZE);
/* 6. Teardown: Restore extra mappings to test suite baseline */
for (int j = 0; j < maps_to_unmap; j++) {
/* Remap extra mappings, accounting for the gap */
mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
-1, 0);
}
- }
- test_suite_teardown();
- if (failed_tests > 0)
ksft_exit_fail();
- else
ksft_exit_pass();
+}
+/* --- Utilities --- */
+static int get_max_vma_count(void) +{
- int max_count;
- FILE *f;
- f = fopen("/proc/sys/vm/max_map_count", "r");
- if (!f)
return -1;
- if (fscanf(f, "%d", &max_count) != 1)
max_count = -1;
- fclose(f);
- return max_count;
+}
+static bool set_max_vma_count(int val) +{
- FILE *f;
- bool success = false;
- f = fopen("/proc/sys/vm/max_map_count", "w");
- if (!f)
return false;
- if (fprintf(f, "%d", val) > 0)
success = true;
- fclose(f);
- return success;
+}
+static int get_current_vma_count(void) +{
- char line[1024];
- int count = 0;
- FILE *f;
- f = fopen("/proc/self/maps", "r");
- if (!f)
return -1;
- while (fgets(line, sizeof(line), f)) {
if (!strstr(line, "[vsyscall]"))
count++;
- }
- fclose(f);
- return count;
+}
+static bool is_current_vma_count(const char *msg, int expected) +{
- int current = get_current_vma_count();
- if (current == expected)
return true;
- ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
expected);
- return false;
+}
+static bool is_test_area_mapped(const char *msg) +{
- unsigned long search_start = (unsigned long)TEST_AREA;
- unsigned long search_end = search_start + TEST_AREA_SIZE;
- bool found = false;
- char line[1024];
- FILE *f;
- f = fopen("/proc/self/maps", "r");
- if (!f) {
ksft_print_msg("failed to open /proc/self/maps\n");
return false;
- }
- while (fgets(line, sizeof(line), f)) {
unsigned long start, end;
if (sscanf(line, "%lx-%lx", &start, &end) != 2)
continue;
/* Check for an exact match of the range */
if (start == search_start && end == search_end) {
found = true;
break;
} else if (start > search_end) {
/*
*Since maps are sorted, if we've passed the end, we
* can stop searching.
*/
break;
}
- }
There's helpers for this kind of thing in vm_util.h, let's not duplicate code. Also you can use the proc query ioctl thingy (also helpers for that in vm_util.h) rather than having to spelunk /proc/$pid/maps.
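
Something along these lines, sketched from memory (struct and field names
should be double-checked against include/uapi/linux/fs.h):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* PROCMAP_QUERY, struct procmap_query */

	/*
	 * Ask the kernel which VMA (if any) covers @addr, instead of
	 * parsing /proc/self/maps line by line.
	 */
	static int query_vma(unsigned long addr, unsigned long *start,
			     unsigned long *end)
	{
		struct procmap_query q = {
			.size = sizeof(q),
			.query_addr = addr,
		};
		int fd = open("/proc/self/maps", O_RDONLY);
		int ret;

		if (fd < 0)
			return -1;
		ret = ioctl(fd, PROCMAP_QUERY, &q);
		close(fd);
		if (ret)
			return -1;
		*start = q.vma_start;
		*end = q.vma_end;
		return 0;
	}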
- fclose(f);
- if (found)
return true;
- /* Not found */
- ksft_print_msg(
"%s: TEST_AREA is not mapped as a single contiguous block.\n",
msg);
- print_surrounding_maps(msg);
- return false;
+}
+static void print_surrounding_maps(const char *msg)
Why are you printing a bunch of noise in a test?
+{
- unsigned long search_start = (unsigned long)TEST_AREA;
- unsigned long search_end = search_start + TEST_AREA_SIZE;
- unsigned long start;
- unsigned long end;
- char line[1024] = {};
- int line_idx = 0;
- int first_match_idx = -1;
- int last_match_idx = -1;
- FILE *f;
- f = fopen("/proc/self/maps", "r");
- if (!f)
return;
- if (msg)
ksft_print_msg("%s\n", msg);
- ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
TEST_AREA);
- /* First pass: Read all lines and find the range of matching entries */
- fseek(f, 0, SEEK_SET); /* Rewind file */
- while (fgets(line, sizeof(line), f)) {
if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
line_idx++;
continue;
}
/* Check for any overlap */
if (start < search_end && end > search_start) {
if (first_match_idx == -1)
first_match_idx = line_idx;
last_match_idx = line_idx;
} else if (start > search_end) {
/*
* Since maps are sorted, if we've passed the end, we
* can stop searching.
*/
break;
}
line_idx++;
- }
- if (first_match_idx == -1) {
ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
TEST_AREA);
- } else {
/* Second pass: Print the relevant lines */
fseek(f, 0, SEEK_SET); /* Rewind file */
line_idx = 0;
while (fgets(line, sizeof(line), f)) {
/* Print 2 lines before the first match */
if (line_idx >= first_match_idx - 2 &&
line_idx < first_match_idx)
ksft_print_msg(" %s", line);
/* Print all matching TEST_AREA entries */
if (line_idx >= first_match_idx &&
line_idx <= last_match_idx)
ksft_print_msg(">> %s", line);
/* Print 2 lines after the last match */
if (line_idx > last_match_idx &&
line_idx <= last_match_idx + 2)
ksft_print_msg(" %s", line);
line_idx++;
}
- }
- ksft_print_msg("--------------------------------------------------\n");
- fclose(f);
+} diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 471e539d82b8..3794b50ec280 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -49,6 +49,8 @@ separated by spaces: test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
- madv_populate test memadvise(2) MADV_POPULATE_{READ,WRITE} options
+- max_vma_count
- tests for max vma_count
- memfd_secret test memfd_secret(2)
- process_mrelease
@@ -417,6 +419,9 @@ fi # VADDR64 # vmalloc stability smoke test CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
+# test operations against max vma count limit +CATEGORY="max_vma_count" run_test ./max_vma_count_tests
CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
2.51.0.384.g4c02a37b29-goog
On Thu, Sep 18, 2025 at 7:42 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Sep 15, 2025 at 09:36:33AM -0700, Kalesh Singh wrote:
Add a new selftest to verify that the max VMA count limit is correctly enforced.
This test suite checks that various VMA operations (mmap, mprotect, munmap, mremap) succeed or fail as expected when the number of VMAs is close to the sysctl_max_map_count limit.
The test works by first creating a large number of VMAs to bring the process close to the limit, and then performing various operations that may or may not create new VMAs. The test then verifies that the operations that would exceed the limit fail, and that the operations that do not exceed the limit succeed.
NOTE: munmap is special as it's allowed to temporarily exceed the limit by one for splits as this will decrease back to the limit once the unmap succeeds.
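To make the note above concrete, here is a minimal userspace sketch (my own illustration, not part of the test) of the case where the count only grows temporarily: unmapping the tail page of a 3-page mapping first splits the VMA in two, then removes the tail half, so the process's VMA count returns to its starting value once munmap() completes.

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	/* One 3-page anonymous VMA. */
	char *area = mmap(NULL, 3 * page, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (area == MAP_FAILED)
		return 1;

	/*
	 * Unmapping only the last page needs a 2-way split: the VMA count
	 * briefly becomes N + 1 while the split exists, then drops back to
	 * N when the tail VMA is removed, which is why munmap may go one
	 * past the limit during the operation.
	 */
	return munmap(area + 2 * page, page) == 0 ? 0 : 1;
}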
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Changes in v2:
- Add tests, per Liam (note that the do_brk_flags() path is not easily tested from userspace, so it's not included here). Exceeding the limit there should be uncommon.
tools/testing/selftests/mm/Makefile | 1 + .../selftests/mm/max_vma_count_tests.c | 709 ++++++++++++++++++
If you add a new file, update MAINTAINERS. Should put in MEMORY MAPPING section.
tools/testing/selftests/mm/run_vmtests.sh | 5 +
You're missing a .gitignore.
3 files changed, 715 insertions(+) create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index d13b3cef2a2b..00a4b04eab06 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress TEST_GEN_FILES += uffd-unit-tests TEST_GEN_FILES += uffd-wp-mremap +TEST_GEN_FILES += max_vma_count_tests TEST_GEN_FILES += split_huge_page_test TEST_GEN_FILES += ksm_tests TEST_GEN_FILES += ksm_functional_tests diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c new file mode 100644 index 000000000000..c8401c03425c --- /dev/null +++ b/tools/testing/selftests/mm/max_vma_count_tests.c @@ -0,0 +1,709 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- Copyright 2025 Google LLC
- */
+#define _GNU_SOURCE
+#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <unistd.h> +#include <errno.h> +#include <stdbool.h> +#include <linux/prctl.h> /* Definition of PR_* constants */ +#include <sys/prctl.h>
+#include "../kselftest.h"
+static int get_max_vma_count(void); +static bool set_max_vma_count(int val); +static int get_current_vma_count(void); +static bool is_current_vma_count(const char *msg, int expected); +static bool is_test_area_mapped(const char *msg); +static void print_surrounding_maps(const char *msg);
+/* Globals initialized in test_suite_setup() */ +static int MAX_VMA_COUNT; +static int ORIGINAL_MAX_VMA_COUNT; +static int PAGE_SIZE; +static int GUARD_SIZE; +static int TEST_AREA_SIZE; +static int EXTRA_MAP_SIZE;
+static int MAX_VMA_COUNT;
+static int NR_EXTRA_MAPS;
+static char *TEST_AREA; +static char *EXTRA_MAPS;
+#define DEFAULT_MAX_MAP_COUNT 65530 +#define TEST_AREA_NR_PAGES 3 +/* 1 before test area + 1 after test area + 1 after extra mappings */ +#define NR_GUARDS 3 +#define TEST_AREA_PROT (PROT_NONE) +#define EXTRA_MAP_PROT (PROT_NONE)
+/**
- test_suite_setup - Set up the VMA layout for VMA count testing.
- Sets up the following VMA layout:
- +----- base_addr
- |
- V
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
- | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
- | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
- | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
- +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
I have no idea what a 'guard page' is here. You're not using MADV_GUARD_INSTALL so it's presumably an arbitrary PROT_NONE page or something?
I've changed this to "hole" per David's earlier feedback.
- ^ ^ ^ ^ ^
- | | | | |
- +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
- (PAGE_SIZE) | | reach MAX_VMA_COUNT
| |
+--- TEST_AREA_SIZE ---+
| (3 * PAGE_SIZE) |
^
|
+-- TEST_AREA starts here
Can we make this vertical, it's obv >80 chars and it's kind of hard to read.
Will do.
Also isn't this hugely convoluted? Do we really need to set up such a complicated set of VMAs for every test?
I thought about playing with max_map_count to achieve the needed conditions for each test. But since this is global, lowering it too significantly will introduce issues in other system processes while running the test (if the new max count is lower than the current count of any process).
We only set this up once for all the tests.
- Populates TEST_AREA and other globals required for the tests.
- If successful, the current VMA count will be MAX_VMA_COUNT - 1.
- Return: true on success, false on failure.
- */
+static bool test_suite_setup(void) +{
int initial_vma_count;
size_t reservation_size;
void *base_addr = NULL;
char *ptr = NULL;
ksft_print_msg("Setting up vma_max_count test suite...\n");
/* Initialize globals */
PAGE_SIZE = sysconf(_SC_PAGESIZE);
TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
GUARD_SIZE = PAGE_SIZE;
EXTRA_MAP_SIZE = PAGE_SIZE;
MAX_VMA_COUNT = get_max_vma_count();
Dude we are not COBOL programmers :) What is it with these capitalised variable names?
This is really horrible, please don't do that. And why on earth are you using globals like this.
My intention was to avoid passing a ton of parameters to each test. I guess I could have thread them through some struct as well :)
(Again) use the self test harness please :)
Let me rework it with the harness ...
MAX_VMA_COUNT = get_max_vma_count();
if (MAX_VMA_COUNT < 0) {
ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
return false;
}
/*
* If the current limit is higher than the kernel default,
* we attempt to lower it to the default to ensure the test
* can run with a reliably known boundary.
*/
ORIGINAL_MAX_VMA_COUNT = 0;
if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n",
DEFAULT_MAX_MAP_COUNT);
Probably partly a product of you not using the harness, but you're violating the max chars per line all over the place.
checkpatch.pl kinda has a fit with this file, can we just try and keep it to ~80 chars please?
ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
return false;
}
/* Update MAX_VMA_COUNT for the test run */
MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
}
initial_vma_count = get_current_vma_count();
if (initial_vma_count < 0) {
ksft_print_msg("Failed to read /proc/self/maps\n");
return false;
}
/*
* Calculate how many extra mappings we need to create to reach
* MAX_VMA_COUNT - 1 (excluding test area).
*/
NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
if (NR_EXTRA_MAPS < 1) {
ksft_print_msg("Not enough available maps to run test\n");
ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
MAX_VMA_COUNT, initial_vma_count);
return false;
}
/*
* Reserve space for:
* - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
* - The test area itself (TEST_AREA_NR_PAGES)
* - The guard pages (NR_GUARDS)
*/
reservation_size = ((NR_EXTRA_MAPS * 2) +
TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
base_addr = mmap(NULL, reservation_size, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (base_addr == MAP_FAILED) {
ksft_print_msg("Failed tommap initial reservation\n");
return false;
}
if (munmap(base_addr, reservation_size) == -1) {
ksft_print_msg("Failed to munmap initial reservation\n");
return false;
}
/* Get the addr of the test area */
TEST_AREA = (char *)base_addr + GUARD_SIZE;
/*
* Get the addr of the region for extra mappings:
* test area + 1 guard.
*/
EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
/* Create single-page mappings separated by unmapped pages */
ptr = EXTRA_MAPS;
for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
-1, 0) == MAP_FAILED) {
perror("mmap in fill loop");
ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
NR_EXTRA_MAPS);
return false;
}
/* Advance pointer by 2 to leave a gap */
ptr += (2 * EXTRA_MAP_SIZE);
}
if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
return false;
ksft_print_msg("vma_max_count test suite setup done.\n");
return true;
+}
I mean it'll be fixed by the test harness, but even in test code let's please not write MASSIVE functions that do a million things. It's not really readable.
I'll try breaking this up in the next revision.
+static void test_suite_teardown(void) +{
if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
ksft_print_msg("Failed to restore max_map_count to %d\n",
ORIGINAL_MAX_VMA_COUNT);
}
+}
+/* --- Test Helper Functions --- */ +static bool mmap_anon(void) +{
void *addr = mmap(NULL, PAGE_SIZE, PROT_READ,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
/*
* Handle cleanup here as the runner doesn't track where this
* mapping is located.
*/
if (addr != MAP_FAILED)
munmap(addr, PAGE_SIZE);
return addr != MAP_FAILED;
+}
+static inline bool __mprotect(char *addr, int size) +{
int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
return mprotect(addr, size, new_prot) == 0;
+}
+static bool mprotect_nosplit(void) +{
return __mprotect(TEST_AREA, TEST_AREA_SIZE);
+}
+static bool mprotect_2way_split(void) +{
return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+static bool mprotect_3way_split(void) +{
return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+static inline bool __munmap(char *addr, int size) +{
return munmap(addr, size) == 0;
+}
+static bool munmap_nosplit(void) +{
return __munmap(TEST_AREA, TEST_AREA_SIZE);
+}
+static bool munmap_2way_split(void) +{
return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+static bool munmap_3way_split(void) +{
return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+/* mremap accounts for the worst case to fail early */ +static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
+static bool mremap_dontunmap(void) +{
void *new_addr;
/*
* Using MREMAP_DONTUNMAP will create a new mapping without
* removing the old one, consuming one VMA slot.
*/
new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
if (new_addr != MAP_FAILED)
munmap(new_addr, TEST_AREA_SIZE);
return new_addr != MAP_FAILED;
+}
+struct test {
const char *name;
bool (*test)(void);
/* How many VMA slots below the limit this test needs to start? */
int vma_slots_needed;
bool expect_success;
+};
+/* --- Test Cases --- */ +struct test tests[] = {
{
.name = "mmap_at_1_below_vma_count_limit",
.test = mmap_anon,
.vma_slots_needed = 1,
.expect_success = true,
},
{
.name = "mmap_at_vma_count_limit",
.test = mmap_anon,
.vma_slots_needed = 0,
.expect_success = false,
},
{
.name = "mprotect_nosplit_at_1_below_vma_count_limit",
.test = mprotect_nosplit,
.vma_slots_needed = 1,
.expect_success = true,
},
{
.name = "mprotect_nosplit_at_vma_count_limit",
.test = mprotect_nosplit,
.vma_slots_needed = 0,
.expect_success = true,
},
{
.name = "mprotect_2way_split_at_1_below_vma_count_limit",
.test = mprotect_2way_split,
.vma_slots_needed = 1,
.expect_success = true,
},
{
.name = "mprotect_2way_split_at_vma_count_limit",
.test = mprotect_2way_split,
.vma_slots_needed = 0,
.expect_success = false,
},
{
.name = "mprotect_3way_split_at_2_below_vma_count_limit",
.test = mprotect_3way_split,
.vma_slots_needed = 2,
.expect_success = true,
},
{
.name = "mprotect_3way_split_at_1_below_vma_count_limit",
.test = mprotect_3way_split,
.vma_slots_needed = 1,
.expect_success = false,
},
{
.name = "mprotect_3way_split_at_vma_count_limit",
.test = mprotect_3way_split,
.vma_slots_needed = 0,
.expect_success = false,
},
{
.name = "munmap_nosplit_at_1_below_vma_count_limit",
.test = munmap_nosplit,
.vma_slots_needed = 1,
.expect_success = true,
},
{
.name = "munmap_nosplit_at_vma_count_limit",
.test = munmap_nosplit,
.vma_slots_needed = 0,
.expect_success = true,
},
{
.name = "munmap_2way_split_at_1_below_vma_count_limit",
.test = munmap_2way_split,
.vma_slots_needed = 1,
.expect_success = true,
},
{
.name = "munmap_2way_split_at_vma_count_limit",
.test = munmap_2way_split,
.vma_slots_needed = 0,
.expect_success = true,
},
{
.name = "munmap_3way_split_at_2_below_vma_count_limit",
.test = munmap_3way_split,
.vma_slots_needed = 2,
.expect_success = true,
},
{
.name = "munmap_3way_split_at_1_below_vma_count_limit",
.test = munmap_3way_split,
.vma_slots_needed = 1,
.expect_success = true,
},
{
.name = "munmap_3way_split_at_vma_count_limit",
.test = munmap_3way_split,
.vma_slots_needed = 0,
.expect_success = false,
},
{
.name = "mremap_dontunmap_at_required_vma_count_capcity",
.test = mremap_dontunmap,
.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
.expect_success = true,
},
{
.name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
.test = mremap_dontunmap,
.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
.expect_success = false,
},
This is horrible. We don't need to do it this way.
Use the kselftest_harness please. See the guard_regions.c test for an example of how to use it.
Will do, thanks for the reference.
+};
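For illustration, here is a rough sketch (hypothetical, not the actual v3 code) of how one of these cases might look once ported to the kselftest harness, reusing the test_suite_setup()/test_suite_teardown() helpers and globals already defined in this file:

#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

#include "../kselftest_harness.h"

FIXTURE(max_vma_count)
{
	int unused;	/* setup/teardown work on the file-scope globals */
};

FIXTURE_SETUP(max_vma_count)
{
	/* Builds the layout so the process sits at MAX_VMA_COUNT - 1. */
	ASSERT_TRUE(test_suite_setup());
}

FIXTURE_TEARDOWN(max_vma_count)
{
	test_suite_teardown();
}

TEST_F(max_vma_count, mmap_at_1_below_vma_count_limit)
{
	size_t page = getpagesize();
	void *addr;

	/* With exactly one slot left, a fresh anonymous mapping must work. */
	addr = mmap(NULL, page, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	ASSERT_NE(addr, MAP_FAILED);
	munmap(addr, page);
}

TEST_HARNESS_MAIN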
+/* --- Test Runner --- */ +int main(int argc, char **argv) +{
int num_tests = ARRAY_SIZE(tests);
int failed_tests = 0;
ksft_set_plan(num_tests);
if (!test_suite_setup()) {
if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
ksft_exit_skip("max_map_count too high and cannot be lowered\n"
"Please rerun as root.\n");
else
ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
}
for (int i = 0; i < num_tests; i++) {
int maps_to_unmap = tests[i].vma_slots_needed;
const char *name = tests[i].name;
bool test_passed;
errno = 0;
/* 1. Setup: TEST_AREA mapping */
if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
== MAP_FAILED) {
ksft_test_result_fail(
"%s: Test setup failed to map TEST_AREA\n",
name);
maps_to_unmap = 0;
goto fail;
}
/* Label TEST_AREA to ease debugging */
if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
TEST_AREA_SIZE, "TEST_AREA")) {
ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
name);
ksft_print_msg(
"Continuing without named TEST_AREA mapping\n");
}
/* 2. Setup: Adjust VMA count based on test requirements */
if (maps_to_unmap > NR_EXTRA_MAPS) {
ksft_test_result_fail(
"%s: Test setup failed: Invalid VMA slots required %d\n",
name, tests[i].vma_slots_needed);
maps_to_unmap = 0;
goto fail;
}
/* Unmap extra mappings, accounting for the 1-page gap */
for (int j = 0; j < maps_to_unmap; j++)
munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
EXTRA_MAP_SIZE);
/*
* 3. Verify the preconditions.
*
* Sometimes there isn't an easy way to determine the cause
* of the test failure.
* e.g. an mprotect ENOMEM may be due to trying to protect
* unmapped area or due to hitting MAX_VMA_COUNT limit.
*
* We verify the preconditions of the test to ensure any
* expected failures are from the expected cause and not
* coincidental.
*/
if (!is_current_vma_count(name,
MAX_VMA_COUNT - tests[i].vma_slots_needed))
goto fail;
if (!is_test_area_mapped(name))
goto fail;
/* 4. Run the test */
test_passed = (tests[i].test() == tests[i].expect_success);
if (test_passed) {
ksft_test_result_pass("%s\n", name);
} else {
+fail:
failed_tests++;
ksft_test_result_fail(
"%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
name, get_current_vma_count(), MAX_VMA_COUNT,
errno, strerror(errno));
print_surrounding_maps(name);
}
/* 5. Teardown: Unmap TEST_AREA. */
munmap(TEST_AREA, TEST_AREA_SIZE);
/* 6. Teardown: Restore extra mappings to test suite baseline */
for (int j = 0; j < maps_to_unmap; j++) {
/* Remap extra mappings, accounting for the gap */
mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
-1, 0);
}
}
test_suite_teardown();
if (failed_tests > 0)
ksft_exit_fail();
else
ksft_exit_pass();
+}
+/* --- Utilities --- */
+static int get_max_vma_count(void) +{
int max_count;
FILE *f;
f = fopen("/proc/sys/vm/max_map_count", "r");
if (!f)
return -1;
if (fscanf(f, "%d", &max_count) != 1)
max_count = -1;
fclose(f);
return max_count;
+}
+static bool set_max_vma_count(int val) +{
FILE *f;
bool success = false;
f = fopen("/proc/sys/vm/max_map_count", "w");
if (!f)
return false;
if (fprintf(f, "%d", val) > 0)
success = true;
fclose(f);
return success;
+}
+static int get_current_vma_count(void) +{
char line[1024];
int count = 0;
FILE *f;
f = fopen("/proc/self/maps", "r");
if (!f)
return -1;
while (fgets(line, sizeof(line), f)) {
if (!strstr(line, "[vsyscall]"))
count++;
}
fclose(f);
return count;
+}
+static bool is_current_vma_count(const char *msg, int expected) +{
int current = get_current_vma_count();
if (current == expected)
return true;
ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
expected);
return false;
+}
+static bool is_test_area_mapped(const char *msg) +{
unsigned long search_start = (unsigned long)TEST_AREA;
unsigned long search_end = search_start + TEST_AREA_SIZE;
bool found = false;
char line[1024];
FILE *f;
f = fopen("/proc/self/maps", "r");
if (!f) {
ksft_print_msg("failed to open /proc/self/maps\n");
return false;
}
while (fgets(line, sizeof(line), f)) {
unsigned long start, end;
if (sscanf(line, "%lx-%lx", &start, &end) != 2)
continue;
/* Check for an exact match of the range */
if (start == search_start && end == search_end) {
found = true;
break;
} else if (start > search_end) {
/*
* Since maps are sorted, if we've passed the end, we
* can stop searching.
*/
break;
}
}
There's helpers for this kind of thing in vm_util.h, let's not duplicate code. Also you can use the proc query ioctl thingy (also helpers for that in vm_util.h) rather than having to spelunk /proc/$pid/maps.
Thanks I'll use the helpers there instead.
fclose(f);
if (found)
return true;
/* Not found */
ksft_print_msg(
"%s: TEST_AREA is not mapped as a single contiguous block.\n",
msg);
print_surrounding_maps(msg);
return false;
+}
+static void print_surrounding_maps(const char *msg)
Why are you printing a bunch of noise in a test?
This is only printed in the failure case, it makes debugging much easier. I can update it to a TH_LOG() instead once it's moved to the test harness.
Thanks, Kalesh
+{
unsigned long search_start = (unsigned long)TEST_AREA;
unsigned long search_end = search_start + TEST_AREA_SIZE;
unsigned long start;
unsigned long end;
char line[1024] = {};
int line_idx = 0;
int first_match_idx = -1;
int last_match_idx = -1;
FILE *f;
f = fopen("/proc/self/maps", "r");
if (!f)
return;
if (msg)
ksft_print_msg("%s\n", msg);
ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
TEST_AREA);
/* First pass: Read all lines and find the range of matching entries */
fseek(f, 0, SEEK_SET); /* Rewind file */
while (fgets(line, sizeof(line), f)) {
if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
line_idx++;
continue;
}
/* Check for any overlap */
if (start < search_end && end > search_start) {
if (first_match_idx == -1)
first_match_idx = line_idx;
last_match_idx = line_idx;
} else if (start > search_end) {
/*
* Since maps are sorted, if we've passed the end, we
* can stop searching.
*/
break;
}
line_idx++;
}
if (first_match_idx == -1) {
ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
TEST_AREA);
} else {
/* Second pass: Print the relevant lines */
fseek(f, 0, SEEK_SET); /* Rewind file */
line_idx = 0;
while (fgets(line, sizeof(line), f)) {
/* Print 2 lines before the first match */
if (line_idx >= first_match_idx - 2 &&
line_idx < first_match_idx)
ksft_print_msg(" %s", line);
/* Print all matching TEST_AREA entries */
if (line_idx >= first_match_idx &&
line_idx <= last_match_idx)
ksft_print_msg(">> %s", line);
/* Print 2 lines after the last match */
if (line_idx > last_match_idx &&
line_idx <= last_match_idx + 2)
ksft_print_msg(" %s", line);
line_idx++;
}
}
ksft_print_msg("--------------------------------------------------\n");
fclose(f);
+}
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 471e539d82b8..3794b50ec280 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -49,6 +49,8 @@ separated by spaces:
	test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
- madv_populate test memadvise(2) MADV_POPULATE_{READ,WRITE} options
+- max_vma_count
tests for max vma_count
- memfd_secret test memfd_secret(2)
- process_mrelease
@@ -417,6 +419,9 @@ fi # VADDR64
 # vmalloc stability smoke test
 CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke

+# test operations against max vma count limit
+CATEGORY="max_vma_count" run_test ./max_vma_count_tests
CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
2.51.0.384.g4c02a37b29-goog
The checks against sysctl_max_map_count are open-coded in multiple places. While simple checks are manageable, the logic in places like mremap.c involves arithmetic with magic numbers that can be difficult to reason about. e.g. ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper, vma_count_remaining(). This function returns the VMA count headroom available for a givine process.
The most common case of checking for a single new VMA can be done with the convenience helper has_vma_count_remaining():
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the required capacity directly:
if (vma_count_remaining(mm) < 4)
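For reference, the old and new forms line up as follows (a worked equivalence derived from the checks touched by this patch, shown only for illustration):

/*
 *   map_count >= sysctl_max_map_count - 3          (old prep_move_vma check)
 *   <=>  sysctl_max_map_count - map_count <= 3
 *   <=>  vma_count_remaining(mm) < 4
 *
 *   (map_count + 2) >= sysctl_max_map_count - 3    (old check_mremap_params check)
 *   <=>  sysctl_max_map_count - map_count <= 5
 *   <=>  vma_count_remaining(mm) < 6
 */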
While a capacity-based function could be misused (e.g., with an incorrect '<' vs '<=' comparison), the improved readability at the call sites makes such errors less likely than with the previous open-coded arithmetic.
As part of this change, sysctl_max_map_count is made static to mm/mmap.c to improve encapsulation.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com ---
Changes in v2: - Fix documentation comment for vma_count_remaining(), per Mike - Remove extern in header, per Mike and Pedro - Move declaration to mm/internal.h, per Mike - Replace exceeds_max_map_count() with capacity-based vma_count_remaining(), per Lorenzo. - Fix tools/testing/vma, per Lorenzo.
include/linux/mm.h | 2 -- mm/internal.h | 2 ++ mm/mmap.c | 21 ++++++++++++++++++++- mm/mremap.c | 7 ++++--- mm/nommu.c | 2 +- mm/util.c | 1 - mm/vma.c | 10 +++++----- tools/testing/vma/vma_internal.h | 9 +++++++++ 8 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 1ae97a0b8ec7..138bab2988f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page) #define MAPCOUNT_ELF_CORE_MARGIN (5) #define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-extern int sysctl_max_map_count; - extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc03..39f1c9535ae5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
+int vma_count_remaining(const struct mm_struct *mm); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index e5370e7fcd8f..af88ce1fbb5f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW;
/* Too many mappings? */ - if (mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(mm)) return -ENOMEM;
 	/*
@@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
 int sysctl_legacy_va_layout;
 #endif

+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+
+/**
+ * vma_count_remaining - Determine available VMA slots
+ * @mm: The memory descriptor for the process.
+ *
+ * Check how many more VMAs can be created for the given @mm
+ * before hitting the sysctl_max_map_count limit.
+ *
+ * Return: The number of new VMAs the process can accommodate.
+ */
+int vma_count_remaining(const struct mm_struct *mm)
+{
+	const int map_count = mm->map_count;
+	const int max_count = sysctl_max_map_count;
+
+	return (max_count > map_count) ? (max_count - map_count) : 0;
+}
+
 static const struct ctl_table mmap_table[] = {
 	{
 		.procname = "max_map_count",
diff --git a/mm/mremap.c b/mm/mremap.c
index 35de0a7b910e..14d35d87e89b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (current->mm->map_count >= sysctl_max_map_count - 3)
+	if (vma_count_remaining(current->mm) < 4)
 		return -ENOMEM;
 	if (vma->vm_ops && vma->vm_ops->may_split) {
@@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	 * split in 3 before unmapping it.
 	 * That means 2 more maps (1 for each) to the ones we already hold.
 	 * Check whether current map count plus 2 still leads us to 4 maps below
-	 * the threshold, otherwise return -ENOMEM here to be more safe.
+	 * the threshold. In other words, is the current map count + 6 at or
+	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
 	 */
-	if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
+	if (vma_count_remaining(current->mm) < 6)
 		return -ENOMEM;
return 0; diff --git a/mm/nommu.c b/mm/nommu.c index 8b819fafd57b..dd75f2334812 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return -ENOMEM;
mm = vma->vm_mm; - if (mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(mm)) return -ENOMEM;
region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL); diff --git a/mm/util.c b/mm/util.c index f814e6a59ab1..b6e83922cafe 100644 --- a/mm/util.c +++ b/mm/util.c @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy); int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; static int sysctl_overcommit_ratio __read_mostly = 50; static unsigned long sysctl_overcommit_kbytes __read_mostly; -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */ unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
diff --git a/mm/vma.c b/mm/vma.c index 033a388bc4b1..df0e8409f63d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, }
/* - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it - * has already been checked or doesn't make sense to fail. + * __split_vma() bypasses vma_count_remaining() checks. We use this where + * it has already been checked or doesn't make sense to fail. * VMA Iterator will point to the original VMA. */ static __must_check int @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) { - if (vma->vm_mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(vma->vm_mm)) return -ENOMEM;
return __split_vma(vmi, vma, addr, new_below); @@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end && - vms->vma->vm_mm->map_count >= sysctl_max_map_count) { + !vma_count_remaining(vms->vma->vm_mm)) { error = -ENOMEM; goto map_count_exceeded; } @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM;
- if (mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(mm)) return -ENOMEM;
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT)) diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3639aa8dd2b0..52cd7ddc73f4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi return vm_flags; }
+/* Helper to get VMA count capacity */ +static int vma_count_remaining(const struct mm_struct *mm) +{ + const int map_count = mm->map_count; + const int max_count = sysctl_max_map_count; + + return (max_count > map_count) ? (max_count - map_count) : 0; +} + #endif /* __MM_VMA_INTERNAL_H */
On 15.09.25 18:36, Kalesh Singh wrote:
The checks against sysctl_max_map_count are open-coded in multiple places. While simple checks are manageable, the logic in places like mremap.c involves arithmetic with magic numbers that can be difficult to reason about. e.g. ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper, vma_count_remaining(). This function returns the VMA count headroom available for a givine process.
s/givine/given/
s/process/mm/
The most common case of checking for a single new VMA can be done with the convenience helper has_vma_count_remaining():
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the required capacity directly:
if (vma_count_remaining(mm) < 4)
While a capacity-based function could be misused (e.g., with an incorrect '<' vs '<=' comparison), the improved readability at the call sites makes such errors less likely than with the previous open-coded arithmetic.
As part of this change, sysctl_max_map_count is made static to mm/mmap.c to improve encapsulation.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
[...]
/* @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping( int sysctl_legacy_va_layout; #endif +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+/**
- vma_count_remaining - Determine available VMA slots
- @mm: The memory descriptor for the process.
- Check how many more VMAs can be created for the given @mm
- before hitting the sysctl_max_map_count limit.
- Return: The number of new VMAs the process can accommodate.
- */
+int vma_count_remaining(const struct mm_struct *mm) +{
- const int map_count = mm->map_count;
- const int max_count = sysctl_max_map_count;
If we worry about rare races (sysctl_max_map_count changing?) we should probably force a single read through READ_ONCE()?
Otherwise one might trick vma_count_remaining() into returning a negative number I assume.
- return (max_count > map_count) ? (max_count - map_count) : 0;
+}
Nothing else jumped at me.
Not sure what the buildbot complains about but I'm sure you'll figure it out :)
On Wed, Sep 17, 2025 at 6:38 AM David Hildenbrand david@redhat.com wrote:
On 15.09.25 18:36, Kalesh Singh wrote:
The checks against sysctl_max_map_count are open-coded in multiple places. While simple checks are manageable, the logic in places like mremap.c involves arithmetic with magic numbers that can be difficult to reason about. e.g. ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper, vma_count_remaining(). This function returns the VMA count headroom available for a givine process.
s/givine/given/
s/process/mm/
The most common case of checking for a single new VMA can be done with the convenience helper has_vma_count_remaining():
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the required capacity directly:
if (vma_count_remaining(mm) < 4)
While a capacity-based function could be misused (e.g., with an incorrect '<' vs '<=' comparison), the improved readability at the call sites makes such errors less likely than with the previous open-coded arithmetic.
As part of this change, sysctl_max_map_count is made static to mm/mmap.c to improve encapsulation.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
[...]
/*
@@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping( int sysctl_legacy_va_layout; #endif
+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+/**
- vma_count_remaining - Determine available VMA slots
- @mm: The memory descriptor for the process.
- Check how many more VMAs can be created for the given @mm
- before hitting the sysctl_max_map_count limit.
- Return: The number of new VMAs the process can accommodate.
- */
+int vma_count_remaining(const struct mm_struct *mm) +{
const int map_count = mm->map_count;
const int max_count = sysctl_max_map_count;
If we worry about rare races (sysctl_max_map_count changing?) we should probably force a single read through READ_ONCE()?
Otherwise one might trick vma_count_remaining() into returning a negative number I assume.
Agreed, I'll add the READ_ONCE when resending.
Thanks, Kalesh
return (max_count > map_count) ? (max_count - map_count) : 0;
+}
Nothing else jumped at me.
Not sure what the buildbot complains about but I'm sure you'll figure it out :)
-- Cheers
David / dhildenb
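For reference, a minimal sketch of what the READ_ONCE() change agreed on above might look like (my illustration of the suggestion, not the posted v3 code):

int vma_count_remaining(const struct mm_struct *mm)
{
	const int map_count = mm->map_count;
	/* Snapshot the sysctl once so a concurrent write can't skew the math. */
	const int max_count = READ_ONCE(sysctl_max_map_count);

	return (max_count > map_count) ? (max_count - map_count) : 0;
}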
On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
include/linux/mm.h | 2 -- mm/internal.h | 2 ++ mm/mmap.c | 21 ++++++++++++++++++++- mm/mremap.c | 7 ++++--- mm/nommu.c | 2 +- mm/util.c | 1 - mm/vma.c | 10 +++++----- tools/testing/vma/vma_internal.h | 9 +++++++++
Will look into this properly later, but there's a conflict in vma_internal.h atm for mm-new FYI.
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3639aa8dd2b0..52cd7ddc73f4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi return vm_flags; }
+/* Helper to get VMA count capacity */ +static int vma_count_remaining(const struct mm_struct *mm) +{
- const int map_count = mm->map_count;
- const int max_count = sysctl_max_map_count;
- return (max_count > map_count) ? (max_count - map_count) : 0;
+}
#endif /* __MM_VMA_INTERNAL_H */
2.51.0.384.g4c02a37b29-goog
Probably because stuff got added at the end... :)
On Thu, Sep 18, 2025 at 02:20:03PM +0100, Lorenzo Stoakes wrote:
On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
include/linux/mm.h | 2 -- mm/internal.h | 2 ++ mm/mmap.c | 21 ++++++++++++++++++++- mm/mremap.c | 7 ++++--- mm/nommu.c | 2 +- mm/util.c | 1 - mm/vma.c | 10 +++++----- tools/testing/vma/vma_internal.h | 9 +++++++++
Will look into this properly later, but there's a conflict in vma_internal.h atm for mm-new FYI.
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3639aa8dd2b0..52cd7ddc73f4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi return vm_flags; }
+/* Helper to get VMA count capacity */ +static int vma_count_remaining(const struct mm_struct *mm) +{
- const int map_count = mm->map_count;
- const int max_count = sysctl_max_map_count;
- return (max_count > map_count) ? (max_count - map_count) : 0;
+}
#endif /* __MM_VMA_INTERNAL_H */
2.51.0.384.g4c02a37b29-goog
Probably because stuff got added at the end... :)
Oh mm/internal.h too, seems that's my mmap_prepare series. FYI
On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
The checks against sysctl_max_map_count are open-coded in multiple places. While simple checks are manageable, the logic in places like mremap.c involves arithmetic with magic numbers that can be difficult to reason about. e.g. ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper, vma_count_remaining(). This function returns the VMA count headroom available for a givine process.
The most common case of checking for a single new VMA can be done with the convenience helper has_vma_count_remaining():
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the required capacity directly:
if (vma_count_remaining(mm) < 4)
Double space after 4.
While a capacity-based function could be misused (e.g., with an incorrect '<' vs '<=' comparison), the improved readability at the call sites makes such errors less likely than with the previous open-coded arithmetic.
As part of this change, sysctl_max_map_count is made static to mm/mmap.c to improve encapsulation.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Generally logic looks ok, with various stuff fixed below + in commit msg we should be good to go :)
Changes in v2:
- Fix documentation comment for vma_count_remaining(), per Mike
- Remove extern in header, per Mike and Pedro
- Move declaration to mm/internal.h, per Mike
- Replace exceeds_max_map_count() with capacity-based vma_count_remaining(), per Lorenzo.
- Fix tools/testing/vma, per Lorenzo.
include/linux/mm.h | 2 -- mm/internal.h | 2 ++ mm/mmap.c | 21 ++++++++++++++++++++- mm/mremap.c | 7 ++++--- mm/nommu.c | 2 +- mm/util.c | 1 - mm/vma.c | 10 +++++----- tools/testing/vma/vma_internal.h | 9 +++++++++ 8 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 1ae97a0b8ec7..138bab2988f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page) #define MAPCOUNT_ELF_CORE_MARGIN (5) #define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-extern int sysctl_max_map_count;
Nice to get rid of this as a global! :)
extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc03..39f1c9535ae5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
+int vma_count_remaining(const struct mm_struct *mm);
#endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index e5370e7fcd8f..af88ce1fbb5f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW;
/* Too many mappings? */
- if (mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(mm)) return -ENOMEM;
/*
@@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping( int sysctl_legacy_va_layout; #endif
+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+/**
- vma_count_remaining - Determine available VMA slots
- @mm: The memory descriptor for the process.
- Check how many more VMAs can be created for the given @mm
- before hitting the sysctl_max_map_count limit.
- Return: The number of new VMAs the process can accommodate.
- */
+int vma_count_remaining(const struct mm_struct *mm) +{
- const int map_count = mm->map_count;
- const int max_count = sysctl_max_map_count;
David already commented on the READ_ONCE() here, seems wise.
- return (max_count > map_count) ? (max_count - map_count) : 0;
Not a big deal but would prefer:
if (map_count >= max_count) return 0;
return max_count - map_count;
As the ternary here is a bit less clear, and it puts the 'failure' case first and the 'success' case afterwards.
+}
As discussed in reply to the kernel bot, you've accidentally placed this in a CONFIG_SYSCTL block, so need to move it :>)
static const struct ctl_table mmap_table[] = { { .procname = "max_map_count", diff --git a/mm/mremap.c b/mm/mremap.c index 35de0a7b910e..14d35d87e89b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm) * We'd prefer to avoid failure later on in do_munmap: * which may split one vma into three before unmapping. */
- if (current->mm->map_count >= sysctl_max_map_count - 3)
- if (vma_count_remaining(current->mm) < 4) return -ENOMEM;
This is much clearer.
if (vma->vm_ops && vma->vm_ops->may_split) { @@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) * split in 3 before unmapping it. * That means 2 more maps (1 for each) to the ones we already hold. * Check whether current map count plus 2 still leads us to 4 maps below
* the threshold, otherwise return -ENOMEM here to be more safe.
* the threshold. In other words, is the current map count + 6 at or
* below the threshold? Otherwise return -ENOMEM here to be more safe.
*/
- if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
- if (vma_count_remaining(current->mm) < 6) return -ENOMEM;
I hate that we do this silly check here, but the time to revisit it is another series...
return 0; diff --git a/mm/nommu.c b/mm/nommu.c index 8b819fafd57b..dd75f2334812 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return -ENOMEM;
mm = vma->vm_mm;
- if (mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(mm)) return -ENOMEM;
region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c index f814e6a59ab1..b6e83922cafe 100644 --- a/mm/util.c +++ b/mm/util.c @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy); int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; static int sysctl_overcommit_ratio __read_mostly = 50; static unsigned long sysctl_overcommit_kbytes __read_mostly; -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */ unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
diff --git a/mm/vma.c b/mm/vma.c index 033a388bc4b1..df0e8409f63d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, }
/*
- * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
- * has already been checked or doesn't make sense to fail.
+ * __split_vma() bypasses vma_count_remaining() checks. We use this where
+ * it has already been checked or doesn't make sense to fail.
  * VMA Iterator will point to the original VMA.
  */
static __must_check int @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) {
- if (vma->vm_mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(vma->vm_mm)) return -ENOMEM;
return __split_vma(vmi, vma, addr, new_below);
@@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end &&
vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
!vma_count_remaining(vms->vma->vm_mm)) { error = -ENOMEM; goto map_count_exceeded; }
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM;
- if (mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(mm)) return -ENOMEM;
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3639aa8dd2b0..52cd7ddc73f4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi return vm_flags; }
+/* Helper to get VMA count capacity */ +static int vma_count_remaining(const struct mm_struct *mm) +{
- const int map_count = mm->map_count;
- const int max_count = sysctl_max_map_count;
- return (max_count > map_count) ? (max_count - map_count) : 0;
+}
#endif /* __MM_VMA_INTERNAL_H */
2.51.0.384.g4c02a37b29-goog
On Thu, Sep 18, 2025 at 7:32 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
The checks against sysctl_max_map_count are open-coded in multiple places. While simple checks are manageable, the logic in places like mremap.c involves arithmetic with magic numbers that can be difficult to reason about. e.g. ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper, vma_count_remaining(). This function returns the VMA count headroom available for a givine process.
The most common case of checking for a single new VMA can be done with the convenience helper has_vma_count_remaining():
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the required capacity directly:
if (vma_count_remaining(mm) < 4)
Double space after 4.
Will fix in the resend.
While a capacity-based function could be misused (e.g., with an incorrect '<' vs '<=' comparison), the improved readability at the call sites makes such errors less likely than with the previous open-coded arithmetic.
As part of this change, sysctl_max_map_count is made static to mm/mmap.c to improve encapsulation.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Generally logic looks ok, with various stuff fixed below + in commit msg we should be good to go :)
Changes in v2:
- Fix documentation comment for vma_count_remaining(), per Mike
- Remove extern in header, per Mike and Pedro
- Move declaration to mm/internal.h, per Mike
- Replace exceeds_max_map_count() with capacity-based vma_count_remaining(), per Lorenzo.
- Fix tools/testing/vma, per Lorenzo.
include/linux/mm.h | 2 -- mm/internal.h | 2 ++ mm/mmap.c | 21 ++++++++++++++++++++- mm/mremap.c | 7 ++++--- mm/nommu.c | 2 +- mm/util.c | 1 - mm/vma.c | 10 +++++----- tools/testing/vma/vma_internal.h | 9 +++++++++ 8 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 1ae97a0b8ec7..138bab2988f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page) #define MAPCOUNT_ELF_CORE_MARGIN (5) #define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-extern int sysctl_max_map_count;
Nice to get rid of this as a global! :)
extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc03..39f1c9535ae5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
+int vma_count_remaining(const struct mm_struct *mm);
#endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index e5370e7fcd8f..af88ce1fbb5f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW;
/* Too many mappings? */
if (mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(mm)) return -ENOMEM; /*
@@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping( int sysctl_legacy_va_layout; #endif
+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+/**
- vma_count_remaining - Determine available VMA slots
- @mm: The memory descriptor for the process.
- Check how many more VMAs can be created for the given @mm
- before hitting the sysctl_max_map_count limit.
- Return: The number of new VMAs the process can accommodate.
- */
+int vma_count_remaining(const struct mm_struct *mm) +{
const int map_count = mm->map_count;
const int max_count = sysctl_max_map_count;
David already commented on the READ_ONCE() here, seems wise.
return (max_count > map_count) ? (max_count - map_count) : 0;
Not a big deal but would prefer:
if (map_count >= max_count) return 0; return max_count - map_count;
As the ternary here is a bit less clear, and it puts the 'failure' case first and the 'success' case afterwards.
+}
As discussed in reply to the kernel bot, you've accidentally placed this in a CONFIG_SYSCTL block, so need to move it :>)
Yeah I unintentionally placed it there, I've moved this out in a newer version. Thanks.
static const struct ctl_table mmap_table[] = { { .procname = "max_map_count", diff --git a/mm/mremap.c b/mm/mremap.c index 35de0a7b910e..14d35d87e89b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm) * We'd prefer to avoid failure later on in do_munmap: * which may split one vma into three before unmapping. */
if (current->mm->map_count >= sysctl_max_map_count - 3)
if (vma_count_remaining(current->mm) < 4) return -ENOMEM;
This is much clearer.
if (vma->vm_ops && vma->vm_ops->may_split) {
@@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) * split in 3 before unmapping it. * That means 2 more maps (1 for each) to the ones we already hold. * Check whether current map count plus 2 still leads us to 4 maps below
* the threshold, otherwise return -ENOMEM here to be more safe.
* the threshold. In other words, is the current map count + 6 at or
* below the threshold? Otherwise return -ENOMEM here to be more safe.
*/
if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
if (vma_count_remaining(current->mm) < 6) return -ENOMEM;
I hate that we do this silly check here, but the time to revisit it is another series...
Agreed, I also wondered if this check is more conservative than it needs to be in all cases.
-- Kalesh
return 0;
diff --git a/mm/nommu.c b/mm/nommu.c index 8b819fafd57b..dd75f2334812 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return -ENOMEM;
mm = vma->vm_mm;
if (mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(mm)) return -ENOMEM; region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c index f814e6a59ab1..b6e83922cafe 100644 --- a/mm/util.c +++ b/mm/util.c @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy); int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; static int sysctl_overcommit_ratio __read_mostly = 50; static unsigned long sysctl_overcommit_kbytes __read_mostly; -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */ unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
diff --git a/mm/vma.c b/mm/vma.c index 033a388bc4b1..df0e8409f63d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, }
/*
- * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
- * has already been checked or doesn't make sense to fail.
+ * __split_vma() bypasses vma_count_remaining() checks. We use this where
+ * it has already been checked or doesn't make sense to fail.
  * VMA Iterator will point to the original VMA.
  */
static __must_check int @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) {
if (vma->vm_mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(vma->vm_mm)) return -ENOMEM; return __split_vma(vmi, vma, addr, new_below);
@@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end &&
vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
!vma_count_remaining(vms->vma->vm_mm)) { error = -ENOMEM; goto map_count_exceeded; }
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM;
if (mm->map_count >= sysctl_max_map_count)
if (!vma_count_remaining(mm)) return -ENOMEM; if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3639aa8dd2b0..52cd7ddc73f4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi return vm_flags; }
+/* Helper to get VMA count capacity */ +static int vma_count_remaining(const struct mm_struct *mm) +{
const int map_count = mm->map_count;
const int max_count = sysctl_max_map_count;
return (max_count > map_count) ? (max_count - map_count) : 0;
+}
#endif /* __MM_VMA_INTERNAL_H */
2.51.0.384.g4c02a37b29-goog
A mechanical rename of the mm_struct->map_count field to vma_count; no functional change is intended.
The name "map_count" is ambiguous within the memory management subsystem, as it can be confused with the folio/page->_mapcount field, which tracks PTE references.
The new name, vma_count, is more precise as this field has always counted the number of vm_area_structs associated with an mm_struct.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com ---
Changes in v2: - map_count is easily confused with _mapcount rename to vma_count, per David
fs/binfmt_elf.c | 2 +- fs/coredump.c | 2 +- include/linux/mm_types.h | 2 +- kernel/fork.c | 2 +- mm/debug.c | 2 +- mm/mmap.c | 6 +++--- mm/nommu.c | 6 +++--- mm/vma.c | 24 ++++++++++++------------ tools/testing/vma/vma.c | 32 ++++++++++++++++---------------- tools/testing/vma/vma_internal.h | 6 +++--- 10 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 264fba0d44bd..52449dec12cb 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1643,7 +1643,7 @@ static int fill_files_note(struct memelfnote *note, struct coredump_params *cprm data[0] = count; data[1] = PAGE_SIZE; /* - * Count usually is less than mm->map_count, + * Count usually is less than mm->vma_count, * we need to move filenames down. */ n = cprm->vma_count - count; diff --git a/fs/coredump.c b/fs/coredump.c index 60bc9685e149..8881459c53d9 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -1731,7 +1731,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
cprm->vma_data_size = 0; gate_vma = get_gate_vma(mm); - cprm->vma_count = mm->map_count + (gate_vma ? 1 : 0); + cprm->vma_count = mm->vma_count + (gate_vma ? 1 : 0);
cprm->vma_meta = kvmalloc_array(cprm->vma_count, sizeof(*cprm->vma_meta), GFP_KERNEL); if (!cprm->vma_meta) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 08bc2442db93..4343be2f9e85 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1020,7 +1020,7 @@ struct mm_struct { #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif - int map_count; /* number of VMAs */ + int vma_count; /* number of VMAs */
spinlock_t page_table_lock; /* Protects page tables and some * counters diff --git a/kernel/fork.c b/kernel/fork.c index c4ada32598bd..8fcbbf947579 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mmap_init_lock(mm); INIT_LIST_HEAD(&mm->mmlist); mm_pgtables_bytes_init(mm); - mm->map_count = 0; + mm->vma_count = 0; mm->locked_vm = 0; atomic64_set(&mm->pinned_vm, 0); memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); diff --git a/mm/debug.c b/mm/debug.c index b4388f4dcd4d..40fc9425a84a 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -204,7 +204,7 @@ void dump_mm(const struct mm_struct *mm) mm->pgd, atomic_read(&mm->mm_users), atomic_read(&mm->mm_count), mm_pgtables_bytes(mm), - mm->map_count, + mm->vma_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, (u64)atomic64_read(&mm->pinned_vm), mm->data_vm, mm->exec_vm, mm->stack_vm, diff --git a/mm/mmap.c b/mm/mmap.c index af88ce1fbb5f..c6769394a174 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm) vma = vma_next(&vmi); } while (vma && likely(!xa_is_zero(vma)));
- BUG_ON(count != mm->map_count); + BUG_ON(count != mm->vma_count);
trace_exit_mmap(mm); destroy: @@ -1517,7 +1517,7 @@ static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; */ int vma_count_remaining(const struct mm_struct *mm) { - const int map_count = mm->map_count; + const int map_count = mm->vma_count; const int max_count = sysctl_max_map_count;
return (max_count > map_count) ? (max_count - map_count) : 0; @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) */ vma_iter_bulk_store(&vmi, tmp);
- mm->map_count++; + mm->vma_count++;
if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); diff --git a/mm/nommu.c b/mm/nommu.c index dd75f2334812..9ab2e5ca736d 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -576,7 +576,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
static void cleanup_vma_from_mm(struct vm_area_struct *vma) { - vma->vm_mm->map_count--; + vma->vm_mm->vma_count--; /* remove the VMA from the mapping */ if (vma->vm_file) { struct address_space *mapping; @@ -1198,7 +1198,7 @@ unsigned long do_mmap(struct file *file, goto error_just_free;
setup_vma_to_mm(vma, current->mm); - current->mm->map_count++; + current->mm->vma_count++; /* add the VMA to the tree */ vma_iter_store_new(&vmi, vma);
@@ -1366,7 +1366,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, setup_vma_to_mm(vma, mm); setup_vma_to_mm(new, mm); vma_iter_store_new(vmi, new); - mm->map_count++; + mm->vma_count++; return 0;
err_vmi_preallocate: diff --git a/mm/vma.c b/mm/vma.c index df0e8409f63d..64f4e7c867c3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, * (it may either follow vma or precede it). */ vma_iter_store_new(vmi, vp->insert); - mm->map_count++; + mm->vma_count++; }
if (vp->anon_vma) { @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, } if (vp->remove->anon_vma) anon_vma_merge(vp->vma, vp->remove); - mm->map_count--; + mm->vma_count--; mpol_put(vma_policy(vp->remove)); if (!vp->remove2) WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); @@ -683,13 +683,13 @@ void validate_mm(struct mm_struct *mm) } #endif /* Check for a infinite loop */ - if (++i > mm->map_count + 10) { + if (++i > mm->vma_count + 10) { i = -1; break; } } - if (i != mm->map_count) { - pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i); + if (i != mm->vma_count) { + pr_emerg("vma_count %d vma iterator %d\n", mm->vma_count, i); bug = 1; } VM_BUG_ON_MM(bug, mm); @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct mm_struct *mm;
mm = current->mm; - mm->map_count -= vms->vma_count; + mm->vma_count -= vms->vma_count; mm->locked_vm -= vms->locked_vm; if (vms->unlock) mmap_write_downgrade(mm); @@ -1340,14 +1340,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, if (vms->start > vms->vma->vm_start) {
/* - * Make sure that map_count on return from munmap() will + * Make sure that vma_count on return from munmap() will * not exceed its limit; but let map_count go just above * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end && !vma_count_remaining(vms->vma->vm_mm)) { error = -ENOMEM; - goto map_count_exceeded; + goto vma_count_exceeded; }
/* Don't bother splitting the VMA if we can't unmap it anyway */ @@ -1461,7 +1461,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, modify_vma_failed: reattach_vmas(mas_detach); start_split_failed: -map_count_exceeded: +vma_count_exceeded: return error; }
@@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) vma_start_write(vma); vma_iter_store_new(&vmi, vma); vma_link_file(vma); - mm->map_count++; + mm->vma_count++; validate_mm(mm); return 0; } @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) /* Lock the VMA since it is modified after insertion into VMA tree */ vma_start_write(vma); vma_iter_store_new(vmi, vma); - map->mm->map_count++; + map->mm->vma_count++; vma_link_file(vma);
/* @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) goto mas_store_fail;
- mm->map_count++; + mm->vma_count++; validate_mm(mm); out: perf_event_mmap(vma); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..69fa7d14a6c2 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) }
mtree_destroy(&mm->mm_mt); - mm->map_count = 0; + mm->vma_count = 0; return count; }
@@ -500,7 +500,7 @@ static bool test_merge_new(void) INIT_LIST_HEAD(&vma_d->anon_vma_chain); list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain); ASSERT_FALSE(merged); - ASSERT_EQ(mm.map_count, 4); + ASSERT_EQ(mm.vma_count, 4);
/* * Merge BOTH sides. @@ -519,7 +519,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 3); + ASSERT_EQ(mm.vma_count, 3);
/* * Merge to PREVIOUS VMA. @@ -536,7 +536,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 3); + ASSERT_EQ(mm.vma_count, 3);
/* * Merge to NEXT VMA. @@ -555,7 +555,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 6); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 3); + ASSERT_EQ(mm.vma_count, 3);
/* * Merge BOTH sides. @@ -573,7 +573,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* * Merge to NEXT VMA. @@ -591,7 +591,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0xa); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* * Merge BOTH sides. @@ -608,7 +608,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* * Final state. @@ -967,7 +967,7 @@ static bool test_vma_merge_new_with_close(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->vm_ops, &vm_ops); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
cleanup_mm(&mm, &vmi); return true; @@ -1017,7 +1017,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma->vm_pgoff, 2); ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1045,7 +1045,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->vm_pgoff, 2); ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1079,7 +1079,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma->vm_pgoff, 6); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1108,7 +1108,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->vm_pgoff, 0); ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1138,7 +1138,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->vm_pgoff, 0); ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1540,7 +1540,7 @@ static bool test_merge_extend(void) ASSERT_EQ(vma->vm_end, 0x4000); ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
cleanup_mm(&mm, &vmi); return true; @@ -1652,7 +1652,7 @@ static bool test_mmap_region_basic(void) 0x24d, NULL); ASSERT_EQ(addr, 0x24d000);
- ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
for_each_vma(vmi, vma) { if (vma->vm_start == 0x300000) { diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 52cd7ddc73f4..15525b86145d 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -251,7 +251,7 @@ struct mutex {};
struct mm_struct { struct maple_tree mm_mt; - int map_count; /* number of VMAs */ + int vma_count; /* number of VMAs */ unsigned long total_vm; /* Total pages mapped */ unsigned long locked_vm; /* Pages that have PG_mlocked set */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ @@ -1520,10 +1520,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi /* Helper to get VMA count capacity */ static int vma_count_remaining(const struct mm_struct *mm) { - const int map_count = mm->map_count; + const int vma_count = mm->vma_count; const int max_count = sysctl_max_map_count;
- return (max_count > map_count) ? (max_count - map_count) : 0; + return (max_count > vma_count) ? (max_count - vma_count) : 0; }
#endif /* __MM_VMA_INTERNAL_H */
On 15.09.25 18:36, Kalesh Singh wrote:
A mechanical rename of the mm_struct->map_count field to vma_count; no functional change is intended.
The name "map_count" is ambiguous within the memory management subsystem, as it can be confused with the folio/page->_mapcount field, which tracks PTE references.
The new name, vma_count, is more precise as this field has always counted the number of vm_area_structs associated with an mm_struct.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
[...]
+++ b/mm/mmap.c @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm) vma = vma_next(&vmi); } while (vma && likely(!xa_is_zero(vma)));
- BUG_ON(count != mm->map_count);
+ BUG_ON(count != mm->vma_count);
While at it, best to change that to a WARN_ON_ONCE() or even a VM_WARN_ON_ONCE().
[ or remove it -- have we ever seen that firing? ]
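For illustration only (not part of the posted series), the suggested change would be roughly:

    -	BUG_ON(count != mm->vma_count);
    +	VM_WARN_ON_ONCE(count != mm->vma_count);

i.e. report a stale count once instead of killing the machine.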
Acked-by: David Hildenbrand david@redhat.com
On Mon, Sep 15, 2025 at 09:36:35AM -0700, Kalesh Singh wrote:
A mechanical rename of the mm_struct->map_count field to vma_count; no functional change is intended.
The name "map_count" is ambiguous within the memory management subsystem, as it can be confused with the folio/page->_mapcount field, which tracks PTE references.
The new name, vma_count, is more precise as this field has always counted the number of vm_area_structs associated with an mm_struct.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Reviewed-by: Pedro Falcato pfalcato@suse.de
On Mon, Sep 15, 2025 at 09:36:35AM -0700, Kalesh Singh wrote:
A mechanical rename of the mm_struct->map_count field to vma_count; no functional change is intended.
The name "map_count" is ambiguous within the memory management subsystem, as it can be confused with the folio/page->_mapcount field, which tracks PTE references.
The new name, vma_count, is more precise as this field has always counted the number of vm_area_structs associated with an mm_struct.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Yeah this is nice thanks, LGTM so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
One small comment below, but I think that change can wait for another series probably.
[...]
diff --git a/mm/mmap.c b/mm/mmap.c index af88ce1fbb5f..c6769394a174 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm) vma = vma_next(&vmi); } while (vma && likely(!xa_is_zero(vma)));
- BUG_ON(count != mm->map_count);
+ BUG_ON(count != mm->vma_count);
Be nice to switch this to a WARN_ON_ONCE()... think David mentioned it though.
But maybe not one for this series...
[...]
To make VMA counting more robust, prevent direct modification of the mm->vma_count field. This is achieved by making the public-facing member const via a union and requiring all modifications to go through a new set of helper functions that operate on a private __vma_count.
While there are no other invariants tied to vma_count currently, this structural change improves maintainability, as it creates a single, centralized point for any future logic, such as adding debug checks or updating related statistics (in subsequent patches).
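As a rough sketch of the pattern in plain C (illustrative only; the actual patch additionally uses the kernel's __private / ACCESS_PRIVATE sparse annotations):

    struct counter {
            union {
                    const int value;        /* what general code reads */
                    int __value;            /* written only by the helpers */
            };
    };

    static inline void counter_inc(struct counter *c)
    {
            c->__value += 1;        /* writes go through the non-const alias */
    }

    /* c->value = 1; would not compile: assignment to a const member. */

Reads such as "if (mm->vma_count > limit)" keep working unchanged; only direct writes are rejected at build time.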
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com --- include/linux/mm.h | 25 +++++++++++++++++++++++++ include/linux/mm_types.h | 5 ++++- kernel/fork.c | 2 +- mm/mmap.c | 2 +- mm/vma.c | 12 ++++++------ tools/testing/vma/vma.c | 2 +- tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++- 7 files changed, 67 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 138bab2988f8..8bad1454984c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+static inline void vma_count_init(struct mm_struct *mm) +{ + ACCESS_PRIVATE(mm, __vma_count) = 0; +} + +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; +} + +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) +{ + vma_count_add(mm, -nr_vmas); +} + +static inline void vma_count_inc(struct mm_struct *mm) +{ + vma_count_add(mm, 1); +} + +static inline void vma_count_dec(struct mm_struct *mm) +{ + vma_count_sub(mm, 1); +} + #endif /* _LINUX_MM_H */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 4343be2f9e85..2ea8fc722aa2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1020,7 +1020,10 @@ struct mm_struct { #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif - int vma_count; /* number of VMAs */ + union { + const int vma_count; /* number of VMAs */ + int __private __vma_count; + };
spinlock_t page_table_lock; /* Protects page tables and some * counters diff --git a/kernel/fork.c b/kernel/fork.c index 8fcbbf947579..ea9eff416e51 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mmap_init_lock(mm); INIT_LIST_HEAD(&mm->mmlist); mm_pgtables_bytes_init(mm); - mm->vma_count = 0; + vma_count_init(mm); mm->locked_vm = 0; atomic64_set(&mm->pinned_vm, 0); memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); diff --git a/mm/mmap.c b/mm/mmap.c index c6769394a174..30ddd550197e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) */ vma_iter_bulk_store(&vmi, tmp);
- mm->vma_count++; + vma_count_inc(mm);
if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); diff --git a/mm/vma.c b/mm/vma.c index 64f4e7c867c3..0cd3cb472220 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, * (it may either follow vma or precede it). */ vma_iter_store_new(vmi, vp->insert); - mm->vma_count++; + vma_count_inc(mm); }
if (vp->anon_vma) { @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, } if (vp->remove->anon_vma) anon_vma_merge(vp->vma, vp->remove); - mm->vma_count--; + vma_count_dec(mm); mpol_put(vma_policy(vp->remove)); if (!vp->remove2) WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct mm_struct *mm;
mm = current->mm; - mm->vma_count -= vms->vma_count; + vma_count_sub(mm, vms->vma_count); mm->locked_vm -= vms->locked_vm; if (vms->unlock) mmap_write_downgrade(mm); @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) vma_start_write(vma); vma_iter_store_new(&vmi, vma); vma_link_file(vma); - mm->vma_count++; + vma_count_inc(mm); validate_mm(mm); return 0; } @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) /* Lock the VMA since it is modified after insertion into VMA tree */ vma_start_write(vma); vma_iter_store_new(vmi, vma); - map->mm->vma_count++; + vma_count_inc(map->mm); vma_link_file(vma);
/* @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) goto mas_store_fail;
- mm->vma_count++; + vma_count_inc(mm); validate_mm(mm); out: perf_event_mmap(vma); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 69fa7d14a6c2..ee5a1e2365e0 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) }
mtree_destroy(&mm->mm_mt); - mm->vma_count = 0; + vma_count_init(mm); return count; }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 15525b86145d..6e724ba1adf4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -251,7 +251,10 @@ struct mutex {};
struct mm_struct { struct maple_tree mm_mt; - int vma_count; /* number of VMAs */ + union { + const int vma_count; /* number of VMAs */ + int __vma_count; + }; unsigned long total_vm; /* Total pages mapped */ unsigned long locked_vm; /* Pages that have PG_mlocked set */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ @@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm) return (max_count > vma_count) ? (max_count - vma_count) : 0; }
+static inline void vma_count_init(struct mm_struct *mm) +{ + mm->__vma_count = 0; +} + +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + mm->__vma_count += nr_vmas; +} + +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) +{ + vma_count_add(mm, -nr_vmas); +} + +static inline void vma_count_inc(struct mm_struct *mm) +{ + vma_count_add(mm, 1); +} + +static inline void vma_count_dec(struct mm_struct *mm) +{ + vma_count_sub(mm, 1); +} + #endif /* __MM_VMA_INTERNAL_H */
On Mon, Sep 15, 2025 at 09:36:36AM -0700, Kalesh Singh wrote:
To make VMA counting more robust, prevent direct modification of the mm->vma_count field. This is achieved by making the public-facing member const via a union and requiring all modifications to go through a new set of helper functions that operate on a private __vma_count.
While there are no other invariants tied to vma_count currently, this structural change improves maintainability, as it creates a single, centralized point for any future logic, such as adding debug checks or updating related statistics (in subsequent patches).
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Hmmm, I'm not sure about this one.
I think this is a 'we don't need it' situation, and it's making everything a bit uglier.
I especially hate vma_count_add() and vma_count_sub(). You're essentially overriding the whole concept in these cases just to make stuff that's already in place keep working.
I don't think this really adds much honestly.
(You're also clearly missing cases as the kernel bot has found issues)
[...]
On Thu, Sep 18, 2025 at 7:52 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Sep 15, 2025 at 09:36:36AM -0700, Kalesh Singh wrote:
To make VMA counting more robust, prevent direct modification of the mm->vma_count field. This is achieved by making the public-facing member const via a union and requiring all modifications to go through a new set of helper functions that operate on a private __vma_count.
While there are no other invariants tied to vma_count currently, this structural change improves maintainability, as it creates a single, centralized point for any future logic, such as adding debug checks or updating related statistics (in subsequent patches).
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Hmmm, I'm not sure about this one.
I think this is a 'we don't need it' situation, and it's making everything a bit uglier.
I especially hate vma_count_add() and vma_count_sub(). You're essentially overriding the whole concept in these cases just to make stuff that's already in place keep working.
I don't think this really adds much honestly.
Hi Lorenzo,
Thanks for reviewing.
This was done to facilitate adding the assertions. So I'll drop this along with that in the next version.
Thanks, Kalesh
[...]
Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect cases where the VMA count exceeds the sysctl_max_map_count limit.
This check will help catch future bugs or regressions where VMAs are allocated in excess of the limit.
The warning is placed in the main vma_count_*() helpers, while the internal *_nocheck variants bypass it. The _nocheck helpers are used to ensure that the assertion does not trigger a false positive in the legitimate case where munmap() temporarily pushes the VMA count past the limit via a VMA split.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com ---
Changes in v2: - Add assertions for exceeding the max_vma_count limit, per Pedro
include/linux/mm.h | 12 ++++++-- mm/internal.h | 1 - mm/vma.c | 49 +++++++++++++++++++++++++------- tools/testing/vma/vma_internal.h | 7 ++++- 4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bad1454984c..3a3749d7015c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+int vma_count_remaining(const struct mm_struct *mm); + static inline void vma_count_init(struct mm_struct *mm) { ACCESS_PRIVATE(mm, __vma_count) = 0; }
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas) { ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; }
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + VM_WARN_ON_ONCE(!vma_count_remaining(mm)); + __vma_count_add_nocheck(mm, nr_vmas); +} + static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) { - vma_count_add(mm, -nr_vmas); + __vma_count_add_nocheck(mm, -nr_vmas); }
static inline void vma_count_inc(struct mm_struct *mm) diff --git a/mm/internal.h b/mm/internal.h index 39f1c9535ae5..e0567a3b64fa 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1661,6 +1661,5 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
-int vma_count_remaining(const struct mm_struct *mm);
#endif /* __MM_INTERNAL_H */ diff --git a/mm/vma.c b/mm/vma.c index 0cd3cb472220..0e4fcaebe209 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -323,15 +323,17 @@ static void vma_prepare(struct vma_prepare *vp) }
/* - * vma_complete- Helper function for handling the unlocking after altering VMAs, - * or for inserting a VMA. + * This is the internal, unsafe version of vma_complete(). Unlike its + * wrapper, this function bypasses runtime checks for VMA count limits by + * using the _nocheck vma_count* helpers. * - * @vp: The vma_prepare struct - * @vmi: The vma iterator - * @mm: The mm_struct + * Its use is restricted to __split_vma() where the VMA count can be + * temporarily higher than the sysctl_max_map_count limit. + * + * All other callers must use vma_complete(). */ -static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, - struct mm_struct *mm) +static void __vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, + struct mm_struct *mm) { if (vp->file) { if (vp->adj_next) @@ -352,7 +354,11 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, * (it may either follow vma or precede it). */ vma_iter_store_new(vmi, vp->insert); - vma_count_inc(mm); + /* + * Explicitly allow vma_count to exceed the threshold to prevent, + * blocking munmap() freeing resources. + */ + __vma_count_add_nocheck(mm, 1); }
if (vp->anon_vma) { @@ -403,6 +409,26 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, uprobe_mmap(vp->insert); }
+/* + * vma_complete- Helper function for handling the unlocking after altering VMAs, + * or for inserting a VMA. + * + * @vp: The vma_prepare struct + * @vmi: The vma iterator + * @mm: The mm_struct + */ +static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, + struct mm_struct *mm) +{ + /* + * __vma_complete() explicitly foregoes checking the new + * vma_count against the sysctl_max_map_count limit, so + * do it here. + */ + VM_WARN_ON_ONCE(!vma_count_remaining(mm)); + __vma_complete(vp, vmi, mm); +} + /* * init_vma_prep() - Initializer wrapper for vma_prepare struct * @vp: The vma_prepare struct @@ -564,8 +590,11 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, vma->vm_end = addr; }
- /* vma_complete stores the new vma */ - vma_complete(&vp, vmi, vma->vm_mm); + /* + * __vma_complete stores the new vma without checking against the + * sysctl_max_map_count (vma_count) limit. + */ + __vma_complete(&vp, vmi, vma->vm_mm); validate_mm(vma->vm_mm);
/* Success. */ diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 6e724ba1adf4..d084b1eb2a5c 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1534,11 +1534,16 @@ static inline void vma_count_init(struct mm_struct *mm) mm->__vma_count = 0; }
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas) { mm->__vma_count += nr_vmas; }
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + __vma_count_add_nocheck(mm, nr_vmas); +} + static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) { vma_count_add(mm, -nr_vmas);
On 15.09.25 18:36, Kalesh Singh wrote:
Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect cases where the VMA count exceeds the sysctl_max_map_count limit.
This check will help catch future bugs or regressions where the VMAs are allocated exceeding the limit.
The warning is placed in the main vma_count_*() helpers, while the internal *_nocheck variants bypass it. _nocheck helpers are used to ensure that the assertion does not trigger a false positive in the legitimate case of a temporary VMA increase past the limit by a VMA split in munmap().
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Changes in v2:
- Add assertions if exceeding max_vma_count limit, per Pedro
include/linux/mm.h | 12 ++++++-- mm/internal.h | 1 - mm/vma.c | 49 +++++++++++++++++++++++++------- tools/testing/vma/vma_internal.h | 7 ++++- 4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bad1454984c..3a3749d7015c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps) void snapshot_page(struct page_snapshot *ps, const struct page *page); +int vma_count_remaining(const struct mm_struct *mm);
- static inline void vma_count_init(struct mm_struct *mm) { ACCESS_PRIVATE(mm, __vma_count) = 0; }
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas) { ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; } +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{
+ VM_WARN_ON_ONCE(!vma_count_remaining(mm));
Can't that fire when changing the max count from user space at just the wrong time?
I assume we'll have to tolerate that and might just want to drop this patch from the series.
On Wed, Sep 17, 2025 at 6:44 AM David Hildenbrand david@redhat.com wrote:
On 15.09.25 18:36, Kalesh Singh wrote:
Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect cases where the VMA count exceeds the sysctl_max_map_count limit.
This check will help catch future bugs or regressions where the VMAs are allocated exceeding the limit.
The warning is placed in the main vma_count_*() helpers, while the internal *_nocheck variants bypass it. _nocheck helpers are used to ensure that the assertion does not trigger a false positive in the legitimate case of a temporary VMA increase past the limit by a VMA split in munmap().
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Changes in v2:
- Add assertions if exceeding max_vma_count limit, per Pedro
include/linux/mm.h | 12 ++++++-- mm/internal.h | 1 - mm/vma.c | 49 +++++++++++++++++++++++++------- tools/testing/vma/vma_internal.h | 7 ++++- 4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bad1454984c..3a3749d7015c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+int vma_count_remaining(const struct mm_struct *mm);
- static inline void vma_count_init(struct mm_struct *mm) { ACCESS_PRIVATE(mm, __vma_count) = 0; }
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas) { ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; }
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{
VM_WARN_ON_ONCE(!vma_count_remaining(mm));
Can't that fire when changing the max count from user space at just the wrong time?
You are right: technically it's possible if the limit was lowered between the time of checking and when the new VMA is added.
I assume we'll have to tolerated that and might just want to drop this patch from the series.
It is compiled out in !CONFIG_DEBUG_VM builds; would we still want to drop it?
Thanks, Kalesh
-- Cheers
David / dhildenb
Can't that fire when changing the max count from user space at just the wrong time?
You are right: technically it's possible if the limit was lowered between the time of checking and when the new VMA is added.
I assume we'll have to tolerated that and might just want to drop this patch from the series.
It is compiled out in !CONFIG_VM_DEBUG builds, would we still want to drop it?
Due to the racy nature I think any kind of assertion would be wrong. If no such race were possible, I would agree that the checks would be nice to have.
On Wed, Sep 17, 2025 at 11:34 AM David Hildenbrand david@redhat.com wrote:
Can't that fire when changing the max count from user space at just the wrong time?
You are right: technically it's possible if the limit was lowered between the time of checking and when the new VMA is added.
I assume we'll have to tolerated that and might just want to drop this patch from the series.
It is compiled out in !CONFIG_VM_DEBUG builds, would we still want to drop it?
Due to the racy nature I think any kinds of assertions would be wrong. Without any such races possible I would agree that the checks would be nice to have.
Alright I'll drop this in the next revision.
Thanks, Kalesh
-- Cheers
David / dhildenb
On Wed, Sep 17, 2025 at 03:44:27PM +0200, David Hildenbrand wrote:
On 15.09.25 18:36, Kalesh Singh wrote:
Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect cases where the VMA count exceeds the sysctl_max_map_count limit.
This check will help catch future bugs or regressions where the VMAs are allocated exceeding the limit.
The warning is placed in the main vma_count_*() helpers, while the internal *_nocheck variants bypass it. _nocheck helpers are used to ensure that the assertion does not trigger a false positive in the legitimate case of a temporary VMA increase past the limit by a VMA split in munmap().
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
Changes in v2:
- Add assertions if exceeding max_vma_count limit, per Pedro
include/linux/mm.h | 12 ++++++-- mm/internal.h | 1 - mm/vma.c | 49 +++++++++++++++++++++++++------- tools/testing/vma/vma_internal.h | 7 ++++- 4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bad1454984c..3a3749d7015c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps) void snapshot_page(struct page_snapshot *ps, const struct page *page); +int vma_count_remaining(const struct mm_struct *mm);
- static inline void vma_count_init(struct mm_struct *mm) { ACCESS_PRIVATE(mm, __vma_count) = 0; }
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas) { ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; } +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{
- VM_WARN_ON_ONCE(!vma_count_remaining(mm));
Can't that fire when changing the max count from user space at just the wrong time?
I assume we'll have to tolerated that and might just want to drop this patch from the series.
Ah yes, of course, userspace can dynamically change it. Good catch. I guess we'll need to kill the assertion idea then.
I see that the review is generally to drop this one so it's moot, but this is also not applying:
Patch failed at 0006 mm: add assertion for VMA count limit
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
When trying to b4 shazam in mm-new.
The needed observability on in-field devices can be collected with minimal overhead and can be toggled on and off. Event-driven telemetry can be done with tracepoint BPF programs.

The process comm is provided for aggregation across devices, and the tgid enables per-process aggregation on a device.

This allows for observing the distribution of such problems in the field, to deduce whether there are legitimate bugs or whether a bump to the limit is warranted.
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com ---
Changes in v2:
- Add the needed observability for operations failing due to the vma count limit, per Minchan. (Since there isn't a common point for debug logging, the checks being external to the capacity-based vma_count_remaining() helper, I used a trace event for low overhead and to facilitate event-driven telemetry on in-field devices.)
 include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++
 mm/mmap.c                  |  5 ++++-
 mm/mremap.c                | 10 ++++++++--
 mm/vma.c                   | 11 +++++++++--
 4 files changed, 53 insertions(+), 5 deletions(-)
 create mode 100644 include/trace/events/vma.h

diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h
new file mode 100644
index 000000000000..2fed63b0d0a6
--- /dev/null
+++ b/include/trace/events/vma.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vma
+
+#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMA_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(max_vma_count_exceeded,
+
+	TP_PROTO(struct task_struct *task),
+
+	TP_ARGS(task),
+
+	TP_STRUCT__entry(
+		__string(comm, task->comm)
+		__field(pid_t, tgid)
+	),
+
+	TP_fast_assign(
+		__assign_str(comm);
+		__entry->tgid = task->tgid;
+	),
+
+	TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
+
+#endif /* _TRACE_VMA_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/mmap.c b/mm/mmap.c
index 30ddd550197e..0bb311bf48f3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -56,6 +56,7 @@

 #define CREATE_TRACE_POINTS
 #include <trace/events/mmap.h>
+#include <trace/events/vma.h>

 #include "internal.h"

@@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;

 	/* Too many mappings? */
-	if (!vma_count_remaining(mm))
+	if (!vma_count_remaining(mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}

 	/*
 	 * addr is returned from get_unmapped_area,
diff --git a/mm/mremap.c b/mm/mremap.c
index 14d35d87e89b..f42ac05f0069 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -30,6 +30,8 @@
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>

+#include <trace/events/vma.h>
+
 #include "internal.h"

 /* Classify the kind of remap operation being performed. */
@@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (vma_count_remaining(current->mm) < 4)
+	if (vma_count_remaining(current->mm) < 4) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}

 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		if (vma->vm_start != old_addr)
@@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	 * the threshold. In other words, is the current map count + 6 at or
 	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
 	 */
-	if (vma_count_remaining(current->mm) < 6)
+	if (vma_count_remaining(current->mm) < 6) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}

 	return 0;
 }
diff --git a/mm/vma.c b/mm/vma.c
index 0e4fcaebe209..692c33c3e84d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,8 @@
 #include "vma_internal.h"
 #include "vma.h"

+#include <trace/events/vma.h>
+
 struct mmap_state {
 	struct mm_struct *mm;
 	struct vma_iterator *vmi;
@@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		     unsigned long addr, int new_below)
 {
-	if (!vma_count_remaining(vma->vm_mm))
+	if (!vma_count_remaining(vma->vm_mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}

 	return __split_vma(vmi, vma, addr, new_below);
 }
@@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	 */
 	if (vms->end < vms->vma->vm_end &&
 	    !vma_count_remaining(vms->vma->vm_mm)) {
+		trace_max_vma_count_exceeded(current);
 		error = -ENOMEM;
 		goto vma_count_exceeded;
 	}
@@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;

-	if (!vma_count_remaining(mm))
+	if (!vma_count_remaining(mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}

 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
On Mon, 15 Sep 2025 09:36:38 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Needed observability on in field devices can be collected with minimal overhead and can be toggled on and off. Event driven telemetry can be done with tracepoint BPF programs.
The process comm is provided for aggregation across devices and tgid is to enable per-process aggregation per device.
What do you mean about comm being used for aggregation across devices? What's special about this trace event that will make it used across devices?
Note, if BPF is being used, can't the BPF program just add the current comm? Why waste space in the ring buffer for it?
+TRACE_EVENT(max_vma_count_exceeded,
- TP_PROTO(struct task_struct *task),
Why pass in the task if it's always going to be current?
- TP_ARGS(task),
- TP_STRUCT__entry(
__string(comm, task->comm)
This could be:
__string(comm, current)
But I still want to know what makes this trace event special over other trace events to store this, and can't it be retrieved another way, especially if BPF is being used to hook to it?
-- Steve
__field(pid_t, tgid)
- ),
- TP_fast_assign(
__assign_str(comm);
__entry->tgid = task->tgid;
- ),
- TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
On Mon, Sep 15, 2025 at 4:41 PM Steven Rostedt rostedt@goodmis.org wrote:
On Mon, 15 Sep 2025 09:36:38 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Needed observability on in field devices can be collected with minimal overhead and can be toggled on and off. Event driven telemetry can be done with tracepoint BPF programs.
The process comm is provided for aggregation across devices and tgid is to enable per-process aggregation per device.
What do you mean about comm being used to aggregation across devices? What's special about this trace event that will make it used across devices?
Note, if BPF is being used, can't the BPF program just add the current comm? Why waste space in the ring buffer for it?
+TRACE_EVENT(max_vma_count_exceeded,
TP_PROTO(struct task_struct *task),
Why pass in the task if it's always going to be current?
TP_ARGS(task),
TP_STRUCT__entry(
__string(comm, task->comm)
This could be:
__string(comm, current)
But I still want to know what makes this trace event special over other trace events to store this, and can't it be retrieved another way, especially if BPF is being used to hook to it?
Hi Steve,
Thanks for the comments and suggestion. You are right, we can use BPF to get the comm. There is nothing special about this trace event. I will drop comm in the next revision.
The reason I passed the task_struct parameter (current) is that I believe there is a limitation that we must specify at least one parameter to the TRACE_EVENT() PROTO and ARGS macros.
Is there some way to use this without needing a parameter?
I hit the build failure below, with no parameters:
In file included from mm/vma.c:10:
./include/trace/events/vma.h:10:1: error: expected parameter declarator
   10 | TRACE_EVENT(max_vma_count_exceeded,
      | ^
...
Below is the code for reference:
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM vma

#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VMA_H

#include <linux/tracepoint.h>

TRACE_EVENT(max_vma_count_exceeded,

	TP_PROTO(),

	TP_ARGS(),

	TP_STRUCT__entry(
		__field(pid_t, tgid)
	),

	TP_fast_assign(
		__entry->tgid = current->tgid;
	),

	TP_printk("tgid=%d", __entry->tgid)
);

#endif /* _TRACE_VMA_H */

/* This part must be outside protection */
#include <trace/define_trace.h>
Thanks, Kalesh
-- Steve
__field(pid_t, tgid)
),
TP_fast_assign(
__assign_str(comm);
__entry->tgid = task->tgid;
),
TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
On Mon, 15 Sep 2025 18:19:53 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Hi Steve,
Thanks for the comments and suggestion you are right we can use bpf to get the comm. There is nothing special about this trace event. I will drop comm in the next revision.
The reason I did the task_struct parameter (current): I believe there is a limitation that we must specify at least 1 parameter to the TRACE_EVENT() PROTO and ARGS macros.
OK, then this is another issue. We don't want tracepoint "markers". Each tracepoint can take up to 5K of memory due to the code it generates and the metadata to control it.
For something like that, we highly recommend dynamic probes (fprobes, kprobes, etc).
The only purpose of a static tracepoint is to get data within a function that is too difficult to get via a probe. It should never be used as a trigger where its purpose is "we hit this path".
-- Steve
On Tue, Sep 16, 2025 at 8:52 AM Steven Rostedt rostedt@goodmis.org wrote:
On Mon, 15 Sep 2025 18:19:53 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Hi Steve,
Thanks for the comments and suggestion you are right we can use bpf to get the comm. There is nothing special about this trace event. I will drop comm in the next revision.
The reason I did the task_struct parameter (current): I believe there is a limitation that we must specify at least 1 parameter to the TRACE_EVENT() PROTO and ARGS macros.
OK, then this is another issue. We don't want tracepoint "markers". Each tracepoint can take up to 5K in memory due to the code it generates and the meta data to control it.
For something like that, we highly recommend dynamic probes (fprobes, kprobes, etc).
The only purpose of a static tracepoint is to get data within a function that is too difficult to get via a probe. It should never be used as a trigger where its purpose is "we hit this path".
Hi Steve,
I completely agree with the principle that static tracepoints shouldn't be used as markers if a dynamic probe will suffice. The intent here is to avoid introducing overhead in the common case, so as not to regress mmap, munmap, and other syscall latencies, while still providing observability for the max vma_count exceeded failure condition.

The original centralized check (before previous review rounds) was indeed in a dedicated function, exceeds_max_map_count(), where a kprobe/fprobe could have been easily attached without impacting the common path. This was changed, per previous review feedback, to the capacity-based vma_count_remaining(), which necessitated the check being done externally by the callers:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
Would you be ok with something like:
trace_max_vma_count_exceeded(mm);
TP_STRUCT__entry(
	__field(unsigned int, mm_id)
	__field(unsigned int, vma_count)
)

mm_id would be a hash of the mm_struct pointer, similar to rss_stat, and vma_count is the current VMA count (some syscalls have different requirements on the remaining capacity: mremap requires 6 available slots, other syscalls require 1).
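As a sketch, the reworked event could look like the following; the hash_long() call and the way vma_count is derived from vma_count_remaining() are illustrative assumptions, not part of the posted patch:

TRACE_EVENT(max_vma_count_exceeded,

	TP_PROTO(struct mm_struct *mm),

	TP_ARGS(mm),

	TP_STRUCT__entry(
		__field(unsigned int, mm_id)
		__field(int, vma_count)
	),

	TP_fast_assign(
		/* Hash the pointer rather than exposing the raw value, as rss_stat does. */
		__entry->mm_id = (unsigned int)hash_long((unsigned long)mm, 32);
		/* Derive the current count from the capacity helper added earlier in the series. */
		__entry->vma_count = sysctl_max_map_count - vma_count_remaining(mm);
	),

	TP_printk("mm_id=%u vma_count=%d", __entry->mm_id, __entry->vma_count)
);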
Thanks, Kalesh
-- Steve
On Tue, 16 Sep 2025 10:36:57 -0700 Kalesh Singh kaleshsingh@google.com wrote:
I completely agree with the principle that static tracepoints shouldn't be used as markers if a dynamic probe will suffice. The intent here is to avoid introducing overhead in the common case to avoid regressing mmap, munmap, and other syscall latencies; while still providing observability for the max vma_count exceeded failure condition.
The original centralized check (before previous review rounds) was indeed in a dedicated function, exceeds_max_map_count(), where a kprobe/fprobe could have been easily attached without impacting the common path. This was changed due to previous review feedback to the capacity based vma_count_remaining() which necessitated the check to be done externally by the callers:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
Would you be ok with something like:
trace_max_vma_count_exceeded(mm);
TP_STRUCT__entry( __field(unsigned int, mm_id) __field(unsigned int vma_count) )
mm_id would be the hash of the mm_struct ptr similar to rss_stat and the vma_count is the current vma count (some syscalls have different requirements on the capacity remaining: mremap requires 6 available slots, other syscalls require 1).
BTW, why the hash of the mm pointer and not the pointer itself? We save pointers in lots of places, and if it is the pointer, you could use an eprobe to attach to the trace event to dereference its fields.
-- Steve
On Tue, Sep 16, 2025 at 10:47 AM Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 16 Sep 2025 10:36:57 -0700 Kalesh Singh kaleshsingh@google.com wrote:
I completely agree with the principle that static tracepoints shouldn't be used as markers if a dynamic probe will suffice. The intent here is to avoid introducing overhead in the common case to avoid regressing mmap, munmap, and other syscall latencies; while still providing observability for the max vma_count exceeded failure condition.
The original centralized check (before previous review rounds) was indeed in a dedicated function, exceeds_max_map_count(), where a kprobe/fprobe could have been easily attached without impacting the common path. This was changed due to previous review feedback to the capacity based vma_count_remaining() which necessitated the check to be done externally by the callers:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
Would you be ok with something like:
trace_max_vma_count_exceeded(mm);
TP_STRUCT__entry( __field(unsigned int, mm_id) __field(unsigned int vma_count) )
mm_id would be the hash of the mm_struct ptr similar to rss_stat and the vma_count is the current vma count (some syscalls have different requirements on the capacity remaining: mremap requires 6 available slots, other syscalls require 1).
BTW, why the hash of the mm pointer and not the pointer itself? We save pointers in lots of places, and if it is the pointer, you could use an eprobe to attache to the trace event to dereference its fields.
In Android we try to avoid exposing raw kernel pointers to userspace for security reasons: raising /proc/sys/kernel/kptr_restrict to 2 immediately after symbols are resolved for necessary telemetry tooling during early boot. I believe this is also why rss_stat uses the hash and not the raw pointer.
Thanks, Kalesh
-- Steve
On Tue, 16 Sep 2025 10:57:43 -0700 Kalesh Singh kaleshsingh@google.com wrote:
BTW, why the hash of the mm pointer and not the pointer itself? We save pointers in lots of places, and if it is the pointer, you could use an eprobe to attache to the trace event to dereference its fields.
In Android we try to avoid exposing raw kernel pointers to userspace for security reasons: raising /proc/sys/kernel/kptr_restrict to 2 immediately after symbols are resolved for necessary telemetry tooling during early boot. I believe this is also why rss_stat uses the hash and not the raw pointer.
When it comes to tracing, you already lost. If it goes into the ring buffer it's a raw pointer. BPF doesn't use the output of the trace event, so you are exposing nothing from that. It uses the proto directly.
Heck, if you enable function tracing, you are exposing every function address it traces via the raw data output.
-- Steve
On Tue, Sep 16, 2025 at 11:01 AM Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 16 Sep 2025 10:57:43 -0700 Kalesh Singh kaleshsingh@google.com wrote:
BTW, why the hash of the mm pointer and not the pointer itself? We save pointers in lots of places, and if it is the pointer, you could use an eprobe to attache to the trace event to dereference its fields.
In Android we try to avoid exposing raw kernel pointers to userspace for security reasons: raising /proc/sys/kernel/kptr_restrict to 2 immediately after symbols are resolved for necessary telemetry tooling during early boot. I believe this is also why rss_stat uses the hash and not the raw pointer.
When it comes to tracing, you already lost. If it goes into the ring buffer it's a raw pointer. BPF doesn't use the output of the trace event, so you are exposing nothing from that. It uses the proto directly.
My understanding is that the BPF tracepoint type uses the trace event fields from TP_STRUCT__entry(), whereas the raw tracepoint type has access to the proto arguments. Please CMIW: isn't what we'd be adding to the trace buffer the hashed mm_id value?
Heck, if you enable function tracing, you are exposing every function address it traces via the raw data output.
Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER in Android production kernels.
Thanks, Kalesh
-- Steve
On Tue, 16 Sep 2025 11:23:03 -0700 Kalesh Singh kaleshsingh@google.com wrote:
When it comes to tracing, you already lost. If it goes into the ring buffer it's a raw pointer. BPF doesn't use the output of the trace event, so you are exposing nothing from that. It uses the proto directly.
My understanding is that the BPF tracepoint type uses the trace event fields from TP_STRUCT__entry(); whereas the raw tracepoint type has access to the proto arguments. Please CMIW: Isn't what we'd be adding to the trace buffer is the hashed mm_id value?
Ah, right. Can't the BPF infrastructure protect against it?
Heck, if you enable function tracing, you are exposing every function address it traces via the raw data output.
Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER in Android production kernels.
Does it block all the other trace events that share pointers?
Like nmi handler tracepoints, x86_fpu tracepoints, page_fault kernel tracepoint, tasklet tracepoints, cpu hot plug tracepoints, timer tracepoints, work queue tracepoints, ipi tracepoints, scheduling tracepoints, locking tracepoints, rcu tracepoints, dma tracepoints, module tracepoints, xdp tracepoints, filemap tracepoints, page map tracepoints, vmscan tracepoints, percpu tracepoints, kmem_cache tracepoints, mmap lock tracepoints, file lock tracepoints, and many more! (I got tired of looking them up).
Again, are you really protecting anything if this one trace point hashes the pointer? Most other tracepoints expose this. If BPF can access a tracepoint entry struct, it can access the raw address and break KASLR.
-- Steve
On Tue, Sep 16, 2025 at 11:51 AM Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 16 Sep 2025 11:23:03 -0700 Kalesh Singh kaleshsingh@google.com wrote:
When it comes to tracing, you already lost. If it goes into the ring buffer it's a raw pointer. BPF doesn't use the output of the trace event, so you are exposing nothing from that. It uses the proto directly.
My understanding is that the BPF tracepoint type uses the trace event fields from TP_STRUCT__entry(); whereas the raw tracepoint type has access to the proto arguments. Please CMIW: Isn't what we'd be adding to the trace buffer is the hashed mm_id value?
Ah, right. Can't the BPF infrastructure protect against it?
Heck, if you enable function tracing, you are exposing every function address it traces via the raw data output.
Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER in Android production kernels.
Does it block all the other trace events that share pointers?
Like nmi handler tracepoints, x86_fpu tracepoints, page_fault kernel tracepoint, tasklet tracepoints, cpu hot plug tracepoints, timer tracepoints, work queue tracepoints, ipi tracepoints, scheduling tracepoints, locking tracepoints, rcu tracepoints, dma tracepoints, module tracepoints, xdp tracepoints, filemap tracepoints, page map tracepoints, vmscan tracepoints, percpu tracepoints, kmem_cache tracepoints, mmap lock tracepoints, file lock tracepoints, and many more! (I got tired of looking them up).
Hi Steve,
I see your point :) I'll use the raw pointer and handle not exposing it from the BPF side.
Thanks for discussing.
--Kalesh
Again, are you really protecting anything if this one trace point hashes the pointer? Most other tracepoints expose this. If BPF can access a tracepoint entry struct, it can access the raw address and break KASLR.
Thanks, Kalesh
-- Steve
On Tue, Sep 16, 2025 at 11:52:20AM -0400, Steven Rostedt wrote:
On Mon, 15 Sep 2025 18:19:53 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Hi Steve,
Thanks for the comments and suggestion you are right we can use bpf to get the comm. There is nothing special about this trace event. I will drop comm in the next revision.
The reason I did the task_struct parameter (current): I believe there is a limitation that we must specify at least 1 parameter to the TRACE_EVENT() PROTO and ARGS macros.
OK, then this is another issue. We don't want tracepoint "markers". Each tracepoint can take up to 5K in memory due to the code it generates and the meta data to control it.
For something like that, we highly recommend dynamic probes (fprobes, kprobes, etc).
The only purpose of a static tracepoint is to get data within a function that is too difficult to get via a probe. It should never be used as a trigger where its purpose is "we hit this path".
Isn't the usual problem with that approach that of static functions/static inline functions? I was tracing through a problem a few months ago, and man I really did think "wouldn't it be nice to have a tracepoint instead of fishing around for kprobe spots".
Not that I particularly think a tracepoint is super worth it in this case, but, y'know.
On Thu, 18 Sep 2025 12:38:56 +0100 Pedro Falcato pfalcato@suse.de wrote:
Isn't the usual problem with that approach, that of static functions/static inline functions? I was tracing through a problem a few months ago, and man I really did think "wouldn't it be nice to have a tracepoint instead of fishing around for kprobe spots".
Not that I particularly think a tracepoint is super worth it in this case, but, y'know.
Yes, it would be useful. The issue is that tracepoints are not free. They do increase the I$ hit and take up memory.
If you're going to inject a tracepoint somewhere, at least extract some useful information from that spot. If you can't think of anything to track, then it's not worth a tracepoint.
If one really wants a way to track something, they could add a static branch that would call a function when enabled that could be traced. They would also need a way to enable that static branch.
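A minimal sketch of what such a static-branch hook could look like; none of these names exist in the kernel or in this series:

#include <linux/jump_label.h>
#include <linux/mm_types.h>

static DEFINE_STATIC_KEY_FALSE(vma_limit_hook_enabled);

/* Deliberately empty and noinline: an fprobe/kprobe attached here can pull
 * whatever it needs from @mm when the branch is enabled. */
noinline void vma_limit_exceeded_hook(struct mm_struct *mm)
{
}

static inline void note_vma_limit_exceeded(struct mm_struct *mm)
{
	if (static_branch_unlikely(&vma_limit_hook_enabled))
		vma_limit_exceeded_hook(mm);
}

/* Some knob (debugfs, sysctl) would call
 * static_branch_enable(&vma_limit_hook_enabled) to turn the hook on. */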
-- Steve
On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
Needed observability on in field devices can be collected with minimal overhead and can be toggled on and off. Event driven telemetry can be done with tracepoint BPF programs.
The process comm is provided for aggregation across devices and tgid is to enable per-process aggregation per device.
This allows for observing the distribution of such problems in the field, to deduce if there are legitimate bugs or if a bump to the limit is warranted.
It's not really a bug though is it? It's somebody running out of resources.
I'm not sure how useful this is really. But I'm open to being convinced!
I also wonder if this is better as a statistic? You'd figure out it was a problem that way too right?
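For comparison, a counter-based variant might look something like this; VMA_MAX_COUNT_EXCEEDED is a made-up vm_event_item used purely for illustration (it would need adding to the enum and to vmstat_text[] in mm/vmstat.c):

#include <linux/vmstat.h>

static int check_vma_capacity(struct mm_struct *mm)
{
	if (!vma_count_remaining(mm)) {
		count_vm_event(VMA_MAX_COUNT_EXCEEDED);	/* hypothetical item */
		return -ENOMEM;
	}
	return 0;
}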
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
This breaks the VMA tests, please make sure to always check them:
cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o vma.o vma.c
In file included from vma.c:33:
../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
   10 | #include <trace/events/vma.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [<builtin>: vma.o] Error 1
See below though, you've included this in the wrong place (I don't blame you, perhaps we've not made this _very_ clear :)
Chnages in v2:
- Add needed observability for operations failing due to the vma count limit, per Minchan (Since there isn't a common point for debug logging due checks being external to the capacity based vma_count_remaining() helper. I used a trace event for low overhead and to facilitate event driven telemetry for in field devices)
include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++
Do we really need a new file? We already have VMA-related tracepoints no?
Also if you add a new file you _have_ to update MAINTAINERS.
mm/mmap.c | 5 ++++- mm/mremap.c | 10 ++++++++-- mm/vma.c | 11 +++++++++-- 4 files changed, 53 insertions(+), 5 deletions(-) create mode 100644 include/trace/events/vma.h
diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h new file mode 100644 index 000000000000..2fed63b0d0a6 --- /dev/null +++ b/include/trace/events/vma.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM vma
+#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_VMA_H
+#include <linux/tracepoint.h>
+TRACE_EVENT(max_vma_count_exceeded,
- TP_PROTO(struct task_struct *task),
- TP_ARGS(task),
- TP_STRUCT__entry(
__string(comm, task->comm)
__field(pid_t, tgid)
- ),
- TP_fast_assign(
__assign_str(comm);
__entry->tgid = task->tgid;
- ),
- TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
+#endif /* _TRACE_VMA_H */
+/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/mm/mmap.c b/mm/mmap.c index 30ddd550197e..0bb311bf48f3 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -56,6 +56,7 @@
#define CREATE_TRACE_POINTS #include <trace/events/mmap.h> +#include <trace/events/vma.h>
#include "internal.h"
@@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW;
/* Too many mappings? */
- if (!vma_count_remaining(mm))
if (!vma_count_remaining(mm)) {
trace_max_vma_count_exceeded(current);
return -ENOMEM;
}
/*
- addr is returned from get_unmapped_area,
diff --git a/mm/mremap.c b/mm/mremap.c index 14d35d87e89b..f42ac05f0069 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -30,6 +30,8 @@ #include <asm/tlb.h> #include <asm/pgalloc.h>
+#include <trace/events/vma.h>
#include "internal.h"
/* Classify the kind of remap operation being performed. */ @@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm) * We'd prefer to avoid failure later on in do_munmap: * which may split one vma into three before unmapping. */
- if (vma_count_remaining(current->mm) < 4)
- if (vma_count_remaining(current->mm) < 4) {
trace_max_vma_count_exceeded(current);
But we didn't exceed it, we're guessing we might?
return -ENOMEM;
}
if (vma->vm_ops && vma->vm_ops->may_split) { if (vma->vm_start != old_addr)
@@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) * the threshold. In other words, is the current map count + 6 at or * below the threshold? Otherwise return -ENOMEM here to be more safe. */
- if (vma_count_remaining(current->mm) < 6)
- if (vma_count_remaining(current->mm) < 6) {
trace_max_vma_count_exceeded(current);
Similar point here.
return -ENOMEM;
}
return 0;
} diff --git a/mm/vma.c b/mm/vma.c index 0e4fcaebe209..692c33c3e84d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -7,6 +7,8 @@ #include "vma_internal.h" #include "vma.h"
+#include <trace/events/vma.h>
Nope you don't do this :)
vma.c is designed to be userland testable and _intentionally_ puts all its includes in vma_internal.h.
So you'd need to add this include over there, and then update the vma tests vma_internal.h to stub out the trace function.
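Concretely, the userland harness only needs a no-op stand-in for the tracepoint, along the lines of the stub Lorenzo posts as a build fix further down the thread:

/* tools/testing/vma/vma_internal.h */
static void trace_max_vma_count_exceeded(struct task_struct *task)
{
}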
struct mmap_state { struct mm_struct *mm; struct vma_iterator *vmi; @@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) {
- if (!vma_count_remaining(vma->vm_mm))
if (!vma_count_remaining(vma->vm_mm)) {
trace_max_vma_count_exceeded(current);
return -ENOMEM;
}
return __split_vma(vmi, vma, addr, new_below);
} @@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, */ if (vms->end < vms->vma->vm_end && !vma_count_remaining(vms->vma->vm_mm)) {
+		trace_max_vma_count_exceeded(current);
 		error = -ENOMEM;
 		goto vma_count_exceeded;
 	}
@@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM;
- if (!vma_count_remaining(mm))
if (!vma_count_remaining(mm)) {
trace_max_vma_count_exceeded(current);
return -ENOMEM;
}
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT)) return -ENOMEM;
-- 2.51.0.384.g4c02a37b29-goog
Cheers, Lorenzo
On Thu, Sep 18, 2025 at 02:42:16PM +0100, Lorenzo Stoakes wrote:
On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
Needed observability on in field devices can be collected with minimal overhead and can be toggled on and off. Event driven telemetry can be done with tracepoint BPF programs.
The process comm is provided for aggregation across devices and tgid is to enable per-process aggregation per device.
This allows for observing the distribution of such problems in the field, to deduce if there are legitimate bugs or if a bump to the limit is warranted.
It's not really a bug though is it? It's somebody running out of resources.
I'm not sure how useful this is really. But I'm open to being convinced!
I also wonder if this is better as a statistic? You'd figure out it was a problem that way too right?
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
This breaks the VMA tests, please make sure to always check them:
cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o vma.o vma.c In file included from vma.c:33: ../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory 10 | #include <trace/events/vma.h> | ^~~~~~~~~~~~~~~~~~~~ compilation terminated. make: *** [<builtin>: vma.o] Error 1
Trivial build fix:
----8<----
From fe4c30abbd302ccc628ec92381ac10cea31c6d85 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Date: Thu, 18 Sep 2025 14:47:10 +0100
Subject: [PATCH] fix

Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
---
 mm/vma.c                         | 2 --
 mm/vma_internal.h                | 2 ++
 tools/testing/vma/vma_internal.h | 4 ++++
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 26046b28cdda..a11d29a2ddc0 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,8 +7,6 @@
 #include "vma_internal.h"
 #include "vma.h"

-#include <trace/events/vma.h>
-
 struct mmap_state {
 	struct mm_struct *mm;
 	struct vma_iterator *vmi;
diff --git a/mm/vma_internal.h b/mm/vma_internal.h
index 2f05735ff190..2f5ba679f43d 100644
--- a/mm/vma_internal.h
+++ b/mm/vma_internal.h
@@ -47,6 +47,8 @@
 #include <linux/uprobes.h>
 #include <linux/userfaultfd_k.h>

+#include <trace/events/vma.h>
+
 #include <asm/current.h>
 #include <asm/tlb.h>

diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07f4108c5e4c..c08c91861b9a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1661,4 +1661,8 @@ static inline void vma_count_dec(struct mm_struct *mm)
 	vma_count_sub(mm, 1);
 }

+static void trace_max_vma_count_exceeded(struct task_struct *task)
+{
+}
+
 #endif /* __MM_VMA_INTERNAL_H */
--
2.51.0
On Thu, Sep 18, 2025 at 6:52 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Sep 18, 2025 at 02:42:16PM +0100, Lorenzo Stoakes wrote:
On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
Needed observability on in field devices can be collected with minimal overhead and can be toggled on and off. Event driven telemetry can be done with tracepoint BPF programs.
The process comm is provided for aggregation across devices and tgid is to enable per-process aggregation per device.
This allows for observing the distribution of such problems in the field, to deduce if there are legitimate bugs or if a bump to the limit is warranted.
It's not really a bug though is it? It's somebody running out of resources.
I'm not sure how useful this is really. But I'm open to being convinced!
I also wonder if this is better as a statistic? You'd figure out it was a problem that way too right?
Cc: Andrew Morton akpm@linux-foundation.org Cc: David Hildenbrand david@redhat.com Cc: "Liam R. Howlett" Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lorenzo.stoakes@oracle.com Cc: Mike Rapoport rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Pedro Falcato pfalcato@suse.de Signed-off-by: Kalesh Singh kaleshsingh@google.com
This breaks the VMA tests, please make sure to always check them:
cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o vma.o vma.c In file included from vma.c:33: ../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory 10 | #include <trace/events/vma.h> | ^~~~~~~~~~~~~~~~~~~~ compilation terminated. make: *** [<builtin>: vma.o] Error 1
Trivial build fix:
----8<---- From fe4c30abbd302ccc628ec92381ac10cea31c6d85 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes lorenzo.stoakes@oracle.com Date: Thu, 18 Sep 2025 14:47:10 +0100 Subject: [PATCH] fix
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
mm/vma.c | 2 -- mm/vma_internal.h | 2 ++ tools/testing/vma/vma_internal.h | 4 ++++ 3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c index 26046b28cdda..a11d29a2ddc0 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -7,8 +7,6 @@ #include "vma_internal.h" #include "vma.h"
-#include <trace/events/vma.h>
struct mmap_state { struct mm_struct *mm; struct vma_iterator *vmi; diff --git a/mm/vma_internal.h b/mm/vma_internal.h index 2f05735ff190..2f5ba679f43d 100644 --- a/mm/vma_internal.h +++ b/mm/vma_internal.h @@ -47,6 +47,8 @@ #include <linux/uprobes.h> #include <linux/userfaultfd_k.h>
+#include <trace/events/vma.h>
#include <asm/current.h> #include <asm/tlb.h>
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 07f4108c5e4c..c08c91861b9a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1661,4 +1661,8 @@ static inline void vma_count_dec(struct mm_struct *mm) vma_count_sub(mm, 1); }
+static void trace_max_vma_count_exceeded(struct task_struct *task) +{ +}
#endif /* __MM_VMA_INTERNAL_H */
I made a point of building and running your tests; it seems I forgot to actually test it with this last patch.
Thanks for the fix.
--Kalesh
-- 2.51.0
On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Hi all,
This is v2 to the VMA count patch I previously posted at:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
I've split it into multiple patches to address the feedback.
The main changes in v2 are:
- Use a capacity-based check for VMA count limit, per Lorenzo.
- Rename map_count to vma_count, per David.
- Add assertions for exceeding the limit, per Pedro.
- Add tests for max_vma_count, per Liam.
- Emit a trace event for failure due to insufficient capacity for observability
Tested on x86_64 and arm64:
Build test:
- allyesconfig for rename
Selftests: cd tools/testing/selftests/mm && \ make && \ ./run_vmtests.sh -t max_vma_count
(With trace_max_vma_count_exceeded enabled)
vma tests: cd tools/testing/vma && \ make && \ ./vma
fwiw, there's nothing in the above which is usable in a [0/N] overview.
While useful, the "what changed since the previous version" info isn't a suitable thing to carry in the permanent kernel record - it's short-term transient stuff, not helpful to someone who is looking at the patchset in 2029.
Similarly, the "how it was tested" material is also useful, but it becomes irrelevant as soon as the code hits linux-next and mainline.
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
On Mon, Sep 15, 2025 at 3:34 PM Andrew Morton akpm@linux-foundation.org wrote:
On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh kaleshsingh@google.com wrote:
Hi all,
This is v2 to the VMA count patch I previously posted at:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
I've split it into multiple patches to address the feedback.
The main changes in v2 are:
- Use a capacity-based check for VMA count limit, per Lorenzo.
- Rename map_count to vma_count, per David.
- Add assertions for exceeding the limit, per Pedro.
- Add tests for max_vma_count, per Liam.
- Emit a trace event for failure due to insufficient capacity for observability
Tested on x86_64 and arm64:
Build test:
- allyesconfig for rename
Selftests: cd tools/testing/selftests/mm && \ make && \ ./run_vmtests.sh -t max_vma_count
(With trace_max_vma_count_exceeded enabled)
vma tests: cd tools/testing/vma && \ make && \ ./vma
fwiw, there's nothing in the above which is usable in a [0/N] overview.
While useful, the "what changed since the previous version" info isn't a suitable thing to carry in the permanent kernel record - it's short-term treansient stuff, not helpful to someone who is looking at the patchset in 2029.
Similarly, the "how it was tested" material is also useful, but it becomes irrelevant as soon as the code hits linux-next and mainline.
Hi Andrew,
Thanks for the feedback. Do you mean the cover letter was not needed in this case or that it lacked enough context?
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
I understand, yes 1/7 is all we need for now, since it prevents a situation where we get over the limit and cannot recover, as munmap() will then always fail.
Thanks, Kalesh
On Mon, 15 Sep 2025 16:10:55 -0700 Kalesh Singh kaleshsingh@google.com wrote:
fwiw, there's nothing in the above which is usable in a [0/N] overview.
While useful, the "what changed since the previous version" info isn't a suitable thing to carry in the permanent kernel record - it's short-term treansient stuff, not helpful to someone who is looking at the patchset in 2029.
Similarly, the "how it was tested" material is also useful, but it becomes irrelevant as soon as the code hits linux-next and mainline.
Hi Andrew,
Thanks for the feedback. Do you mean the cover letter was not needed in this case or that it lacked enough context?
The latter. As I've split up the series, please put together some words to describe the remaining 6 patches if/when resending.
Thanks.
On Mon, Sep 15, 2025 at 5:05 PM Andrew Morton akpm@linux-foundation.org wrote:
On Mon, 15 Sep 2025 16:10:55 -0700 Kalesh Singh kaleshsingh@google.com wrote:
fwiw, there's nothing in the above which is usable in a [0/N] overview.
While useful, the "what changed since the previous version" info isn't a suitable thing to carry in the permanent kernel record - it's short-term treansient stuff, not helpful to someone who is looking at the patchset in 2029.
Similarly, the "how it was tested" material is also useful, but it becomes irrelevant as soon as the code hits linux-next and mainline.
Hi Andrew,
Thanks for the feedback. Do you mean the cover letter was not needed in this case or that it lacked enough context?
The latter. As I've split up the series, please put together some words to describe the remaining 6 patches if/when resending.
Hi Andrew,
Thanks for clarifying. I'll make sure to fix that when resending.
--Kalesh
Thanks.
On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
Kalesh - I do intend to look at this series when I have a chance. My review workload has been insane so it's hard to keep up at the moment.
Andrew - This cycle has been crazy. Speaking from the point of view of somebody doing a lot of review, it's been very, very exhausting from this side too, and this kind of work can feel a little... thankless... sometimes :)
I feel like we maybe need a way to ask people to slow down, sometimes at least.
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor in all this too :)
(I am speaking at Kernel Recipes then going on a very-badly-needed 2.5 week vacation afterwards over the merge window, so I hope to stave off burnout that way. Be good if I could keep mails upon return to 3 digits, but I have my doubts :P)
Cheers, Lorenzo
On Tue, Sep 16, 2025 at 3:12 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
Kalesh - I do intend to look at this series when I have a chance. My review workload has been insane so it's hard to keep up at the moment.
Andrew - This cycle has been crazy, speaking from point of view of somebody doing a lot of review, it's been very very exhausting from this side too, and this kind of work can feel a little... thankless... sometimes :)
I feel like we maybe need a way to ask people to slow down, sometimes at least.
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor in all this too :)
(I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week vacataion afterwards over the merge window so I hope to stave off burnout that way. Be good if I could keep mails upon return to 3 digits, but I have my doubts :P)
Hi Lorenzo,
Absolutely, take care of yourself. We all appreciate the amount of work you put into reviewing :)
Have a good talk and enjoy your vacation. Don't worry about the backlog.
-- Kalesh
Cheers, Lorenzo
On Tue, 16 Sep 2025 11:12:03 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
Kalesh - I do intend to look at this series when I have a chance. My review workload has been insane so it's hard to keep up at the moment.
Andrew - This cycle has been crazy, speaking from point of view of somebody doing a lot of review, it's been very very exhausting from this side too, and this kind of work can feel a little... thankless... sometimes :)
I hear you. I'm shedding most everything now, to give us a couple of weeks to digest.
I feel like we maybe need a way to ask people to slow down, sometimes at least.
Yup, I'm sending submitters private emails explaining the situation.
Maybe they should be public emails, I find it a hard call.
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
I'm having trouble understanding what you said here?
Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor in all this too :)
(I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week vacataion afterwards over the merge window so I hope to stave off burnout that way. Be good if I could keep mails upon return to 3 digits, but I have my doubts :P)
I'd blow that in three days ;)
On Tue, Sep 16, 2025 at 07:16:45PM -0700, Andrew Morton wrote:
On Tue, 16 Sep 2025 11:12:03 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
Kalesh - I do intend to look at this series when I have a chance. My review workload has been insane so it's hard to keep up at the moment.
Andrew - This cycle has been crazy, speaking from point of view of somebody doing a lot of review, it's been very very exhausting from this side too, and this kind of work can feel a little... thankless... sometimes :)
I hear you. I'm shedding most everything now, to give us a couple of weeks to digest.
Thanks, much appreciated! :)
I feel like we maybe need a way to ask people to slow down, sometimes at least.
Yup, I'm sending submitters private emails explaining the situation.
And again, much appreciated :)
Maybe they should be public emails, I find it a hard call.
Yeah it can be hard to get that balance right. Maybe public is better when we're deeper in the rc and there's a general load problem?
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
I'm having trouble understanding what you said here?
Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the merge window, esp. now we have mm-new.
And last merge window my review load felt similar to during a cycle, which was kind of crazy.
So I wonder if we should be less accommodating and simply say 'sorry it's the merge window, no submissions accepted'?
Of course I'm being a bit selfish here as I'm on holiday in the next merge window and hope forlornly to reduce the mail I come back to :P
Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor in all this too :)
(I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week vacataion afterwards over the merge window so I hope to stave off burnout that way. Be good if I could keep mails upon return to 3 digits, but I have my doubts :P)
I'd blow that in three days ;)
Haha yeah I bet :)
Cheers, Lorenzo
On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
I'm having trouble understanding what you said here?
Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the merge window, esp. now we have mm-new.
And last merge window my review load felt similar to during a cycle, which was kind of crazy.
So I wonder if we should be less accommodating and simply say 'sorry it's the merge window, no submissions accepted'?
hm, I always have a lot of emails piled up by the time mm-stable gets merged upstream. That ~1 week between "we merged" and "-rc1" is a nice time to go through that material and add it to mm-new. I think it smooths things out. I mean, this is peak time for people to be considering the new material?
(ot, that backlog is always >400 emails and a lot of that gets tossed out anyway - either it's just too old so I request a refresh or there was a new version, or review was unpromising, etc).
On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
I'm having trouble understanding what you said here?
Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the merge window, esp. now we have mm-new.
And last merge window my review load felt similar to during a cycle, which was kind of crazy.
So I wonder if we should be less accommodating and simply say 'sorry it's the merge window, no submissions accepted'?
hm, I always have a lot of emails piled up by the time mm-stable gets merged upstream. That ~1 week between "we merged" and "-rc1" is a nice time to go through that material and add it to mm-new. I think it smooths things out. I mean, this is peak time for people to be considering the new material?
I'm confused, why is the merge window a good time to consider new material?
People have the entirety of the cycle to submit new material, and they do so.
And equally, people are sending new revisions of old code, engaging in discussion from old series etc. during the merge window also.
What happens is you essentially have reviewers work 9 weeks instead of 7 for a cycle without much of a let up (+ so no break from it), based on workload from the past cycle/merge window.
I mean I can only ask that perhaps we consider not doing this in mm (I gather many other subsystems equally have a kinda 'freeze' during this time).
(ot, that backlog is always >400 emails and a lot of that gets tossed out anyway - either it's just too old so I request a refresh or there was a new version, or review was unpromising, etc).
Right, which actually makes everything a lot more uncertain from a reviewer's point of view, as we don't necessarily have a solid git base, mm-new isn't synced with Linus's tree very much during this time, etc.
Which makes the merge window actually even worse for this stuff.
From the point of view of avoiding burnout, it'd be good to manage expectations a bit on this.
Personally I remember literally saying to David during the last merge window 'well I hoped I could go off and work on my own stuff instead of just review, guess not then' :)
THP is a particularly busy area at the moment which is part of all this.
On 18.09.25 12:29, Lorenzo Stoakes wrote:
On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
I'm having trouble understanding what you said here?
Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the merge window, esp. now we have mm-new.
And last merge window my review load felt similar to during a cycle, which was kind of crazy.
So I wonder if we should be less accommodating and simply say 'sorry it's the merge window, no submissions accepted'?
hm, I always have a lot of emails piled up by the time mm-stable gets merged upstream. That ~1 week between "we merged" and "-rc1" is a nice time to go through that material and add it to mm-new. I think it smooths things out. I mean, this is peak time for people to be considering the new material?
I'm confused, why is the merge window a good time to consider new material?
People have the entirety of the cycle to submit new material, and they do so.
My view is that if you are sending a cleanup/feature during the merge window you cannot expect a fast reply, and you should not keep sending new versions in that timeframe expecting that everyone you CCed who should have a look actually did have a look.
On Thu, Sep 18, 2025 at 02:07:09PM +0200, David Hildenbrand wrote:
On 18.09.25 12:29, Lorenzo Stoakes wrote:
On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Perhaps being less accepting of patches during merge window is one aspect, as the merge window leading up to this cycle was almost the same review load as when the cycle started.
I'm having trouble understanding what you said here?
Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the merge window, esp. now we have mm-new.
And last merge window my review load felt similar to during a cycle, which was kind of crazy.
So I wonder if we should be less accommodating and simply say 'sorry it's the merge window, no submissions accepted'?
hm, I always have a lot of emails piled up by the time mm-stable gets merged upstream. That ~1 week between "we merged" and "-rc1" is a nice time to go through that material and add it to mm-new. I think it smooths things out. I mean, this is peak time for people to be considering the new material?
I'm confused, why is the merge window a good time to consider new material?
People have the entirety of the cycle to submit new material, and they do so.
My view is that if you are sending a cleanup/feature during the merge window you cannot expect a fast reply, and you should not keep sending new versions in that timeframe expecting that everyone you CCed who should have a look actually did have a look.
Yes exactly.
The problem is that all the conversations, respins, etc. _do_ carry on as normal, and often land in mm-new, queued up for mm-unstable etc. unless we happen to notice them.
So it makes it impossible for us to simply ignore things until the next cycle (or we'd need to go through every thread that happened in the meantime).
And people know that, so they just keep on submitting as normal. That was _really_ palpable last merge window.
I mean I'm cheating by going on vacation for this merge window ;) but that obviously means I'll have 2 weeks of review to check when I get back + the 1st week of the cycle to go too.
I think in some subsystems new series/respins are actively unwelcome during the merge window. I wonder if we should be the same?
-- Cheers
David / dhildenb
Cheers, Lorenzo
On Thu, 18 Sep 2025 13:49:33 +0100 Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
I'm confused, why is the merge window a good time to consider new material?
People have the entirety of the cycle to submit new material, and they do so.
My view is that if you are sending a cleanup/feature during the merge window you cannot expect a fast reply, and you should not keep sending new versions in that timeframe expecting that everyone you CCed who should have a look actually did have a look.
Yes exactly.
The problem is that all the conversations, respins, etc. _do_ carry on as normal, and often land in mm-new, queued up for mm-unstable etc. unless we happen to notice them.
So it makes it impossible for us to simply ignore things until the next cycle (or we'd need to go through every thread that happened in the meantime).
And people know that, so they just keep on submitting as normal. That was _really_ palpable last merge window.
Well, what else do we have to do during the merge window? The previous cycle's batch is merged up and there may be some fallout from that, but it isn't very time-consuming.
If you're proposing that we start to use that period as a break for sanity purposes then OK, didn't see that one coming, don't know how widespread this desire is.
But perhaps a better time for this quiet period is during -rc6 and -rc7 when the rate of merging is throttled right back. Or perhaps from -rc7 to mid-merge-window.