The failing hugetlb vmsplice() COW tests keep confusing people, and having tests that have been failing for years, and that will likely keep failing for years to come because nobody cares enough, is rather suboptimal. Let's mark them as XFAIL and document why fixing them is not as easy as it might appear at first sight.
More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
David Hildenbrand (2):
  selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL
  mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions
 mm/hugetlb.c                     |   7 ++
 tools/testing/selftests/mm/cow.c | 106 +++++++++++++++++++++----------
 2 files changed, 78 insertions(+), 35 deletions(-)
The vmsplice() hugetlb tests have been failing right from the start, and we documented that in the introducing commit 7dad331be781 ("selftests/vm: anon_cow: hugetlb tests"):
Note that some test cases still fail. This will, for example, be fixed once vmsplice() properly uses FOLL_PIN instead of FOLL_GET for pinning. With 2 MiB and 1 GiB hugetlb on x86_64, the expected failures are:
Until vmsplice() is changed, these tests will likely keep failing: the hugetlb COW reuse logic is harder to change, because switching to the same reuse logic we use for !hugetlb could harm other (sane) users once we run out of free hugetlb pages.
More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
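For illustration, here is a rough sketch of the child side of the leak scenario these tests exercise; it is only loosely modeled on child_vmsplice_memcmp_fn() in cow.c (fork()/pipe setup and error handling are omitted, and it assumes the region fits into the pipe buffer):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Sketch of the child side of the leak scenario: vmsplice() the COW-shared
 * memory into a pipe, unmap it, and later compare what the pipe sees
 * against the original content. Returns 0 if nothing leaked.
 */
static int child_vmsplice_leak_check(char *mem, size_t size, int pipefds[2])
{
	struct iovec iov = { .iov_base = mem, .iov_len = size };
	char *old = malloc(size);
	char *new = malloc(size);
	int ret;

	memcpy(old, mem, size);			/* remember the shared content */
	vmsplice(pipefds[1], &iov, 1, 0);	/* takes a FOLL_GET reference */
	munmap(mem, size);			/* mapping gone, pipe ref remains */
	/*
	 * ... the parent now writes to the same range; if it wrongly reuses
	 * the page, the write becomes visible through the pipe ...
	 */
	read(pipefds[0], new, size);
	ret = memcmp(old, new, size) ? 1 : 0;	/* 1 would mean a leak */
	free(old);
	free(new);
	return ret;
}

If the parent's write fault reuses the COW-shared page instead of copying it, the final memcmp() observes the parent's new data through the pipe.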
These (expected) failures keep confusing people, so flag them accordingly.
Before:
	$ ./cow
	[...]
	Bail out! 8 out of 778 tests failed
	# Totals: pass:769 fail:8 xfail:0 xpass:0 skip:1 error:0
	$ echo $?
	1
After:
	$ ./cow
	[...]
	# Totals: pass:769 fail:0 xfail:8 xpass:0 skip:1 error:0
	$ echo $?
	0
[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 tools/testing/selftests/mm/cow.c | 106 +++++++++++++++++++++----------
 1 file changed, 71 insertions(+), 35 deletions(-)
diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 363bf5f801be5..0549672acbd63 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -199,7 +199,7 @@ static int child_vmsplice_memcmp_fn(char *mem, size_t size,
 typedef int (*child_fn)(char *mem, size_t size, struct comm_pipes *comm_pipes);
 
 static void do_test_cow_in_parent(char *mem, size_t size, bool do_mprotect,
-				  child_fn fn)
+				  child_fn fn, bool xfail)
 {
 	struct comm_pipes comm_pipes;
 	char buf;
@@ -247,33 +247,47 @@ static void do_test_cow_in_parent(char *mem, size_t size, bool do_mprotect,
 	else
 		ret = -EINVAL;
 
-	ksft_test_result(!ret, "No leak from parent into child\n");
+	if (!ret) {
+		ksft_test_result_pass("No leak from parent into child\n");
+	} else if (xfail) {
+		/*
+		 * With hugetlb, some vmsplice() tests are currently expected to
+		 * fail because (a) harder to fix and (b) nobody really cares.
+		 * Flag them as expected failure for now.
+		 */
+		ksft_test_result_xfail("Leak from parent into child\n");
+	} else {
+		ksft_test_result_fail("Leak from parent into child\n");
+	}
 close_comm_pipes:
 	close_comm_pipes(&comm_pipes);
 }
 
-static void test_cow_in_parent(char *mem, size_t size)
+static void test_cow_in_parent(char *mem, size_t size, bool is_hugetlb)
 {
-	do_test_cow_in_parent(mem, size, false, child_memcmp_fn);
+	do_test_cow_in_parent(mem, size, false, child_memcmp_fn, false);
 }
 
-static void test_cow_in_parent_mprotect(char *mem, size_t size)
+static void test_cow_in_parent_mprotect(char *mem, size_t size, bool is_hugetlb)
 {
-	do_test_cow_in_parent(mem, size, true, child_memcmp_fn);
+	do_test_cow_in_parent(mem, size, true, child_memcmp_fn, false);
 }
 
-static void test_vmsplice_in_child(char *mem, size_t size)
+static void test_vmsplice_in_child(char *mem, size_t size, bool is_hugetlb)
 {
-	do_test_cow_in_parent(mem, size, false, child_vmsplice_memcmp_fn);
+	do_test_cow_in_parent(mem, size, false, child_vmsplice_memcmp_fn,
+			      is_hugetlb);
 }
 
-static void test_vmsplice_in_child_mprotect(char *mem, size_t size)
+static void test_vmsplice_in_child_mprotect(char *mem, size_t size,
+					    bool is_hugetlb)
 {
-	do_test_cow_in_parent(mem, size, true, child_vmsplice_memcmp_fn);
+	do_test_cow_in_parent(mem, size, true, child_vmsplice_memcmp_fn,
+			      is_hugetlb);
 }
 
 static void do_test_vmsplice_in_parent(char *mem, size_t size,
-				       bool before_fork)
+				       bool before_fork, bool xfail)
 {
 	struct iovec iov = {
 		.iov_base = mem,
@@ -355,8 +369,18 @@ static void do_test_vmsplice_in_parent(char *mem, size_t size,
 		}
 	}
 
-	ksft_test_result(!memcmp(old, new, transferred),
-			 "No leak from child into parent\n");
+	if (!memcmp(old, new, transferred)) {
+		ksft_test_result_pass("No leak from child into parent\n");
+	} else if (xfail) {
+		/*
+		 * With hugetlb, some vmsplice() tests are currently expected to
+		 * fail because (a) harder to fix and (b) nobody really cares.
+		 * Flag them as expected failure for now.
+		 */
+		ksft_test_result_xfail("Leak from child into parent\n");
+	} else {
+		ksft_test_result_fail("Leak from child into parent\n");
+	}
 close_pipe:
 	close(fds[0]);
 	close(fds[1]);
@@ -367,14 +391,14 @@ static void do_test_vmsplice_in_parent(char *mem, size_t size,
 	free(new);
 }
 
-static void test_vmsplice_before_fork(char *mem, size_t size)
+static void test_vmsplice_before_fork(char *mem, size_t size, bool is_hugetlb)
 {
-	do_test_vmsplice_in_parent(mem, size, true);
+	do_test_vmsplice_in_parent(mem, size, true, is_hugetlb);
 }
 
-static void test_vmsplice_after_fork(char *mem, size_t size)
+static void test_vmsplice_after_fork(char *mem, size_t size, bool is_hugetlb)
 {
-	do_test_vmsplice_in_parent(mem, size, false);
+	do_test_vmsplice_in_parent(mem, size, false, is_hugetlb);
 }
 
 #ifdef LOCAL_CONFIG_HAVE_LIBURING
@@ -529,12 +553,12 @@ static void do_test_iouring(char *mem, size_t size, bool use_fork)
 	close_comm_pipes(&comm_pipes);
 }
 
-static void test_iouring_ro(char *mem, size_t size)
+static void test_iouring_ro(char *mem, size_t size, bool is_hugetlb)
 {
 	do_test_iouring(mem, size, false);
 }
 
-static void test_iouring_fork(char *mem, size_t size)
+static void test_iouring_fork(char *mem, size_t size, bool is_hugetlb)
 {
 	do_test_iouring(mem, size, true);
 }
@@ -678,37 +702,41 @@ static void do_test_ro_pin(char *mem, size_t size, enum ro_pin_test test,
 	free(tmp);
 }
 
-static void test_ro_pin_on_shared(char *mem, size_t size)
+static void test_ro_pin_on_shared(char *mem, size_t size, bool is_hugetlb)
 {
 	do_test_ro_pin(mem, size, RO_PIN_TEST_SHARED, false);
 }
 
-static void test_ro_fast_pin_on_shared(char *mem, size_t size)
+static void test_ro_fast_pin_on_shared(char *mem, size_t size, bool is_hugetlb)
 {
 	do_test_ro_pin(mem, size, RO_PIN_TEST_SHARED, true);
 }
 
-static void test_ro_pin_on_ro_previously_shared(char *mem, size_t size)
+static void test_ro_pin_on_ro_previously_shared(char *mem, size_t size,
+						bool is_hugetlb)
 {
 	do_test_ro_pin(mem, size, RO_PIN_TEST_PREVIOUSLY_SHARED, false);
 }
 
-static void test_ro_fast_pin_on_ro_previously_shared(char *mem, size_t size)
+static void test_ro_fast_pin_on_ro_previously_shared(char *mem, size_t size,
+						     bool is_hugetlb)
 {
 	do_test_ro_pin(mem, size, RO_PIN_TEST_PREVIOUSLY_SHARED, true);
 }
 
-static void test_ro_pin_on_ro_exclusive(char *mem, size_t size)
+static void test_ro_pin_on_ro_exclusive(char *mem, size_t size,
+					bool is_hugetlb)
 {
 	do_test_ro_pin(mem, size, RO_PIN_TEST_RO_EXCLUSIVE, false);
 }
 
-static void test_ro_fast_pin_on_ro_exclusive(char *mem, size_t size)
+static void test_ro_fast_pin_on_ro_exclusive(char *mem, size_t size,
+					     bool is_hugetlb)
 {
 	do_test_ro_pin(mem, size, RO_PIN_TEST_RO_EXCLUSIVE, true);
 }
 
-typedef void (*test_fn)(char *mem, size_t size);
+typedef void (*test_fn)(char *mem, size_t size, bool hugetlb);
 
 static void do_run_with_base_page(test_fn fn, bool swapout)
 {
@@ -740,7 +768,7 @@ static void do_run_with_base_page(test_fn fn, bool swapout)
 		}
 	}
 
-	fn(mem, pagesize);
+	fn(mem, pagesize, false);
 munmap:
 	munmap(mem, pagesize);
 }
@@ -904,7 +932,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
 		break;
 	}
 
-	fn(mem, size);
+	fn(mem, size, false);
 munmap:
 	munmap(mmap_mem, mmap_size);
 	if (mremap_mem != MAP_FAILED)
@@ -997,7 +1025,7 @@ static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
 	}
 	munmap(dummy, hugetlbsize);
 
-	fn(mem, hugetlbsize);
+	fn(mem, hugetlbsize, true);
 munmap:
 	munmap(mem, hugetlbsize);
 }
@@ -1036,7 +1064,7 @@ static const struct test_case anon_test_cases[] = {
 	 */
 	{
 		"vmsplice() + unmap in child",
-		test_vmsplice_in_child
+		test_vmsplice_in_child,
 	},
 	/*
	 * vmsplice() test, but do an additional mprotect(PROT_READ)+
@@ -1044,7 +1072,7 @@ static const struct test_case anon_test_cases[] = {
 	 */
 	{
 		"vmsplice() + unmap in child with mprotect() optimization",
-		test_vmsplice_in_child_mprotect
+		test_vmsplice_in_child_mprotect,
 	},
 	/*
 	 * vmsplice() [R/O GUP] in parent before fork(), unmap in parent after
@@ -1322,23 +1350,31 @@ static void do_test_anon_thp_collapse(char *mem, size_t size,
 	close_comm_pipes(&comm_pipes);
 }
 
-static void test_anon_thp_collapse_unshared(char *mem, size_t size)
+static void test_anon_thp_collapse_unshared(char *mem, size_t size,
+					    bool is_hugetlb)
 {
+	assert(!is_hugetlb);
 	do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UNSHARED);
 }
 
-static void test_anon_thp_collapse_fully_shared(char *mem, size_t size)
+static void test_anon_thp_collapse_fully_shared(char *mem, size_t size,
+						bool is_hugetlb)
 {
+	assert(!is_hugetlb);
 	do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_FULLY_SHARED);
 }
 
-static void test_anon_thp_collapse_lower_shared(char *mem, size_t size)
+static void test_anon_thp_collapse_lower_shared(char *mem, size_t size,
+						bool is_hugetlb)
 {
+	assert(!is_hugetlb);
 	do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_LOWER_SHARED);
 }
 
-static void test_anon_thp_collapse_upper_shared(char *mem, size_t size)
+static void test_anon_thp_collapse_upper_shared(char *mem, size_t size,
+						bool is_hugetlb)
 {
+	assert(!is_hugetlb);
 	do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UPPER_SHARED);
 }
Let's document why hugetlb still uses folio_mapcount() and is prone to leaking memory between processes, for example via vmsplice(), which still uses FOLL_GET.
More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
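To make the trade-off concrete, here is a heavily simplified sketch of the two reuse policies; this is not the actual kernel code (see do_wp_page() and hugetlb_wp() for the real logic):

/*
 * Heavily simplified sketch (not the actual kernel code). Anonymous memory
 * bases the COW reuse decision on the folio refcount, so any unexpected
 * (temporary) reference forces a copy. hugetlb keeps using the mapcount,
 * so a transient reference does not force a copy -- but a FOLL_GET pin
 * whose mapping is already gone (e.g., vmsplice() + unmap) is not noticed
 * either.
 */
static bool anon_can_reuse_sketch(struct folio *folio)
{
	/* roughly: nobody but us holds a reference */
	return folio_ref_count(folio) == 1;
}

static bool hugetlb_can_reuse_sketch(struct folio *folio)
{
	/* roughly: nobody but us maps it; extra references are ignored */
	return folio_mapcount(folio) == 1 && folio_test_anon(folio);
}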
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/hugetlb.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 417fc5cdb6eeb..a7efb350f5d07 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5963,6 +5963,13 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	/*
 	 * If no-one else is actually using this page, we're the exclusive
 	 * owner and can reuse this page.
+	 *
+	 * Note that we don't rely on the (safer) folio refcount here, because
+	 * copying the hugetlb folio when there are unexpected (temporary)
+	 * folio references could harm simple fork()+exit() users when
+	 * we run out of free hugetlb folios: we would have to kill processes
+	 * in scenarios that used to work. As a side effect, there can still
+	 * be leaks between processes, for example, with FOLL_GET users.
 	 */
 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
 		if (!PageAnonExclusive(&old_folio->page)) {
On Thu, May 02, 2024 at 10:52:59AM +0200, David Hildenbrand wrote:
Let's document why hugetlb still uses folio_mapcount() and is prone to leaking memory between processes, for example via vmsplice(), which still uses FOLL_GET.
More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

 mm/hugetlb.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 417fc5cdb6eeb..a7efb350f5d07 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5963,6 +5963,13 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	/*
 	 * If no-one else is actually using this page, we're the exclusive
 	 * owner and can reuse this page.
+	 *
+	 * Note that we don't rely on the (safer) folio refcount here, because
+	 * copying the hugetlb folio when there are unexpected (temporary)
+	 * folio references could harm simple fork()+exit() users when
+	 * we run out of free hugetlb folios: we would have to kill processes
+	 * in scenarios that used to work. As a side effect, there can still
+	 * be leaks between processes, for example, with FOLL_GET users.
 	 */
 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
 		if (!PageAnonExclusive(&old_folio->page)) {
Thanks for preparing such updates, David.
However, is fork()+exit() the real problem? E.g. if a child is simply forked, does something, then exits, I don't see a major issue even if we follow the refcount here (despite the "check against 1 or 2 or 3" issue, where hugetlb_fault() can already take one reference). As long as the child quits, all ref / map counts will be released. If the child needs to write to ANON|PRIV memory, it needs to manage hugetlb reservations anyway.
In the case of vmsplice() it's kind of malicious, and holding that refcount (with a mapcount of 0) doesn't sound like the common scenario to me.
IIUC, if we need to keep this, it is more about the case where (as you correctly mentioned in another follow-up reply) hugetlb isn't that flexible with memory overcommit, and in many cases it won't have extra pages floating around to allow ad-hoc CoWs? Random refcount boosts can easily happen, and the problem here is that we simply cannot tell those apart from malicious vmsplice() takers.
Thanks,
On 02.05.24 16:28, Peter Xu wrote:
On Thu, May 02, 2024 at 10:52:59AM +0200, David Hildenbrand wrote:
Let's document why hugetlb still uses folio_mapcount() and is prone to leaking memory between processes, for example via vmsplice(), which still uses FOLL_GET.
More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

 mm/hugetlb.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 417fc5cdb6eeb..a7efb350f5d07 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5963,6 +5963,13 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	/*
 	 * If no-one else is actually using this page, we're the exclusive
 	 * owner and can reuse this page.
+	 *
+	 * Note that we don't rely on the (safer) folio refcount here, because
+	 * copying the hugetlb folio when there are unexpected (temporary)
+	 * folio references could harm simple fork()+exit() users when
+	 * we run out of free hugetlb folios: we would have to kill processes
+	 * in scenarios that used to work. As a side effect, there can still
+	 * be leaks between processes, for example, with FOLL_GET users.
 	 */
 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
 		if (!PageAnonExclusive(&old_folio->page)) {
Thanks for preparing such updates, David.
However, is fork()+exit() the real problem? E.g. if a child is simply forked, does something, then exits, I don't see a major issue even if we follow the refcount here (despite the "check against 1 or 2 or 3" issue, where hugetlb_fault() can already take one reference). As long as the child quits, all ref / map counts will be released. If the child needs to write to ANON|PRIV memory, it needs to manage hugetlb reservations anyway.
The PAE flag was cleared and if there is any unexpected (temporary) reference (page migration, lockless swapcache lookups, whatever), we're in trouble.
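To illustrate the concern with a rough sketch (not kernel code; expected_refs(), reuse() and copy_or_kill() are made-up helpers):

/*
 * Rough sketch, not kernel code: what a refcount-based hugetlb reuse
 * check would do on a write fault after fork()+exit().
 * expected_refs(), reuse() and copy_or_kill() are made-up helpers.
 */
static vm_fault_t hugetlb_wp_refcount_sketch(struct folio *old_folio)
{
	/*
	 * PAE was cleared when the folio became shared at fork() time, so
	 * we cannot trust it; only the counters are left to decide.
	 */
	if (folio_ref_count(old_folio) == expected_refs(old_folio))
		return reuse(old_folio);	/* exclusive, just map writable */

	/*
	 * Any unexpected (temporary) reference -- page migration, a
	 * lockless lookup, ... -- lands here: we must copy, and if no
	 * free hugetlb folio is available, the faulting process gets
	 * killed, even though it only did fork()+exit().
	 */
	return copy_or_kill(old_folio);
}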
In the case of vmsplice() it's kind of malicious, and holding that refcount (with a mapcount of 0) doesn't sound like the common scenario to me.
Yes, I'm not that concerned about something that the app itself would be triggering (vmsplice).
IIUC, if we need to keep this, it is more about the case where (as you correctly mentioned in another follow-up reply) hugetlb isn't that flexible with memory overcommit, and in many cases it won't have extra pages floating around to allow ad-hoc CoWs? Random refcount boosts can easily happen, and the problem here is that we simply cannot tell those apart from malicious vmsplice() takers.
Yes, exactly.