Hi everybody,
as discussed in the linux-mm alignment session on Wednesday, this is part 1
of the COW fixes: fix the COW security issue using GUP-triggered
unsharing of shared anonymous pages (ordinary, THP, hugetlb). In the
meeting slides, this approach was referred to as "Copy On Read". If anybody
wants to have access to the slides, please feel free to reach out.
The patches are based on v5.16-rc5 and available at:
https://github.com/davidhildenbrand/linux/pull/new/unshare_v1
It is currently again possible for a child process to observe modifications
of anonymous pages performed by the parent process after fork() in some
cases, which is not only a violation of the POSIX semantics of MAP_PRIVATE,
but more importantly a real security issue.
This issue, including other related COW issues, has been summarized at [1]:
"
1. Observing Memory Modifications of Private Pages From A Child Process
Long story short: process-private memory might not be as private as you
think once you fork(): successive modifications of private memory
regions in the parent process can still be observed by the child
process, for example, by smart use of vmsplice()+munmap().
The core problem is that pinning pages readable in a child process, such
as done via the vmsplice system call, can result in a child process
observing memory modifications done in the parent process the child is
not supposed to observe. [1] contains an excellent summary and [2]
contains further details. This issue was assigned CVE-2020-29374 [9].
For this to trigger, it's required to use a fork() without subsequent
exec(), for example, as used under Android zygote. Without further
details about an application that forks less-privileged child processes,
one cannot really say what's actually affected and what's not -- see the
details section the end of this mail for a short sshd/openssh analysis.
While commit 17839856fd58 ("gup: document and work around "COW can break
either way" issue") fixed this issue and resulted in other problems
(e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
simplification") re-introduced part of the problem unfortunately.
The original reproducer can be modified quite easily to use THP [3] and
make the issue appear again on upstream kernels. I modified it to use
hugetlb [4] and it triggers as well. The problem is certainly less
severe with hugetlb than with THP; it merely highlights that we still
have plenty of open holes we should be closing/fixing.
Regarding vmsplice(), the only known workaround is to disallow the
vmsplice() system call ... or disable THP and hugetlb. But who knows
what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
the end, it's a more generic issue.
"
This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous THP and hugetlb again. The "security issue"
part for hugetlb might be less important than for THP. However, with this
approach it's just easy to get the MAP_PRIVATE semantics of any anonymous
pages in that regard and avoid any such information leaks without much
added complexity.
Ordinary anonymous pages are currently not affected, because the COW logic
was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification")
for them to COW on "page_count() != 1" instead of "mapcount > 1", which
unfortunately results in other COW issues, some of them documented in [1]
as well.
To fix this COW issue once and for all, introduce GUP-triggered unsharing
that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to
traditional COW, unsharing will leave the copied page mapped
write-protected in the page table, not having the semantics of a write
fault.
Logically, unsharing is triggered "early", as soon as GUP performs the
action that could result in a COW getting missed later and the security
issue triggering: however, unsharing is not triggered as before via a
write fault with undesired side effects.
Long story short, GUP triggers unsharing if all of the following conditions
are met:
* The page is mapped R/O
* We have an anonymous page, excluding KSM
* We want to read (!FOLL_WRITE)
* Unsharing is not disabled (!FOLL_NOUNSHARE)
* We want to take a reference (FOLL_GET or FOLL_PIN)
* The page is a shared anonymous page: mapcount > 1
To reliably detect shared anonymous THP without heavy locking, introduce
a mapcount_seqcount seqlock that protects the mapcount of a THP and can
be used to read an atomic mapcount value. The mapcount_seqlock is stored
inside the memmap of the compound page -- to keep it simple, factor out
a raw_seqlock_t from the seqlock_t.
As this patch series introduces the same unsharing logic for any
anonymous pages, it also paves the way to fix other COW issues, e.g.,
documented in [1], without reintroducing the security issue or
reintroducing other issues we observed in the past (e.g., broken ptrace on
pmem).
All reproducers for this COW issue have been consolidated in the selftest
included in this series. Hopefully we'll get this fixed for good.
Future work:
* get_user_pages_fast_only() can currently spin on the mapcount_seqcount
when reading the mapcount, which might be a rare event. While this is
fine even when done from get_user_pages_fast_only() in IRQ context, we
might want to just fail fast in get_user_pages_fast_only(). We already
have patches prepared that add page_anon_maybe_shared() and
page_trans_huge_anon_maybe_shared() that will return "true" in case
spinning would be required and make get_user_pages_fast_only() fail fast.
I'm excluding them for simplicity.
... even better would be finding a way to just not need the
mapcount_seqcount, but THP splitting and PageDoubleMap() gives us a
hard time -- but maybe we'll eventually find a way someday :)
* Part 2 will tackle the other user-space visible breakages / COW issues
raised in [1]. This series is the basis for adjusting the COW logic once
again without re-introducing the COW issue fixed in this series and
without reintroducing the issues we saw with the original CVE fix
(e.g., breaking ptrace on pmem). There might be further parts to improve
the GUP long-term <-> MM synchronicity and to optimize some things
around that.
The idea is by Andrea and some patches are rewritten versions of prototype
patches by Andrea. I cross-compiled and tested as good as possible.
I'll CC locking+selftest folks only on the relevant patch and the cover
letter to minimze the noise. I'll put everyone on CC who was either
involved with the COW issues in the past or attended the linux-mm alignment
session on Wednesday. Appologies if I forget anyone :)
[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
David Hildenbrand (11):
seqlock: provide lockdep-free raw_seqcount_t variant
mm: thp: consolidate mapcount logic on THP split
mm: simplify hugetlb and file-THP handling in __page_mapcount()
mm: thp: simlify total_mapcount()
mm: thp: allow for reading the THP mapcount atomically via a
raw_seqlock_t
mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required
(!hugetlb)
mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE
mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required
(hugetlb)
mm: thp: introduce and use page_trans_huge_anon_shared()
selftests/vm: add tests for the known COW security issues
Documentation/locking/seqlock.rst | 50 ++++
include/linux/huge_mm.h | 72 +++++
include/linux/mm.h | 14 +
include/linux/mm_types.h | 9 +
include/linux/seqlock.h | 145 +++++++---
mm/gup.c | 89 +++++-
mm/huge_memory.c | 120 +++++++--
mm/hugetlb.c | 129 +++++++--
mm/memory.c | 136 ++++++++--
mm/rmap.c | 40 +--
mm/swapfile.c | 35 ++-
mm/util.c | 24 +-
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/gup_cow.c | 312 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 16 ++
15 files changed, 1044 insertions(+), 148 deletions(-)
create mode 100644 tools/testing/selftests/vm/gup_cow.c
--
2.31.1
From: Mike Kravetz <mike.kravetz(a)oracle.com>
[ Upstream commit f5c73297181c6b3ad76537bad98eaad6d29b9333 ]
Currently, userfaultfd selftest for hugetlb as run from run_vmtests.sh
or any environment where there are 'just enough' hugetlb pages will
always fail with:
testing events (fork, remap, remove):
ERROR: UFFDIO_COPY error: -12 (errno=12, line=616)
The ENOMEM error code implies there are not enough hugetlb pages.
However, there are free hugetlb pages but they are all reserved. There
is a basic problem with the way the test allocates hugetlb pages which
has existed since the test was originally written.
Due to the way 'cleanup' was done between different phases of the test,
this issue was masked until recently. The issue was uncovered by commit
8ba6e8640844 ("userfaultfd/selftests: reinitialize test context in each
test").
For the hugetlb test, src and dst areas are allocated as PRIVATE
mappings of a hugetlb file. This means that at mmap time, pages are
reserved for the src and dst areas. At the start of event testing (and
other tests) the src area is populated which results in allocation of
huge pages to fill the area and consumption of reserves associated with
the area. Then, a child is forked to fault in the dst area. Note that
the dst area was allocated in the parent and hence the parent owns the
reserves associated with the mapping. The child has normal access to
the dst area, but can not use the reserves created/owned by the parent.
Thus, if there are no other huge pages available allocation of a page
for the dst by the child will fail.
Fix by not creating reserves for the dst area. In this way the child
can use free (non-reserved) pages.
Also, MAP_PRIVATE of a file only makes sense if you are interested in
the contents of the file before making a COW copy. The test does not do
this. So, just use MAP_ANONYMOUS | MAP_HUGETLB to create an anonymous
hugetlb mapping. There is no need to create a hugetlb file in the
non-shared case.
Link: https://lkml.kernel.org/r/20211217172919.7861-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Mina Almasry <almasrymina(a)google.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/vm/userfaultfd.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 60aa1a4fc69b6..81690f1737c80 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -86,7 +86,7 @@ static bool test_uffdio_minor = false;
static bool map_shared;
static int shm_fd;
-static int huge_fd;
+static int huge_fd = -1; /* only used for hugetlb_shared test */
static char *huge_fd_off0;
static unsigned long long *count_verify;
static int uffd = -1;
@@ -222,6 +222,9 @@ static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
static void hugetlb_release_pages(char *rel_area)
{
+ if (huge_fd == -1)
+ return;
+
if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
rel_area == huge_fd_off0 ? 0 : nr_pages * page_size,
nr_pages * page_size))
@@ -234,16 +237,17 @@ static void hugetlb_allocate_area(void **alloc_area)
char **alloc_area_alias;
*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
- (map_shared ? MAP_SHARED : MAP_PRIVATE) |
- MAP_HUGETLB,
- huge_fd, *alloc_area == area_src ? 0 :
- nr_pages * page_size);
+ map_shared ? MAP_SHARED :
+ MAP_PRIVATE | MAP_HUGETLB |
+ (*alloc_area == area_src ? 0 : MAP_NORESERVE),
+ huge_fd,
+ *alloc_area == area_src ? 0 : nr_pages * page_size);
if (*alloc_area == MAP_FAILED)
err("mmap of hugetlbfs file failed");
if (map_shared) {
area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_HUGETLB,
+ MAP_SHARED,
huge_fd, *alloc_area == area_src ? 0 :
nr_pages * page_size);
if (area_alias == MAP_FAILED)
--
2.34.1