The perf subsystem today unifies various tracing and monitoring
features, from both software and hardware sources. One benefit of the
perf subsystem is that events are automatically inherited by child
tasks, which enables process-wide event monitoring with low overhead.
By default, perf events are non-intrusive and do not affect the
behaviour of the tasks being monitored.
For certain use-cases, however, it makes sense to leverage the
generality of the perf events subsystem and optionally allow the tasks
being monitored to receive signals on events they are interested in.
This patch series adds the option to synchronously signal user space on
events.
To better support process-wide synchronous self-monitoring, without
events propagating to children that do not share the current process's
shared environment, two prerequisite patches are added: one to
optionally restrict inheritance to CLONE_THREAD children, and one to
optionally remove events on exec (without affecting the parent).
Examples of how to use these features can be found in the tests added at
the end of the series. In addition to the tests added, the series has
also been subjected to syzkaller fuzzing (focus on 'kernel/events/'
coverage).
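As a condensed illustration (the kselftests remain the authoritative
examples), a monitored process might install a SIGTRAP handler roughly as
below. TRAP_PERF and si_perf are the siginfo additions named in the patch
titles; since libc headers do not know about them yet, the real tests carry
their own copy of the kernel's siginfo_t, so the handler here only touches
standard fields and mentions the new ones in comments:

#include <signal.h>
#include <string.h>

static volatile sig_atomic_t nr_events;
static void *volatile last_addr;

static void sigtrap_handler(int sig, siginfo_t *info, void *ucontext)
{
        /*
         * The series sets si_code to TRAP_PERF and fills si_addr (and the
         * new si_perf field) with information about the triggering event,
         * e.g. the breakpoint address. Keep the handler minimal and
         * async-signal-safe: just record what happened.
         */
        nr_events++;
        last_addr = info->si_addr;
}

static void install_sigtrap_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigtrap_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGTRAP, &sa, NULL);
}

A sketch of opening a matching event follows the first motivation items
below.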
Motivation and Example Uses
---------------------------
1. Our immediate motivation is low-overhead sampling-based race
detection for user space [1]. By using perf_event_open() at
process initialization, we can create hardware
breakpoint/watchpoint events that are propagated automatically
to all threads in a process. As far as we are aware, today no
existing kernel facility (such as ptrace) allows us to set up
process-wide watchpoints with minimal overheads (that are
comparable to mprotect() of whole pages).
2. Other low-overhead error detectors that rely on detecting
accesses to certain memory locations or code, process-wide and
also only in a specific set of subtasks or threads.
[1] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf
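To make the race-detection use-case concrete, here is a hedged sketch of
opening such a process-wide watchpoint. PERF_TYPE_BREAKPOINT and the bp_*
fields are existing uapi; sigtrap, inherit_thread and remove_on_exec are the
new attribute bits named after the patch titles, and their exact uapi layout
is defined in the series:

#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static long long watched;       /* location the detector wants to watch */

static int open_watchpoint(void)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type           = PERF_TYPE_BREAKPOINT;
        attr.size           = sizeof(attr);
        attr.bp_type        = HW_BREAKPOINT_W;
        attr.bp_addr        = (unsigned long)&watched;
        attr.bp_len         = HW_BREAKPOINT_LEN_8;
        attr.sample_period  = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv     = 1;
        attr.inherit        = 1;   /* propagate to children ...           */
        attr.inherit_thread = 1;   /* ... but only CLONE_THREAD children  */
        attr.remove_on_exec = 1;   /* drop the event across exec          */
        attr.sigtrap        = 1;   /* synchronously SIGTRAP the thread    */

        /* pid == 0, cpu == -1: this thread (and, via inherit, its threads). */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                       PERF_FLAG_FD_CLOEXEC);
}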
Other use-case ideas we found interesting; they are listed only to
illustrate the range of potential and to further motivate the utility
(we're sure there are more):
3. Code hot patching without a full stop-the-world. Specifically, by
setting a code breakpoint at the entry to the patched routine, then
sending signals to threads and checking that they are not in the
routine, but without stopping them further. If any of the threads
enters the routine, it receives SIGTRAP and pauses.
4. Safepoints without mprotect(). Some Java implementations use
"load from a known memory location" as a safepoint. When threads
need to be stopped, the page containing the location is
mprotect()ed and threads get a signal. This could be replaced with
a watchpoint, which does not require a whole page nor DTLB
shootdowns.
5. Threads receiving signals on performance events to
throttle/unthrottle themselves.
6. Tracking data flow globally.
Changelog
---------
v3:
* Add patch "perf: Rework perf_event_exit_event()" to beginning of
series, courtesy of Peter Zijlstra.
* Rework "perf: Add support for event removal on exec" based on
the added "perf: Rework perf_event_exit_event()".
* Fix kselftests to work with more recent libc, due to the way it forces
using the kernel's own siginfo_t.
* Add basic perf-tool built-in test.
v2/RFC: https://lkml.kernel.org/r/20210310104139.679618-1-elver@google.com
* Patch "Support only inheriting events if cloned with CLONE_THREAD"
added to series.
* Patch "Add support for event removal on exec" added to series.
* Patch "Add kselftest for process-wide sigtrap handling" added to
series.
* Patch "Add kselftest for remove_on_exec" added to series.
* Implicitly restrict inheriting events if sigtrap is set but the child
was cloned with CLONE_CLEAR_SIGHAND, because it is not generally safe to
keep sending SIGTRAP if the child cleared all signal handlers.
* Various minor fixes (see details in patches).
v1/RFC: https://lkml.kernel.org/r/20210223143426.2412737-1-elver@google.com
Pre-series: The discussion at [2] led to the changes in this series. The
approach taken in "Add support for SIGTRAP on perf events" to trigger
the signal was suggested by Peter Zijlstra in [3].
[2] https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZ…
[3] https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@hirez.programming.kicks-ass.n…
Marco Elver (10):
perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
perf: Support only inheriting events if cloned with CLONE_THREAD
perf: Add support for event removal on exec
signal: Introduce TRAP_PERF si_code and si_perf to siginfo
perf: Add support for SIGTRAP on perf events
perf: Add breakpoint information to siginfo on SIGTRAP
selftests/perf_events: Add kselftest for process-wide sigtrap handling
selftests/perf_events: Add kselftest for remove_on_exec
tools headers uapi: Sync tools/include/uapi/linux/perf_event.h
perf test: Add basic stress test for sigtrap handling
Peter Zijlstra (1):
perf: Rework perf_event_exit_event()
arch/m68k/kernel/signal.c | 3 +
arch/x86/kernel/signal_compat.c | 5 +-
fs/signalfd.c | 4 +
include/linux/compat.h | 2 +
include/linux/perf_event.h | 6 +-
include/linux/signal.h | 1 +
include/uapi/asm-generic/siginfo.h | 6 +-
include/uapi/linux/perf_event.h | 5 +-
include/uapi/linux/signalfd.h | 4 +-
kernel/events/core.c | 297 +++++++++++++-----
kernel/fork.c | 2 +-
kernel/signal.c | 11 +
tools/include/uapi/linux/perf_event.h | 5 +-
tools/perf/tests/Build | 1 +
tools/perf/tests/builtin-test.c | 5 +
tools/perf/tests/sigtrap.c | 148 +++++++++
tools/perf/tests/tests.h | 1 +
.../testing/selftests/perf_events/.gitignore | 3 +
tools/testing/selftests/perf_events/Makefile | 6 +
tools/testing/selftests/perf_events/config | 1 +
.../selftests/perf_events/remove_on_exec.c | 260 +++++++++++++++
tools/testing/selftests/perf_events/settings | 1 +
.../selftests/perf_events/sigtrap_threads.c | 206 ++++++++++++
23 files changed, 896 insertions(+), 87 deletions(-)
create mode 100644 tools/perf/tests/sigtrap.c
create mode 100644 tools/testing/selftests/perf_events/.gitignore
create mode 100644 tools/testing/selftests/perf_events/Makefile
create mode 100644 tools/testing/selftests/perf_events/config
create mode 100644 tools/testing/selftests/perf_events/remove_on_exec.c
create mode 100644 tools/testing/selftests/perf_events/settings
create mode 100644 tools/testing/selftests/perf_events/sigtrap_threads.c
--
2.31.0.291.g576ba9dcdaf-goog
Previously, we shared too much of the code with COPY and ZEROPAGE, so we
manipulated things in various invalid ways:
- Previously, we unconditionally called shmem_inode_acct_block. In the
continue case, we're looking up an existing page which would have been
accounted for properly when it was allocated. So doing it twice
results in double-counting, and eventually leaking.
- Previously, we made the pte writable whenever the VMA was writable.
However, for continue, consider this case:
1. A tmpfs file was created
2. The non-UFFD-registered side mmap()-s with MAP_SHARED
3. The UFFD-registered side mmap()-s with MAP_PRIVATE
In this case, even though the UFFD-registered VMA may be writable, we
still want CoW behavior. So, check for this case and don't make the
pte writable.
- The initial pgoff / max_off check isn't necessary, so we can skip past
it. The second one seems likely to be unnecessary too, but keep it
just in case. Modify both checks to use pgoff, as offset is equivalent
and not needed.
- Previously, we unconditionally called ClearPageDirty() in the error
path. In the continue case though, since this is an existing page, it
might have already been dirty before we started touching it. It's very
problematic to clear the bit incorrectly, but not a problem to leave
it - so, just omit the ClearPageDirty() entirely.
- Previously, we unconditionally removed the page from the page cache in
the error path. But in the continue case, we didn't add it - it was
already there because the page is present in some second
(non-UFFD-registered) mapping. So, removing it is invalid.
Because the error handling issues are easy to exercise in the selftest,
make a small modification there to do so.
Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
check for that mode first thing, and then "goto" down to where the parts
we actually want are. This leaves the code in between cleaner.
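For readers less familiar with the minor-fault flow this function serves,
here is a hedged user-space sketch of the double-mapping setup described
above (error checking and userfaultfd feature negotiation are largely
elided; UFFD_FEATURE_MINOR_SHMEM follows the naming of the shmem
minor-fault series, and the selftest below remains the real reference):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        int memfd = memfd_create("area", 0);

        ftruncate(memfd, page);

        /* Mapping 1: not UFFD-registered, shared, and already populated,
         * so the page sits in the shmem page cache. */
        char *shared = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_SHARED, memfd, 0);
        shared[0] = 1;

        /* Mapping 2: UFFD-registered in MINOR mode. MAP_PRIVATE here is
         * exactly the CoW case discussed above: the installed PTE must
         * not be made writable even though the VMA is. */
        char *priv = mmap(NULL, page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, memfd, 0);

        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_SHMEM,
        };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)priv, .len = page },
                .mode  = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* On a minor fault, nothing is copied; the kernel is only asked
         * to install PTEs for the page already in the page cache -- the
         * path handled by shmem_mcopy_atomic_pte() with is_continue. */
        struct uffdio_continue cont = {
                .range = { .start = (unsigned long)priv, .len = page },
        };
        ioctl(uffd, UFFDIO_CONTINUE, &cont);
        return 0;
}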
Changes since v2:
- Drop the ClearPageDirty() entirely, instead of trying to remember the
old value.
- Modify both pgoff / max_off checks to use pgoff. It's equivalent to
offset, but offset wasn't initialized until the first check (which
we're skipping).
- Keep the second pgoff / max_off check in the continue case.
Changes since v1:
- Refactor to skip ahead with goto, instead of adding several more
"if (!is_continue)".
- Fix unconditional ClearPageDirty().
- Don't pte_mkwrite() when is_continue && !VM_SHARED.
Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen(a)google.com>
---
mm/shmem.c | 60 +++++++++++++-----------
tools/testing/selftests/vm/userfaultfd.c | 12 +++++
2 files changed, 44 insertions(+), 28 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index d2e0e81b7d2e..fbcce850a16e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2377,18 +2377,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct page *page;
pte_t _dst_pte, *dst_pte;
int ret;
- pgoff_t offset, max_off;
-
- ret = -ENOMEM;
- if (!shmem_inode_acct_block(inode, 1))
- goto out;
+ pgoff_t max_off;
+ int writable;
if (is_continue) {
ret = -EFAULT;
page = find_lock_page(mapping, pgoff);
if (!page)
- goto out_unacct_blocks;
- } else if (!*pagep) {
+ goto out;
+ goto install_ptes;
+ }
+
+ ret = -ENOMEM;
+ if (!shmem_inode_acct_block(inode, 1))
+ goto out;
+
+ if (!*pagep) {
page = shmem_alloc_page(gfp, info, pgoff);
if (!page)
goto out_unacct_blocks;
@@ -2415,30 +2419,29 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
*pagep = NULL;
}
- if (!is_continue) {
- VM_BUG_ON(PageSwapBacked(page));
- VM_BUG_ON(PageLocked(page));
- __SetPageLocked(page);
- __SetPageSwapBacked(page);
- __SetPageUptodate(page);
- }
+ VM_BUG_ON(PageSwapBacked(page));
+ VM_BUG_ON(PageLocked(page));
+ __SetPageLocked(page);
+ __SetPageSwapBacked(page);
+ __SetPageUptodate(page);
ret = -EFAULT;
- offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off))
+ if (unlikely(pgoff >= max_off))
goto out_release;
- /* If page wasn't already in the page cache, add it. */
- if (!is_continue) {
- ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
- gfp & GFP_RECLAIM_MASK, dst_mm);
- if (ret)
- goto out_release;
- }
+ ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
+ gfp & GFP_RECLAIM_MASK, dst_mm);
+ if (ret)
+ goto out_release;
+install_ptes:
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
- if (dst_vma->vm_flags & VM_WRITE)
+ /* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
+ writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
+ ? 0
+ : dst_vma->vm_flags & VM_WRITE;
+ if (writable)
_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
else {
/*
@@ -2455,7 +2458,7 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
ret = -EFAULT;
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off))
+ if (unlikely(pgoff >= max_off))
goto out_release_unlock;
ret = -EEXIST;
@@ -2485,13 +2488,14 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
return ret;
out_release_unlock:
pte_unmap_unlock(dst_pte, ptl);
- ClearPageDirty(page);
- delete_from_page_cache(page);
+ if (!is_continue)
+ delete_from_page_cache(page);
out_release:
unlock_page(page);
put_page(page);
out_unacct_blocks:
- shmem_inode_unacct_blocks(inode, 1);
+ if (!is_continue)
+ shmem_inode_unacct_blocks(inode, 1);
goto out;
}
#endif /* CONFIG_USERFAULTFD */
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index f6c86b036d0f..d8541a59dae5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -485,6 +485,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
static void continue_range(int ufd, __u64 start, __u64 len)
{
struct uffdio_continue req;
+ int ret;
req.range.start = start;
req.range.len = len;
@@ -493,6 +494,17 @@ static void continue_range(int ufd, __u64 start, __u64 len)
if (ioctl(ufd, UFFDIO_CONTINUE, &req))
err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
(uint64_t)start);
+
+ /*
+ * Error handling within the kernel for continue is subtly different
+ * from copy or zeropage, so it may be a source of bugs. Trigger an
+ * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
+ */
+ req.mapped = 0;
+ ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
+ if (ret >= 0 || req.mapped != -EEXIST)
+ err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
+ ret, req.mapped);
}
static void *locking_thread(void *arg)
--
2.31.0.291.g576ba9dcdaf-goog
From: Ira Weiny <ira.weiny(a)intel.com>
Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).
Generally PKS enables protections on 'domains' of supervisor pages to limit
supervisor mode access to pages beyond the normal paging protections. PKS
works in a similar fashion to user space pkeys, PKU. As with PKU, supervisor
pkeys are checked in addition to normal paging protections, and accesses or
writes can be disabled via an MSR update, without TLB flushes, when
permissions change.
Also like PKU, a page mapping is assigned to a domain by setting pkey bits in
the page table entry for that mapping.
Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
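As an illustration of what "updated via WRMSR" means in practice, a sketch
of the value computation follows; the layout mirrors PKRU with two bits per
key, the helper name is illustrative rather than the series' API, and
MSR_IA32_PKRS is the MSR definition the series adds to msr-index.h:

/* Two bits per pkey, in the same layout as PKRU. */
#define PKR_AD_BIT        0x1   /* Access Disable */
#define PKR_WD_BIT        0x2   /* Write Disable  */
#define PKR_BITS_PER_PKEY 2

static inline u32 pkrs_set_key(u32 pkrs, int pkey, u32 flags)
{
        u32 shift = pkey * PKR_BITS_PER_PKEY;

        pkrs &= ~(0x3u << shift);               /* clear the old AD/WD pair */
        return pkrs | ((flags & 0x3u) << shift);
}

/*
 * The new value takes effect with a plain WRMSR to the PKRS MSR; no page
 * table change or TLB flush is involved, e.g. (rdpkrs() illustrative):
 *
 *     wrmsrl(MSR_IA32_PKRS, pkrs_set_key(rdpkrs(), pkey, PKR_AD_BIT));
 */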
XSAVE is not supported for the PKRS MSR. Therefore the implementation
saves/restores the MSR across context switches and during exceptions. Nested
exceptions are supported by each exception getting a new PKS state.
For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.
Other keys (1-15) are allocated by an allocator, which prepares us for key
contention from day one. Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
CPU instance.
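A hedged sketch of how an in-kernel user might drive this is shown below;
the function names (pks_key_alloc(), pks_mk_noaccess(), pks_mk_readwrite(),
PAGE_KERNEL_PKEY()) follow the series' API patch and documentation and may
differ in detail, and the flags argument to pks_key_alloc() is the one
mentioned in the changelog:

static int my_pkey = -1;

static int my_feature_init(void)
{
        my_pkey = pks_key_alloc("my_feature", 0);
        if (my_pkey < 0)
                return my_pkey; /* key exhaustion, or CPU lacks PKS */

        /*
         * Pages belonging to the domain are mapped with the pkey encoded
         * in their PTEs, e.g. by using a pgprot of
         * PAGE_KERNEL_PKEY(my_pkey) when the mapping is created.
         */

        /* Default state: no supervisor access at all. */
        pks_mk_noaccess(my_pkey);
        return 0;
}

static void my_feature_write(u8 *addr, u8 val)
{
        /* Open the domain only around the legitimate access. */
        pks_mk_readwrite(my_pkey);
        *addr = val;
        pks_mk_noaccess(my_pkey);
}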
The following are key attributes of PKS.
1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
2) Works on a per thread basis
PKS is available with 4- and 5-level paging. Like the user-space pkeys (PKU),
it consumes 4 bits from the PTE to store the pkey within the entry.
All code to support PKS is configured via ARCH_ENABLE_SUPERVISOR_PKEYS, which
is designed to be turned on only when a user of PKS is configured in the
kernel. Those users must depend on ARCH_HAS_SUPERVISOR_PKEYS so that they
build properly on architectures which do not yet support PKS.
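In Kconfig terms, a hypothetical user (MY_FEATURE is a placeholder symbol)
would wire that up roughly the way the test code in the series does:

config MY_FEATURE
        bool "My PKS-protected feature"
        depends on ARCH_HAS_SUPERVISOR_PKEYS
        select ARCH_ENABLE_SUPERVISOR_PKEYS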
Originally this series was submitted as part of a large patch set which
converted the kmap call sites.[1]
Many follow-on discussions revealed a few problems, the first of which was
that some callers leak a kmap mapping across threads rather than containing
it to a critical section. Attempts were made to see if these 'global kmaps'
could be supported.[2] However, supporting global kmaps had many problems.
Work is being done in parallel on converting as many kmap calls as possible
to the new kmap_local_page().[3]
Changes from V3 [4]
Add ARCH_ENABLE_SUPERVISOR_PKEYS config which is selected by kernel
users to add the functionality to the core. However, they should only
select this if ARCH_HAS_SUPERVISOR_PKEYS is available.
Clean up test code for context switching
Adjust for extended_pt_regs
Reduce output unless --debug is specified
Address internal review comments from Dan Williams and Dave Hansen
Help with macros and assembly coding
Change names of various functions
Clean up documentation
Move all #ifdefery into header files.
Clean up cover letter.
Make extended_pt_regs handling a macro rather than coding
around every call to C
Add macros for PKS shift/mask
New patch: x86/pks: Add additional PKEY helper macros
Preserve pkrs_cache as static when PKS_TEST is not configured
Remove unnecessary pr_* prints
Clarify pks_key_alloc flags parameter
Change CONFIG_PKS_TESTING to CONFIG_PKS_TEST
Clean up test code separation from main code in fault.c
Remove module boilerplate from test code
Clean up all commit messages
Address comments from Thomas Gleixner
Provide a warning and fallback to no protection if a global
mapping is requested.
Fix context switch. Fix where pks_sched_in() is called.
Fix test to actually do a context switch
Remove unnecessary noinstr's
From Andy Lutomirski
Use extended_pt_regs idea to stash pks values on the stack
Drop patches 5/10 and 7/10
And use extended_pt_regs to print pkey info on fault
Adjust tests
Comments from Randy Dunlap:
Fix grammatical errors in doc
Clean up kernel docs
Rebase to 5.12
[1] https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/lkml/87mtycqcjf.fsf@nanos.tec.linutronix.de/
[3] https://lore.kernel.org/lkml/20210128061503.1496847-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210210062221.3023586-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210205170030.856723-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210217024826.3466046-1-ira.weiny@intel.com/
[4] https://lore.kernel.org/lkml/20201106232908.364581-1-ira.weiny@intel.com/
Fenghua Yu (1):
x86/pks: Add PKS kernel API
Ira Weiny (9):
x86/pkeys: Create pkeys_common.h
x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
x86/pks: Add additional PKEY helper macros
x86/pks: Add PKS defines and Kconfig options
x86/pks: Add PKS setup code
x86/fault: Adjust WARN_ON for PKey fault
x86/pks: Preserve the PKRS MSR on context switch
x86/entry: Preserve PKRS MSR across exceptions
x86/pks: Add PKS test code
Documentation/core-api/protection-keys.rst | 111 +++-
arch/x86/Kconfig | 1 +
arch/x86/entry/calling.h | 26 +
arch/x86/entry/common.c | 58 ++
arch/x86/entry/entry_64.S | 22 +-
arch/x86/entry/entry_64_compat.S | 6 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable.h | 10 +-
arch/x86/include/asm/pgtable_types.h | 12 +
arch/x86/include/asm/pkeys.h | 4 +
arch/x86/include/asm/pkeys_common.h | 34 +
arch/x86/include/asm/pks.h | 54 ++
arch/x86/include/asm/processor-flags.h | 2 +
arch/x86/include/asm/processor.h | 43 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/head_64.S | 7 +-
arch/x86/kernel/process.c | 3 +
arch/x86/kernel/process_64.c | 2 +
arch/x86/mm/fault.c | 27 +-
arch/x86/mm/pkeys.c | 218 +++++-
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 34 +
kernel/entry/common.c | 14 +-
lib/Kconfig.debug | 11 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 693 ++++++++++++++++++++
mm/Kconfig | 5 +
tools/testing/selftests/x86/Makefile | 3 +-
tools/testing/selftests/x86/test_pks.c | 150 +++++
34 files changed, 1519 insertions(+), 77 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 arch/x86/include/asm/pks.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c
--
2.28.0.rc0.12.gb6a658bd00c9
If a signed number field starts with a '-' the field width must be > 1,
or unlimited, to allow at least one digit after the '-'.
This patch adds a check for this: if a signed field starts with '-' and
field_width == 1, the scan stops without converting anything further.
It is ok for a signed number field to have a field width of 1 if it
starts with a digit. In that case the single digit can be converted.
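For illustration, the effect on a caller of the kernel's sscanf() (the
return value is the number of successful conversions):

int v, n;

n = sscanf("-7", "%1d", &v);  /* width 1: only '-' fits, no digit can follow,
                               * so the scan stops here and n == 0 */
n = sscanf("-7", "%2d", &v);  /* width 2: sign plus one digit; n == 1, v == -7 */
n = sscanf("7",  "%1d", &v);  /* width 1 starting with a digit is fine;
                               * n == 1, v == 7 */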
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reviewed-by: Petr Mladek <pmladek(a)suse.com>
Acked-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
---
lib/vsprintf.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 41ddc353ebb8..f78651e9b030 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -3466,8 +3466,12 @@ int vsscanf(const char *buf, const char *fmt, va_list args)
str = skip_spaces(str);
digit = *str;
- if (is_sign && digit == '-')
+ if (is_sign && digit == '-') {
+ if (field_width == 1)
+ break;
+
digit = *(str + 1);
+ }
if (!digit
|| (base == 16 && !isxdigit(digit))
--
2.20.1
Previously, we shared too much of the code with COPY and ZEROPAGE, so we
manipulated things in various invalid ways:
- Previously, we unconditionally called shmem_inode_acct_block. In the
continue case, we're looking up an existing page which would have been
accounted for properly when it was allocated. So doing it twice
results in double-counting, and eventually leaking.
- Previously, we made the pte writable whenever the VMA was writable.
However, for continue, consider this case:
1. A tmpfs file was created
2. The non-UFFD-registered side mmap()-s with MAP_SHARED
3. The UFFD-registered side mmap()-s with MAP_PRIVATE
In this case, even though the UFFD-registered VMA may be writable, we
still want CoW behavior. So, check for this case and don't make the
pte writable.
- The offset / max_off checking doesn't necessarily hurt anything, but
it's not needed in the CONTINUE case, so skip it.
- Previously, we unconditionally called ClearPageDirty() in the error
path. In the continue case though, since this is an existing page, it
might have already been dirty before we started touching it. So,
remember whether or not it was dirty before we set_page_dirty(), and
only clear the bit if it wasn't dirty before.
- Previously, we unconditionally removed the page from the page cache in
the error path. But in the continue case, we didn't add it - it was
already there because the page is present in some second
(non-UFFD-registered) mapping. So, removing it is invalid.
Because the error handling issues are easy to exercise in the selftest,
make a small modification there to do so.
Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
check for that mode first thing, and then "goto" down to where the parts
we actually want are. This leaves the code in between cleaner.
Changes since v1:
- Refactor to skip ahead with goto, instead of adding several more
"if (!is_continue)".
- Fix unconditional ClearPageDirty().
- Don't pte_mkwrite() when is_continue && !VM_SHARED.
Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen(a)google.com>
---
mm/shmem.c | 67 ++++++++++++++----------
tools/testing/selftests/vm/userfaultfd.c | 12 +++++
2 files changed, 51 insertions(+), 28 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index d2e0e81b7d2e..8ab1f1f29987 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2378,17 +2378,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
pte_t _dst_pte, *dst_pte;
int ret;
pgoff_t offset, max_off;
-
- ret = -ENOMEM;
- if (!shmem_inode_acct_block(inode, 1))
- goto out;
+ int writable;
+ bool was_dirty;
if (is_continue) {
ret = -EFAULT;
page = find_lock_page(mapping, pgoff);
if (!page)
- goto out_unacct_blocks;
- } else if (!*pagep) {
+ goto out;
+ goto install_ptes;
+ }
+
+ ret = -ENOMEM;
+ if (!shmem_inode_acct_block(inode, 1))
+ goto out;
+
+ if (!*pagep) {
page = shmem_alloc_page(gfp, info, pgoff);
if (!page)
goto out_unacct_blocks;
@@ -2415,13 +2420,11 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
*pagep = NULL;
}
- if (!is_continue) {
- VM_BUG_ON(PageSwapBacked(page));
- VM_BUG_ON(PageLocked(page));
- __SetPageLocked(page);
- __SetPageSwapBacked(page);
- __SetPageUptodate(page);
- }
+ VM_BUG_ON(PageSwapBacked(page));
+ VM_BUG_ON(PageLocked(page));
+ __SetPageLocked(page);
+ __SetPageSwapBacked(page);
+ __SetPageUptodate(page);
ret = -EFAULT;
offset = linear_page_index(dst_vma, dst_addr);
@@ -2429,16 +2432,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
if (unlikely(offset >= max_off))
goto out_release;
- /* If page wasn't already in the page cache, add it. */
- if (!is_continue) {
- ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
- gfp & GFP_RECLAIM_MASK, dst_mm);
- if (ret)
- goto out_release;
- }
+ ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
+ gfp & GFP_RECLAIM_MASK, dst_mm);
+ if (ret)
+ goto out_release;
+install_ptes:
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
- if (dst_vma->vm_flags & VM_WRITE)
+ /* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
+ writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
+ ? 0
+ : dst_vma->vm_flags & VM_WRITE;
+ if (writable)
_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
else {
/*
@@ -2448,15 +2453,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
* unconditionally before unlock_page(), but doing it
* only if VM_WRITE is not set is faster.
*/
+ was_dirty = PageDirty(page);
set_page_dirty(page);
}
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
- ret = -EFAULT;
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off))
- goto out_release_unlock;
+ if (!is_continue) {
+ ret = -EFAULT;
+ max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ if (unlikely(offset >= max_off))
+ goto out_release_unlock;
+ }
ret = -EEXIST;
if (!pte_none(*dst_pte))
@@ -2485,13 +2493,16 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
return ret;
out_release_unlock:
pte_unmap_unlock(dst_pte, ptl);
- ClearPageDirty(page);
- delete_from_page_cache(page);
+ if (!was_dirty)
+ ClearPageDirty(page);
+ if (!is_continue)
+ delete_from_page_cache(page);
out_release:
unlock_page(page);
put_page(page);
out_unacct_blocks:
- shmem_inode_unacct_blocks(inode, 1);
+ if (!is_continue)
+ shmem_inode_unacct_blocks(inode, 1);
goto out;
}
#endif /* CONFIG_USERFAULTFD */
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index f6c86b036d0f..d8541a59dae5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -485,6 +485,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
static void continue_range(int ufd, __u64 start, __u64 len)
{
struct uffdio_continue req;
+ int ret;
req.range.start = start;
req.range.len = len;
@@ -493,6 +494,17 @@ static void continue_range(int ufd, __u64 start, __u64 len)
if (ioctl(ufd, UFFDIO_CONTINUE, &req))
err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
(uint64_t)start);
+
+ /*
+ * Error handling within the kernel for continue is subtly different
+ * from copy or zeropage, so it may be a source of bugs. Trigger an
+ * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
+ */
+ req.mapped = 0;
+ ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
+ if (ret >= 0 || req.mapped != -EEXIST)
+ err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
+ ret, req.mapped);
}
static void *locking_thread(void *arg)
--
2.31.0.291.g576ba9dcdaf-goog