When executing dirty_log_test on aarch64 machines, it sometimes
triggers the ASSERT:
==== Test Assertion Failure ====
dirty_log_test.c:384: dirty_ring_vcpu_ring_full
pid=14854 tid=14854 errno=22 - Invalid argument
1 0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
2 0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
3 (inlined by) run_test at dirty_log_test.c:802
4 0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
5 0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
6 0x0000ffff9be173c7: ?? ??:0
7 0x0000ffff9be1749f: ?? ??:0
8 0x000000000040206f: _start at ??:?
Didn't continue vcpu even without ring full
The dirty_log_test fails when running the dirty-ring test because
sem_vcpu_cont and sem_vcpu_stop hold non-zero values by the time
dirty_ring_collect_dirty_pages() is executed. When those two sem_t
variables are non-zero, the dirty_ring_wait_vcpu() at the beginning of
dirty_ring_collect_dirty_pages() does not wait for the vcpu to stop,
but goes on to execute the following code. In that case, if
dirty_ring_vcpu_ring_full is true before the vcpu stops, and
dirty_ring_collect_dirty_pages() has passed the check of
dirty_ring_vcpu_ring_full but has not yet executed the check of
continued_vcpu, the vcpu stops and sets dirty_ring_vcpu_ring_full to
false. dirty_ring_collect_dirty_pages() then triggers the ASSERT.
Why can sem_vcpu_cont and sem_vcpu_stop hold non-zero values? Because
dirty_ring_before_vcpu_join() executes sem_post(&sem_vcpu_cont) at the
end of each dirty-ring test. This leaves the semaphores unbalanced in
two ways:
1. sem_vcpu_cont becomes non-zero. When host_quit is set to true, the
   vcpu_worker sees it immediately and quits, so when
   log_mode_before_vcpu_join() bumps sem_vcpu_cont to 1 there is no
   vcpu_worker left to consume it.
2. sem_vcpu_stop becomes non-zero. When host_quit is set to true while
   the vcpu_worker has entered the guest, the next time it exits from
   the guest it bumps sem_vcpu_stop to 1 and then sees host_quit, so
   nothing consumes sem_vcpu_stop.
As more and more dirty-ring tests run, sem_vcpu_cont and sem_vcpu_stop
can grow larger and larger, which makes many code paths skip waiting on
the sem_t values and finally causes the problem.
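For reference, a minimal user-space sketch (not part of the patch,
plain POSIX semaphores; build with gcc -pthread) showing why a
leftover sem_post() makes the next sem_wait() return immediately
instead of blocking:

#include <semaphore.h>
#include <stdio.h>

int main(void)
{
        sem_t sem;
        int val;

        sem_init(&sem, 0, 0);
        sem_post(&sem);                 /* a post that nobody consumed */
        sem_getvalue(&sem, &val);
        printf("value before wait: %d\n", val); /* prints 1 */
        sem_wait(&sem);                 /* returns immediately, no blocking */
        sem_getvalue(&sem, &val);
        printf("value after wait: %d\n", val);  /* prints 0 */
        return 0;
}

This is exactly the state dirty_ring_wait_vcpu() sees when a previous
test iteration leaked a post.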
To fix this problem, wait a while before setting host_quit to true,
which gives the vcpu time to enter the guest so it will exit again.
Then wait for the vcpu to exit and let it continue again, at which
point it sees host_quit. Thus sem_vcpu_cont and sem_vcpu_stop are both
zero when the test finishes.
Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
---
v1->v2:
- Fix the real logic bug, not just refresh the context.
v1: https://lore.kernel.org/all/20231116093536.22256-1-shahuang@redhat.com/
---
tools/testing/selftests/kvm/dirty_log_test.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..a6e0ff46a07c 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -417,7 +417,8 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
static void dirty_ring_before_vcpu_join(void)
{
- /* Kick another round of vcpu just to make sure it will quit */
+ /* Wait for the vcpu to exit, and let it continue so it sees host_quit. */
+ dirty_ring_wait_vcpu();
sem_post(&sem_vcpu_cont);
}
@@ -719,6 +720,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vm *vm;
unsigned long *bmap;
uint32_t ring_buf_idx = 0;
+ int sem_val;
if (!log_mode_supported()) {
print_skip("Log mode '%s' not supported",
@@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
return;
}
+ sem_getvalue(&sem_vcpu_stop, &sem_val);
+ assert(sem_val == 0);
+ sem_getvalue(&sem_vcpu_cont, &sem_val);
+ assert(sem_val == 0);
+
/*
* We reserve page table for 2 times of extra dirty mem which
* will definitely cover the original (1G+) test range. Here
@@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
sync_global_to_guest(vm, iteration);
}
+ /*
+ * Before we set host_quit, give the vcpu time to run, so that we
+ * are sure to consume the pending sem_vcpu_stop and the vcpu
+ * consumes the pending sem_vcpu_cont, keeping the semaphore
+ * counts balanced.
+ */
+ usleep(p->interval * 1000);
/* Tell the vcpu thread to quit */
host_quit = true;
log_mode_before_vcpu_join();
--
2.40.1
This patch series proposes support for guest VMs running in user mode
and in a canonical linear address organization.
The first part partitions the 64-bit canonical linear address space
into two halves belonging to user mode and supervisor mode
respectively, similar to the organization of linear addresses used in
the Linux OS. Currently the linear addresses use the 48-bit canonical
format, in which bits 63:47 of the address are identical.
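As an illustrative user-space helper (hypothetical, not code from this
series), checking the 48-bit canonical form amounts to sign-extending
from bit 47 and comparing:

#include <stdbool.h>
#include <stdint.h>

/* True if bits 63:47 of va are identical, i.e. 48-bit canonical. */
static bool is_canonical_48(uint64_t va)
{
        return (uint64_t)((int64_t)(va << 16) >> 16) == va;
}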
The second part sets up page tables mapping the same guest physical
addresses of the test code and data segments into both the user-mode
and supervisor-mode address spaces. This allows a guest in either
runtime mode, i.e. user or supervisor, to run one code base in the
corresponding linear address space.
It also provides the runtime environment setup API for switching to
user-mode execution.
Zeng Guang (8):
KVM: selftests: x86: Fix bug in addr_arch_gva2gpa()
KVM: selftests: x86: Support guest running on canonical linear-address
organization
KVM: selftests: Add virt_arch_ucall_prealloc() arch specific
implementation
KVM: selftests: Adapt selftest cases to kernel canonical linear
address
KVM: selftests: x86: Prepare setup for user mode support
KVM: selftests: x86: Allow user to access user-mode address and I/O
address space
KVM: selftests: x86: Support vcpu run in user mode
KVM: selftests: x86: Add KVM forced emulation prefix capability
.../selftests/kvm/include/kvm_util_base.h | 20 ++-
.../selftests/kvm/include/x86_64/processor.h | 48 ++++++-
.../selftests/kvm/lib/aarch64/processor.c | 5 +
tools/testing/selftests/kvm/lib/kvm_util.c | 6 +-
.../selftests/kvm/lib/riscv/processor.c | 5 +
.../selftests/kvm/lib/s390x/processor.c | 5 +
.../testing/selftests/kvm/lib/ucall_common.c | 2 +
.../selftests/kvm/lib/x86_64/processor.c | 117 ++++++++++++++----
.../selftests/kvm/set_memory_region_test.c | 13 +-
.../testing/selftests/kvm/x86_64/debug_regs.c | 2 +-
.../kvm/x86_64/userspace_msr_exit_test.c | 9 +-
11 files changed, 195 insertions(+), 37 deletions(-)
--
2.21.3
The idea of this RFC is to introduce a way to catalogue and document
any tests that should be executed for changes to a subsystem, to make
checkpatch.pl require a tag in commit messages certifying they were
run, and hopefully to make the tests easier to discover and run.
This follows a discussion Veronika Kabatova started with a few
(addressed) people at the LPC last year (IIRC), where there was a good
deal of interest in something like this.
Apart from implementing basic support (surely to be improved), two
sample changes are added on top, adding a few test suites (roughly)
based on what the maintainers described earlier. I'm definitely not
qualified to describe them adequately, and don't have the time to dig
deeper, but hopefully they can serve as illustrations; they shouldn't
be merged as is.
I would defer to maintainers of the corresponding subsystems and tests to
describe their tests and requirements better. Although I would accept
amendments too, if they prefer it that way.
One bug I know is definitely there is the handling of removed files:
scripts/get_maintainer.pl chokes on non-existing files, failing to
output the required test suites (I'm sure there's a good reason, but I
couldn't see it). My first idea is to only check for required tests
upon encountering the '+++ <file>' line, and to ignore the '/dev/null'
file, but I hope the checkpatch.pl maintainers can recommend a better
way.
Anyway, tell me what you think, and I'll work on polishing this.
Thank you!
Nick
---
Nikolai Kondrashov (3):
MAINTAINERS: Introduce V: field for required tests
MAINTAINERS: Require kvm-xfstests smoke for ext4
MAINTAINERS: Require kunit core tests for framework changes
Documentation/process/submitting-patches.rst | 19 +++++
Documentation/process/tests.rst | 80 ++++++++++++++++++
MAINTAINERS | 8 ++
scripts/checkpatch.pl | 118 ++++++++++++++++++++++++++-
scripts/get_maintainer.pl | 17 +++-
scripts/parse-maintainers.pl | 3 +-
6 files changed, 241 insertions(+), 4 deletions(-)
---
Non-contiguous CBM support for Intel CAT has been merged into the
kernel with commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous
CBMs in Intel CAT"), but there is no selftest validating that this
feature works correctly.
The selftest needs to verify that writing non-contiguous CBMs to the
schemata file behaves as expected given the reported support for
non-contiguous CBMs.
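In rough terms, the check boils down to the following sketch (an
illustration under stated assumptions, not the actual test: it
presumes resctrl is mounted at /sys/fs/resctrl and that support is
reported via the info/L3/sparse_masks file):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/fs/resctrl/schemata", "w");
        int ok;

        if (!f)
                return 1;
        /* 0xf00f is non-contiguous: bits 0-3 and 12-15 set, with a hole. */
        ok = fprintf(f, "L3:0=f00f\n") > 0 && fclose(f) == 0;
        printf("non-contiguous CBM was %s\n", ok ? "accepted" : "rejected");
        /* The outcome must agree with what info/L3/sparse_masks reports. */
        return 0;
}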
The patch series is based on a rework of the resctrl selftests that's
currently in review [1]. The series also implements functionality
similar to that of the bash script included in the cover letter of the
original non-contiguous CBMs in Intel CAT series [2].
Changelog v2:
- Rebase onto v3 of [1] series.
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (3):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 80 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 5 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 12 ++-
.../testing/selftests/resctrl/resctrl_tests.c | 18 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 86 ++++++++++++++++---
7 files changed, 185 insertions(+), 26 deletions(-)
--
2.43.0
Make sv48 the default address space for mmap, as some applications
currently depend on this assumption. Users can now select the desired
address space by passing a non-zero hint address to mmap. Previously,
requesting the default address space from mmap by passing zero as the
hint address would result in using the largest address space possible.
Some applications, such as Go and Java, depend on empty bits in the
virtual address space, so this patch provides more flexibility for
application developers.
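For illustration, a hedged sketch of the described semantics (not code
from the series; the hint constant is arbitrary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* A zero hint keeps the mapping within the default sv48 space. */
        void *def = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* A hint above the sv48 range opts in to the larger address
         * space (e.g. sv57) on hardware that supports it. */
        void *high = mmap((void *)(1UL << 50), 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("default hint: %p, high hint: %p\n", def, high);
        return 0;
}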
-Charlie
---
v10:
- Move pgtable.h definitions into a non-__ASSEMBLY__ region to resolve
compilation conflicts (pointed out by Conor)
- Will now compile with allmodconfig
v9:
- Raise the mmap_end default to STACK_TOP_MAX to allow the address space to grow
beyond the default of sv48 on sv57 machines as suggested by Alexandre
- Some of the mmap macros had unnecessary conditionals that I have removed
v8:
- Fix RV32 and the RV32 compat mode of RV64 (suggested by Conor)
- Extract out addr and base from the mmap macros (suggested by Alexandre)
v7:
- Changing RLIMIT_STACK inside of an executing program does not trigger
arch_pick_mmap_layout(), so rewrite tests to change RLIMIT_STACK from a
script before executing tests. RLIMIT_STACK of infinity forces bottomup
mmap allocation.
- Make the arch_get_mmap_base macro more readable by extracting the
rnd calculation.
- Use MMAP_MIN_VA_BITS in TASK_UNMAPPED_BASE to support case when mmap
attempts to allocate address smaller than DEFAULT_MAP_WINDOW.
- Fix incorrect wording in documentation.
v6:
- Rebase onto the correct base
v5:
- Minor wording change in documentation
- Change some parenthesis in arch_get_mmap_ macros
- Added case for addr==0 in arch_get_mmap_ because without this, programs would
crash if RLIMIT_STACK was modified before executing the program. This was
tested using the libhugetlbfs tests.
v4:
- Split testcases/document patch into test cases, in-code documentation, and
formal documentation patches
- Modified the mmap_base macro to be more legible and better represent memory
layout
- Fixed documentation to better reflect the implementation
- Renamed DEFAULT_VA_BITS to MMAP_VA_BITS
- Added additional test case for rlimit changes
---
Charlie Jenkins (4):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Add tests for RISC-V mm
RISC-V: mm: Update pgtable comment documentation
RISC-V: mm: Document mmap changes
Documentation/riscv/vm-layout.rst | 22 +++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 33 ++++++++--
arch/riscv/include/asm/processor.h | 52 +++++++++++++--
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/.gitignore | 2 +
tools/testing/selftests/riscv/mm/Makefile | 15 +++++
.../riscv/mm/testcases/mmap_bottomup.c | 35 ++++++++++
.../riscv/mm/testcases/mmap_default.c | 35 ++++++++++
.../selftests/riscv/mm/testcases/mmap_test.h | 64 +++++++++++++++++++
.../selftests/riscv/mm/testcases/run_mmap.sh | 12 ++++
11 files changed, 261 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_default.c
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_test.h
create mode 100755 tools/testing/selftests/riscv/mm/testcases/run_mmap.sh
--
2.34.1
From: Jeff Xu <jeffxu@chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
The new mseal() is an architecture-independent syscall with the
following signature:
mseal(void *addr, size_t len, unsigned long types, unsigned long flags)
addr/len: the memory range. It must be contiguous/allocated memory, or
else mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to the documentation patch (mseal.rst) of this
patch set. They are also fully covered by the selftest.
types: bit mask to specify the sealing types.
MM_SEAL_BASE
MM_SEAL_PROT_PKEY
MM_SEAL_DISCARD_RO_ANON
MM_SEAL_SEAL
The MM_SEAL_BASE:
The base package includes the features common to all VMA sealing
types. It prevents a sealed VMA from:
1> Being unmapped, moved to another location, or shrunk via munmap()
and mremap(); any of these can leave an empty space that can then be
replaced with a VMA carrying a new set of attributes.
2> Having a different VMA moved or expanded into its location, via
mremap().
3> Being modified via mmap(MAP_FIXED).
4> Size expansion via mremap(). This does not appear to pose any
specific risks to sealed VMAs, but it is included anyway because the
use case is unclear. In any case, users can rely on merging to expand
a sealed VMA.
We consider MM_SEAL_BASE the feature on which the other sealing
features depend. For instance, it probably does not make sense to seal
PROT_PKEY without sealing the BASE, so the kernel will implicitly add
SEAL_BASE for SEAL_PROT_PKEY.
The MM_SEAL_PROT_PKEY:
Seal PROT and PKEY of the address range, i.e. mprotect() and
pkey_mprotect() will be denied if the memory is sealed with
MM_SEAL_PROT_PKEY.
The MM_SEAL_DISCARD_RO_ANON:
Certain types of madvise() operations are destructive [6], such as
MADV_DONTNEED, which can effectively alter region contents by
discarding pages, especially when memory is anonymous. This blocks
such operations for anonymous memory which is not writable to the
user.
The MM_SEAL_SEAL:
MM_SEAL_SEAL denies adding a new seal to a VMA.
This is similar to F_SEAL_SEAL in fcntl.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
distinct from those of most applications. For example, in the case of
libc, sealing is only applied to read-only (RO) or read-execute (RX)
memory segments (such as .text and .RELRO) to prevent them from
becoming writable, and the lifetime of those mappings is tied to the
lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively, but write access to it is restricted using pkeys (or, in
the future, ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example, if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow.
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
--------------------------------------------------------------------
Change history:
===============
V3:
- Abandon the per-syscall approach. (Suggested by Linus Torvalds)
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap() (Detail see new discussions)
- Add documentation - mseal.rst
Work in progress:
=================
- update man page for mseal() and mmap()
Open discussions:
=================
Several open discussions from V1/V2 were not incorporated into V3. I
believe these are worth more discussion for future versions of this
patch series.
1> mseal() vs mimmutable()
mseal(): a bitmask of multiple seal types,
BASE + PROT_PKEY + MM_SEAL_DISCARD_RO_ANON;
applying PROT_PKEY implies BASE, and the same goes for DISCARD_RO_ANON.
mimmutable() (OpenBSD):
equal to SEAL_BASE + SEAL_PROT_PKEY in mseal(),
plus allowing a downgrade from W=>NW (OpenBSD);
doesn't have MM_SEAL_DISCARD_RO_ANON.
mimmutable() is designed for memory sealing in libc, while mseal() is
designed for both the Chrome browser and libc.
For the two memory ranges that the Chrome browser wants to seal, as
discussed previously, the allocator still needs to free or discard
memory within the sealed range. For performance reasons, we have
explored two solutions in the past: first, using PKEY-based munmap()
[7], and second, separating SEAL_MPROTECT (v1 of this patch set).
Recently, we have experimented with an alternative approach that
allows us to remove the separation of SEAL_MPROTECT. For those two
memory ranges, Chrome browser will use BASE + PROT_PKEY +
DISCARD_RO_ANON for all its sealing needs.
In the case of libc, the .text segment can be sealed with the BASE and
PROT_PKEY, and the RO data segments can be sealed with the BASE +
PROT_PKEY + DISCARD_RO_ANON.
From a flexibility standpoint, separating BASE out could be beneficial
for future extensions of sealing features. For instance, applications
might desire downgradable "prot" permissions (X=>NX, W=>NW, R=>NR),
which would conflict with SEAL_PROT_PKEY.
The more sealing features integrated into a single sealing type, the
fewer applications can utilize these features. For example, some
applications might programmatically require DISCARD_RO_ANON memory,
which would render such VMAs unsuitable for sealing.
I'd like to get the community's input on this. From Chrome's
perspective, the separation isn't as important anymore, at least in
the short term. However, I prefer the multiple bits approach because
it's more extensible in the long term.
2> mseal() vs mprotect() vs madvise() for setting the seal.
mprotect():
Using the prot field is workable, but prot supports unsetting, i.e.
applications would have to carry the sealing type and set it in all
subsequent calls to mprotect(); it feels like an extra thing to care
about.
madvise():
uses an enum; multiple sealing types might require multiple round
trips.
IMO: sealing is a major departure from other memory syscalls because
it takes away capabilities. The other memory APIs add behaviors or
change attributes, but sealing does the opposite: it reduces
capabilities. The name of the syscall, mseal(), can help emphasize the
"taking away" part.
My second choice would be madvise().
3> Other:
There is also a topic about ptrace, /proc/self/mem, and userfaultfd,
which I think can be followed up in the v1 thread, where it has the
most context.
New discussion topics:
=======================
During the development of V3, new questions and thoughts came up that
I wish to discuss.
1> shm/aio
From reading the code, it seems to me that aio/shm can mmap/munmap
mappings on behalf of userspace, e.g. ksys_shmdt() in shm.c. The
lifetime of those mappings is not tied to the lifetime of the process.
If that memory is sealed from userspace, then unmap will fail. This
isn't a huge problem, since the memory will eventually be freed at
exit or exec. However, it feels like the solution is not complete,
because of the leaks in the VMA address space during the lifetime of
the process. There are two possible solutions to address this, which I
will discuss later.
2> Brk (heap/stack)
Currently, userspace applications can seal parts of the heap by
calling malloc() and mseal(). This raises the question of what the
expected behavior is when sealing the heap is attempted.
Let's assume the following calls from user space:
ptr = malloc(size);
mprotect(ptr, size, RO);
mseal(ptr, size, SEAL_PROT_PKEY);
free(ptr);
Technically, before mseal() is added, the user can change the
protection of the heap by calling mprotect(RO). As long as the user
changes the protection back to RW before free(), the memory can be
reused.
With mseal() in the picture, however, the heap is then sealed
partially; the user can still free it, but the memory remains RO, and
the result of the brk-shrink is nondeterministic, depending on whether
munmap() tries to free the sealed memory (brk uses munmap to shrink
the heap).
3> The above two cases lead to the third topic:
There are two options to address the problem mentioned above.
Option 1: A “MAP_SEALABLE” flag in mmap().
If a map is created without this flag, the mseal() operation will
fail. Applications that are not concerned with sealing will expect
their behavior to be unchanged. For those that are concerned, adding a
flag at mmap time to opt in is not difficult. For the short term, this
solves problems 1 and 2 above. The memory in shm/aio/brk will not have
the MAP_SEALABLE flag at mmap(), and the same is true for the heap.
Option 2: Add MM_SEAL_SEAL during mmap()
It is possible to defensively set MM_SEAL_SEAL for the selected mappings at
creation time. Specifically, we can find all the mmaps that we do not want to
seal, and add the MM_SEAL_SEAL flag in the mmap() call. The difference
between MAP_SEALABLE and MM_SEAL_SEAL is that the first option starts from a
small size and incrementally increases, while the second option is the
opposite.
In my opinion, MAP_SEALABLE is the preferred option. Only a limited set of
mappings need to be sealed, and these are typically created by the runtime. For
the few dedicated applications that manage their own mappings, such as Chrome,
adding an extra flag at mmap() is not a difficult task. It is also a safer
option in terms of regression risk. This is the option included in this
version.
4>
I think it might be possible to seal the stack or other special
mappings created at runtime (vdso, vsyscall, vvar). This means we can
enforce and seal W^X for certain types of applications. For instance,
the stack is typically used in read-write mode, but in some cases it
can become executable. To defend against an unintended addition of the
executable bit to the stack, we could let the application seal it.
Sealing the heap (for adding X) requires special handling, since the
heap can shrink, and shrink is implemented through munmap().
Indeed, it might be possible that all virtual memory accessible to user
space, regardless of its usage pattern, could be sealed. However, this
would require additional research and development work.
------------------------------------------------------------------------
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (11):
mseal: Add mseal syscall.
mseal: Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: add MM_SEAL_BASE
mseal: add MM_SEAL_PROT_PKEY
mseal: add sealing support for mmap
mseal: make sealed VMA mergeable.
mseal: add MM_SEAL_DISCARD_RO_ANON
mseal: add MAP_SEALABLE to mmap()
selftest mm/mseal memory sealing
mseal: add documentation
Documentation/userspace-api/mseal.rst | 189 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/mips/kernel/vdso.c | 10 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/userfaultfd.c | 8 +-
include/linux/mm.h | 178 +-
include/linux/mm_types.h | 8 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/mman-common.h | 16 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 5 +
kernel/sys_ni.c | 1 +
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/madvise.c | 14 +-
mm/mempolicy.c | 2 +-
mm/mlock.c | 2 +-
mm/mmap.c | 77 +-
mm/mprotect.c | 12 +-
mm/mremap.c | 44 +-
mm/mseal.c | 376 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++
41 files changed, 3091 insertions(+), 32 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.472.g3155946c3a-goog
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages; a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests. It does not enable use of
GCS by either EL1 or EL2; this will be implemented separately.
Executables are started without GCS and must use a prctl() to enable
it; it is expected that this will be done very early in application
execution by the dynamic linker or other startup code. For dynamic
linking this will be done by checking that everything in the
executable is marked as GCS compatible.
x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible. As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.
The main divergence I am conscious of is that x86 allows shadow stacks
to be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to re-enable it. This is to avoid
races with things actively walking the GCS during a disable; we do
anticipate that some systems will wish to disable GCS at runtime but
are not aware of any demand for subsequently re-enabling it.
x86 uses an arch_prctl() to manage enable and disable. Since only x86
and S/390 use arch_prctl(), a generic prctl() was proposed[1] as part
of a patch set for the equivalent RISC-V Zicfiss feature, which I
initially adopted fairly directly but which has been revised quite a
bit following review feedback.
We currently maintain the x86 pattern of implicitly allocating a
shadow stack for threads started with shadow stack enabled; there has
been some discussion of removing this support and requiring the use of
clone3() with explicit allocation of shadow stacks instead. I have no
strong feelings either way: implicit allocation is not really
consistent with anything else we do and creates the potential for
errors around thread exit, but on the other hand it is existing ABI on
x86 and minimises the changes needed in userspace code.
There is an open issue with support for CRIU; on x86 this required the
ability to set the GCS mode via ptrace. This series supports
configuring mode bits other than enable/disable via ptrace, but it
needs to be confirmed whether this is sufficient.
The series depends on support for shadow stacks in clone3(), that series
includes the addition of ARCH_HAS_USER_SHADOW_STACK.
https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
It also depends on the addition of more waitpid() flags to nolibc:
https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@k…
You can see a branch with the full set of dependencies against Linus'
tree at:
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (39):
arm64/mm: Restructure arch_validate_flags() for extensibility
prctl: arch-agnostic prctl for shadow stack
mman: Add map_shadow_stack() flags
arm64: Document boot requirements for Guarded Control Stacks
arm64/gcs: Document the ABI for Guarded Control Stacks
arm64/sysreg: Add new system registers for GCS
arm64/sysreg: Add definitions for architected GCS caps
arm64/gcs: Add manual encodings of GCS instructions
arm64/gcs: Provide put_user_gcs()
arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
arm64/mm: Allocate PIE slots for EL0 guarded control stack
mm: Define VM_SHADOW_STACK for arm64 when we support GCS
arm64/mm: Map pages for guarded control stack
KVM: arm64: Manage GCS registers for guests
arm64/gcs: Allow GCS usage at EL0 and EL1
arm64/idreg: Add override for GCS
arm64/hwcap: Add hwcap for GCS
arm64/traps: Handle GCS exceptions
arm64/mm: Handle GCS data aborts
arm64/gcs: Context switch GCS state for EL0
arm64/gcs: Allocate a new GCS for threads with GCS enabled
arm64/gcs: Implement shadow stack prctl() interface
arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/signal: Expose GCS state in signal frames
arm64/ptrace: Expose GCS via ptrace and core files
arm64: Add Kconfig for Guarded Control Stack (GCS)
kselftest/arm64: Verify the GCS hwcap
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add test coverage for GCS mode locking
selftests/arm64: Add GCS signal tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Enable GCS for the FP stress tests
kselftest/clone3: Enable GCS in the clone3 selftests
Documentation/admin-guide/kernel-parameters.txt | 6 +
Documentation/arch/arm64/booting.rst | 22 +
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
Documentation/arch/arm64/gcs.rst | 233 +++++++
Documentation/arch/arm64/index.rst | 1 +
Documentation/filesystems/proc.rst | 2 +-
arch/arm64/Kconfig | 20 +
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 17 +
arch/arm64/include/asm/esr.h | 28 +-
arch/arm64/include/asm/exception.h | 2 +
arch/arm64/include/asm/gcs.h | 107 +++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 12 +
arch/arm64/include/asm/mman.h | 23 +-
arch/arm64/include/asm/pgtable-prot.h | 14 +-
arch/arm64/include/asm/processor.h | 7 +
arch/arm64/include/asm/sysreg.h | 20 +
arch/arm64/include/asm/uaccess.h | 40 ++
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/ptrace.h | 8 +
arch/arm64/include/uapi/asm/sigcontext.h | 9 +
arch/arm64/kernel/cpufeature.c | 19 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-common.c | 23 +
arch/arm64/kernel/idreg-override.c | 2 +
arch/arm64/kernel/process.c | 81 +++
arch/arm64/kernel/ptrace.c | 59 ++
arch/arm64/kernel/signal.c | 236 ++++++-
arch/arm64/kernel/traps.c | 11 +
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 +
arch/arm64/kvm/sys_regs.c | 22 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fault.c | 79 ++-
arch/arm64/mm/gcs.c | 259 +++++++
arch/arm64/mm/mmap.c | 13 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 55 ++
arch/x86/include/uapi/asm/mman.h | 3 -
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 16 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 22 +
kernel/sys.c | 30 +
tools/testing/selftests/arm64/Makefile | 2 +-
tools/testing/selftests/arm64/abi/hwcap.c | 19 +
tools/testing/selftests/arm64/fp/assembler.h | 15 +
tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
tools/testing/selftests/arm64/fp/sve-test.S | 2 +
tools/testing/selftests/arm64/fp/za-test.S | 2 +
tools/testing/selftests/arm64/fp/zt-test.S | 2 +
tools/testing/selftests/arm64/gcs/.gitignore | 5 +
tools/testing/selftests/arm64/gcs/Makefile | 24 +
tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
tools/testing/selftests/arm64/gcs/basic-gcs.c | 428 ++++++++++++
tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++
.../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++
tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++
tools/testing/selftests/arm64/gcs/libc-gcs.c | 742 +++++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.c | 17 +-
.../testing/selftests/arm64/signal/test_signals.h | 6 +
.../selftests/arm64/signal/test_signals_utils.c | 32 +-
.../selftests/arm64/signal/test_signals_utils.h | 39 ++
.../arm64/signal/testcases/gcs_exception_fault.c | 59 ++
.../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++
.../arm64/signal/testcases/gcs_write_fault.c | 67 ++
.../selftests/arm64/signal/testcases/testcases.c | 7 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
tools/testing/selftests/clone3/clone3.c | 37 +
73 files changed, 4234 insertions(+), 40 deletions(-)
---
base-commit: 3d0134d322380292c055454d9633738733992d61
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie@kernel.org>
This extends the KVM RISC-V ONE_REG interface to report more ISA
extensions, namely: Zbc, scalar crypto, vector crypto, Zfh[min],
Zihintntl, Zvfh[min], and Zfa.
This series depends upon the "riscv: report more ISA extensions
through hwprobe" series from Clement.
(Link: https://lore.kernel.org/lkml/20231114141256.126749-1-cleger@rivosinc.com/)
To test these patches, use KVMTOOL from the riscv_more_exts_v1 branch at:
https://github.com/avpatel/kvmtool.git
These patches can also be found in the riscv_kvm_more_exts_v1 branch at:
https://github.com/avpatel/linux.git
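For a smoke test once both branches are built, an illustrative kvmtool
invocation (the image name is a placeholder) would be something like
./lkvm run --cpus 1 --mem 256 --kernel Image, with the new extensions
then visible to the get-reg-list selftest.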
Anup Patel (15):
KVM: riscv: selftests: Generate ISA extension reg_list using macros
RISC-V: KVM: Allow Zbc extension for Guest/VM
KVM: riscv: selftests: Add Zbc extension to get-reg-list test
RISC-V: KVM: Allow scalar crypto extensions for Guest/VM
KVM: riscv: selftests: Add scalar crypto extensions to get-reg-list
test
RISC-V: KVM: Allow vector crypto extensions for Guest/VM
KVM: riscv: selftests: Add vector crypto extensions to get-reg-list
test
RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM
KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test
RISC-V: KVM: Allow Zihintntl extension for Guest/VM
KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test
RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM
KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test
RISC-V: KVM: Allow Zfa extension for Guest/VM
KVM: riscv: selftests: Add Zfa extension to get-reg-list test
arch/riscv/include/uapi/asm/kvm.h | 27 ++
arch/riscv/kvm/vcpu_onereg.c | 54 +++
.../selftests/kvm/riscv/get-reg-list.c | 439 ++++++++----------
3 files changed, 265 insertions(+), 255 deletions(-)
--
2.34.1