December 2023 - Linux-kselftest-mirror

[PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach

by Bernd Edlinger

This introduces signal->exec_bprm, which is used to fix the case when at least one of the sibling threads is traced, and therefore the trace process may dead-lock in ptrace_attach, but de_thread will need to wait for the tracer to continue execution. The solution is to detect this situation and allow ptrace_attach to continue by temporarily releasing the cred_guard_mutex, while de_thread() is still waiting for traced zombies to be eventually released by the tracer. In the case of the thread group leader we only have to wait for the thread to become a zombie, which may also need co-operation from the tracer due to PTRACE_O_TRACEEXIT. When a tracer wants to ptrace_attach a task that already is in execve, we simply retry the ptrace_may_access check while temporarily installing the new credentials and dumpability which are about to be used after execve completes. If the ptrace_attach happens on a thread that is a sibling-thread of the thread doing execve, it is sufficient to check against the old credentials, as this thread will be waited for, before the new credentials are installed. Other threads die quickly since the cred_guard_mutex is released, but a deadly signal is already pending. In case the mutex_lock_killable misses the signal, the non-zero current->signal->exec_bprm makes sure they release the mutex immediately and return with -ERESTARTNOINTR. This means there is no API change, unlike the previous version of this patch which was discussed here: https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d… See tools/testing/selftests/ptrace/vmaccess.c for a test case that gets fixed by this change. Note that since the test case was originally designed to test the ptrace_attach returning an error in this situation, the test expectation needed to be adjusted, to allow the API to succeed at the first attempt. Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de> --- fs/exec.c | 69 ++++++++++++++++------- fs/proc/base.c | 6 ++ include/linux/cred.h | 1 + include/linux/sched/signal.h | 18 ++++++ kernel/cred.c | 28 +++++++-- kernel/ptrace.c | 32 +++++++++++ kernel/seccomp.c | 12 +++- tools/testing/selftests/ptrace/vmaccess.c | 23 +++++--- 8 files changed, 155 insertions(+), 34 deletions(-) v10: Changes to previous version, make the PTRACE_ATTACH retun -EAGAIN, instead of execve return -ERESTARTSYS. Added some lessions learned to the description. v11: Check old and new credentials in PTRACE_ATTACH again without changing the API. Note: I got actually one response from an automatic checker to the v11 patch, https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/ which is complaining about: >> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@ 417 struct linux_binprm *bprm = task->signal->exec_bprm; 418 const struct cred *old_cred; 419 struct mm_struct *old_mm; 420 421 retval = down_write_killable(&task->signal->exec_update_lock); 422 if (retval) 423 goto unlock_creds; 424 task_lock(task); > 425 old_cred = task->real_cred; v12: Essentially identical to v11. - Fixed a minor merge conflict in linux v5.17, and fixed the above mentioned nit by adding __rcu to the declaration. - re-tested the patch with all linux versions from v5.11 to v6.6 v10 was an alternative approach which did imply an API change. But I would prefer to avoid such an API change. The difficult part is getting the right dumpability flags assigned before de_thread starts, hope you like this version. If not, the v10 is of course also acceptable. Thanks Bernd. diff --git a/fs/exec.c b/fs/exec.c index 2f2b0acec4f0..902d3b230485 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm) return 0; } -static int de_thread(struct task_struct *tsk) +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; spinlock_t *lock = &oldsighand->siglock; + struct task_struct *t = tsk; + bool unsafe_execve_in_progress = false; if (thread_group_empty(tsk)) goto no_thread_group; @@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) sig->notify_count--; + while_each_thread(tsk, t) { + if (unlikely(t->ptrace) + && (t != tsk->group_leader || !t->exit_state)) + unsafe_execve_in_progress = true; + } + + if (unlikely(unsafe_execve_in_progress)) { + spin_unlock_irq(lock); + sig->exec_bprm = bprm; + mutex_unlock(&sig->cred_guard_mutex); + spin_lock_irq(lock); + } + while (sig->notify_count) { __set_current_state(TASK_KILLABLE); spin_unlock_irq(lock); @@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } + if (unlikely(unsafe_execve_in_progress)) { + mutex_lock(&sig->cred_guard_mutex); + sig->exec_bprm = NULL; + } + sig->group_exec_task = NULL; sig->notify_count = 0; @@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk) return 0; killed: + if (unlikely(unsafe_execve_in_progress)) { + mutex_lock(&sig->cred_guard_mutex); + sig->exec_bprm = NULL; + } + /* protects against exit_notify() and __exit_signal() */ read_lock(&tasklist_lock); sig->group_exec_task = NULL; @@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) return retval; + /* If the binary is not readable then enforce mm->dumpable=0 */ + would_dump(bprm, bprm->file); + if (bprm->have_execfd) + would_dump(bprm, bprm->executable); + + /* + * Figure out dumpability. Note that this checking only of current + * is wrong, but userspace depends on it. This should be testing + * bprm->secureexec instead. + */ + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || + is_dumpability_changed(current_cred(), bprm->cred) || + !(uid_eq(current_euid(), current_uid()) && + gid_eq(current_egid(), current_gid()))) + set_dumpable(bprm->mm, suid_dumpable); + else + set_dumpable(bprm->mm, SUID_DUMP_USER); + /* * Ensure all future errors are fatal. */ @@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm) /* * Make this the only thread in the thread group. */ - retval = de_thread(me); + retval = de_thread(me, bprm); if (retval) goto out; @@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; - /* If the binary is not readable then enforce mm->dumpable=0 */ - would_dump(bprm, bprm->file); - if (bprm->have_execfd) - would_dump(bprm, bprm->executable); - /* * Release all of the old mmap stuff */ @@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm) me->sas_ss_sp = me->sas_ss_size = 0; - /* - * Figure out dumpability. Note that this checking only of current - * is wrong, but userspace depends on it. This should be testing - * bprm->secureexec instead. - */ - if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || - !(uid_eq(current_euid(), current_uid()) && - gid_eq(current_egid(), current_gid()))) - set_dumpable(current->mm, suid_dumpable); - else - set_dumpable(current->mm, SUID_DUMP_USER); - perf_event_exec(); __set_task_comm(me, kbasename(bprm->filename), true); @@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm) if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) return -ERESTARTNOINTR; + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + return -ERESTARTNOINTR; + } + bprm->cred = prepare_exec_creds(); if (likely(bprm->cred)) return 0; diff --git a/fs/proc/base.c b/fs/proc/base.c index ffd54617c354..0da9adfadb48 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf, if (rv < 0) goto out_free; + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + rv = -ERESTARTNOINTR; + goto out_free; + } + rv = security_setprocattr(PROC_I(inode)->op.lsm, file->f_path.dentry->d_name.name, page, count); diff --git a/include/linux/cred.h b/include/linux/cred.h index f923528d5cc4..b01e309f5686 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *); extern struct cred *cred_alloc_blank(void); extern struct cred *prepare_creds(void); extern struct cred *prepare_exec_creds(void); +extern bool is_dumpability_changed(const struct cred *, const struct cred *); extern int commit_creds(struct cred *); extern void abort_creds(struct cred *); extern const struct cred *override_creds(const struct cred *); diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 0014d3adaf84..14df7073a0a8 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -234,9 +234,27 @@ struct signal_struct { struct mm_struct *oom_mm; /* recorded mm when the thread group got * killed by the oom killer */ + struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access + * against new credentials while + * de_thread is waiting for other + * traced threads to terminate. + * Set while de_thread is executing. + * The cred_guard_mutex is released + * after de_thread() has called + * zap_other_threads(), therefore + * a fatal signal is guaranteed to be + * already pending in the unlikely + * event, that + * current->signal->exec_bprm happens + * to be non-zero after the + * cred_guard_mutex was acquired. + */ + struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations * (notably. ptrace) + * Held while execve runs, except when + * a sibling thread is being traced. * Deprecated do not use in new code. * Use exec_update_lock instead. */ diff --git a/kernel/cred.c b/kernel/cred.c index 98cb4eca23fb..586cb6c7cf6b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset) return false; } +/** + * is_dumpability_changed - Will changing creds from old to new + * affect the dumpability in commit_creds? + * + * Return: false - dumpability will not be changed in commit_creds. + * Return: true - dumpability will be changed to non-dumpable. + * + * @old: The old credentials + * @new: The new credentials + */ +bool is_dumpability_changed(const struct cred *old, const struct cred *new) +{ + if (!uid_eq(old->euid, new->euid) || + !gid_eq(old->egid, new->egid) || + !uid_eq(old->fsuid, new->fsuid) || + !gid_eq(old->fsgid, new->fsgid) || + !cred_cap_issubset(old, new)) + return true; + + return false; +} + /** * commit_creds - Install new credentials upon the current task * @new: The credentials to be assigned @@ -467,11 +489,7 @@ int commit_creds(struct cred *new) get_cred(new); /* we will require a ref for the subj creds too */ /* dumpability changes */ - if (!uid_eq(old->euid, new->euid) || - !gid_eq(old->egid, new->egid) || - !uid_eq(old->fsuid, new->fsuid) || - !gid_eq(old->fsgid, new->fsgid) || - !cred_cap_issubset(old, new)) { + if (is_dumpability_changed(old, new)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); task->pdeath_signal = 0; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 443057bee87c..eb1c450bb7d7 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -20,6 +20,7 @@ #include <linux/pagemap.h> #include <linux/ptrace.h> #include <linux/security.h> +#include <linux/binfmts.h> #include <linux/signal.h> #include <linux/uio.h> #include <linux/audit.h> @@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request, if (retval) goto unlock_creds; + if (unlikely(task->in_execve)) { + struct linux_binprm *bprm = task->signal->exec_bprm; + const struct cred __rcu *old_cred; + struct mm_struct *old_mm; + + retval = down_write_killable(&task->signal->exec_update_lock); + if (retval) + goto unlock_creds; + task_lock(task); + old_cred = task->real_cred; + old_mm = task->mm; + rcu_assign_pointer(task->real_cred, bprm->cred); + task->mm = bprm->mm; + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS); + rcu_assign_pointer(task->real_cred, old_cred); + task->mm = old_mm; + task_unlock(task); + up_write(&task->signal->exec_update_lock); + if (retval) + goto unlock_creds; + } + write_lock_irq(&tasklist_lock); retval = -EPERM; if (unlikely(task->exit_state)) @@ -508,6 +531,14 @@ static int ptrace_traceme(void) { int ret = -EPERM; + if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) + return -ERESTARTNOINTR; + + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + return -ERESTARTNOINTR; + } + write_lock_irq(&tasklist_lock); /* Are we already being traced? */ if (!current->ptrace) { @@ -523,6 +554,7 @@ static int ptrace_traceme(void) } } write_unlock_irq(&tasklist_lock); + mutex_unlock(&current->signal->cred_guard_mutex); return ret; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 255999ba9190..b29bbfa0b044 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags, * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. */ - if (flags & SECCOMP_FILTER_FLAG_TSYNC && - mutex_lock_killable(&current->signal->cred_guard_mutex)) - goto out_put_fd; + if (flags & SECCOMP_FILTER_FLAG_TSYNC) { + if (mutex_lock_killable(&current->signal->cred_guard_mutex)) + goto out_put_fd; + + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + goto out_put_fd; + } + } spin_lock_irq(&current->sighand->siglock); diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c index 4db327b44586..3b7d81fb99bb 100644 --- a/tools/testing/selftests/ptrace/vmaccess.c +++ b/tools/testing/selftests/ptrace/vmaccess.c @@ -39,8 +39,15 @@ TEST(vmaccess) f = open(mm, O_RDONLY); ASSERT_GE(f, 0); close(f); - f = kill(pid, SIGCONT); - ASSERT_EQ(f, 0); + f = waitpid(-1, NULL, 0); + ASSERT_NE(f, -1); + ASSERT_NE(f, 0); + ASSERT_NE(f, pid); + f = waitpid(-1, NULL, 0); + ASSERT_EQ(f, pid); + f = waitpid(-1, NULL, 0); + ASSERT_EQ(f, -1); + ASSERT_EQ(errno, ECHILD); } TEST(attach) @@ -57,22 +64,24 @@ TEST(attach) sleep(1); k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); - ASSERT_EQ(errno, EAGAIN); - ASSERT_EQ(k, -1); + ASSERT_EQ(k, 0); k = waitpid(-1, &s, WNOHANG); ASSERT_NE(k, -1); ASSERT_NE(k, 0); ASSERT_NE(k, pid); ASSERT_EQ(WIFEXITED(s), 1); ASSERT_EQ(WEXITSTATUS(s), 0); - sleep(1); - k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); + k = waitpid(-1, &s, 0); + ASSERT_EQ(k, pid); + ASSERT_EQ(WIFSTOPPED(s), 1); + ASSERT_EQ(WSTOPSIG(s), SIGTRAP); + k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGSTOP); - k = ptrace(PTRACE_DETACH, pid, 0L, 0L); + k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); -- 2.39.2

1 year, 5 months

5
14
0 0

[PATCH] selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros

by Muhammad Usama Anjum

Correct header file is needed for getting CLOSE_RANGE_* macros. Previously it was tested with newer glibc which didn't show the need to include the header which was a mistake. Fixes: ec54424923cf ("selftests: core: remove duplicate defines") Reported-by: Aishwarya TCV <aishwarya.tcv(a)arm.com> Link: https://lore.kernel.org/all/7161219e-0223-d699-d6f3-81abd9abf13b@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com> --- tools/testing/selftests/core/close_range_test.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c index 534576f06df1c..c59e4adb905df 100644 --- a/tools/testing/selftests/core/close_range_test.c +++ b/tools/testing/selftests/core/close_range_test.c @@ -12,6 +12,7 @@ #include <syscall.h> #include <unistd.h> #include <sys/resource.h> +#include <linux/close_range.h> #include "../kselftest_harness.h" #include "../clone3/clone3_selftests.h" -- 2.42.0

1 year, 5 months

1
4
0 0

[PATCH v10 0/4] RISC-V: mm: Make SV48 the default address space

by Charlie Jenkins

Make sv48 the default address space for mmap as some applications currently depend on this assumption. Users can now select a desired address space using a non-zero hint address to mmap. Previously, requesting the default address space from mmap by passing zero as the hint address would result in using the largest address space possible. Some applications depend on empty bits in the virtual address space, like Go and Java, so this patch provides more flexibility for application developers. -Charlie --- v10: - Move pgtable.h defintions into a no __ASSEMBLY__ region to resolve compilation conflicts (pointed out by Conor) - Will now compile with allmodconfig v9: - Raise the mmap_end default to STACK_TOP_MAX to allow the address space to grow beyond the default of sv48 on sv57 machines as suggested by Alexandre - Some of the mmap macros had unnecessary conditionals that I have removed v8: - Fix RV32 and the RV32 compat mode of RV64 (suggested by Conor) - Extract out addr and base from the mmap macros (suggested by Alexandre) v7: - Changing RLIMIT_STACK inside of an executing program does not trigger arch_pick_mmap_layout(), so rewrite tests to change RLIMIT_STACK from a script before executing tests. RLIMIT_STACK of infinity forces bottomup mmap allocation. - Make arch_get_mmap_base macro more readible by extracting out the rnd calculation. - Use MMAP_MIN_VA_BITS in TASK_UNMAPPED_BASE to support case when mmap attempts to allocate address smaller than DEFAULT_MAP_WINDOW. - Fix incorrect wording in documentation. v6: - Rebase onto the correct base v5: - Minor wording change in documentation - Change some parenthesis in arch_get_mmap_ macros - Added case for addr==0 in arch_get_mmap_ because without this, programs would crash if RLIMIT_STACK was modified before executing the program. This was tested using the libhugetlbfs tests. v4: - Split testcases/document patch into test cases, in-code documentation, and formal documentation patches - Modified the mmap_base macro to be more legible and better represent memory layout - Fixed documentation to better reflect the implmentation - Renamed DEFAULT_VA_BITS to MMAP_VA_BITS - Added additional test case for rlimit changes --- Charlie Jenkins (4): RISC-V: mm: Restrict address space for sv39,sv48,sv57 RISC-V: mm: Add tests for RISC-V mm RISC-V: mm: Update pgtable comment documentation RISC-V: mm: Document mmap changes Documentation/riscv/vm-layout.rst | 22 +++++++ arch/riscv/include/asm/elf.h | 2 +- arch/riscv/include/asm/pgtable.h | 33 ++++++++-- arch/riscv/include/asm/processor.h | 52 +++++++++++++-- tools/testing/selftests/riscv/Makefile | 2 +- tools/testing/selftests/riscv/mm/.gitignore | 2 + tools/testing/selftests/riscv/mm/Makefile | 15 +++++ .../riscv/mm/testcases/mmap_bottomup.c | 35 ++++++++++ .../riscv/mm/testcases/mmap_default.c | 35 ++++++++++ .../selftests/riscv/mm/testcases/mmap_test.h | 64 +++++++++++++++++++ .../selftests/riscv/mm/testcases/run_mmap.sh | 12 ++++ 11 files changed, 261 insertions(+), 13 deletions(-) create mode 100644 tools/testing/selftests/riscv/mm/.gitignore create mode 100644 tools/testing/selftests/riscv/mm/Makefile create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_default.c create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_test.h create mode 100755 tools/testing/selftests/riscv/mm/testcases/run_mmap.sh -- 2.34.1

1 year, 5 months

3
11
0 0

[RFC PATCH v3 00/11] Introduce mseal()

by jeffxu＠chromium.org

From: Jeff Xu <jeffxu(a)chromium.org> This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. The new mseal() is an architecture independent syscall, and with following signature: mseal(void addr, size_t len, unsigned long types, unsigned long flags) addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to documentation patch (mseal.rst) of this patch set. Those are also fully covered by the selftest. types: bit mask to specify the sealing types. MM_SEAL_BASE MM_SEAL_PROT_PKEY MM_SEAL_DISCARD_RO_ANON MM_SEAL_SEAL The MM_SEAL_BASE: The base package includes the features common to all VMA sealing types. It prevents sealed VMAs from: 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Move or expand a different vma into the current location, via mremap(). 3> Modifying sealed VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. We consider the MM_SEAL_BASE feature, on which other sealing features will depend. For instance, it probably does not make sense to seal PROT_PKEY without sealing the BASE, and the kernel will implicitly add SEAL_BASE for SEAL_PROT_PKEY. The MM_SEAL_PROT_PKEY: Seal PROT and PKEY of the address range, i.e. mprotect() and pkey_mprotect() will be denied if the memory is sealed with MM_SEAL_PROT_PKEY. The MM_SEAL_DISCARD_RO_ANON Certain types of madvise() operations are destructive [6], such as MADV_DONTNEED, which can effectively alter region contents by discarding pages, especially when memory is anonymous. This blocks such operations for anonymous memory which is not writable to the user. The MM_SEAL_SEAL MM_SEAL_SEAL denies adding a new seal for an VMA. This is similar to F_SEAL_SEAL in fcntl. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. -------------------------------------------------------------------- Change history: =============== V3: - Abandon per-syscall approach, (Suggested by Linus Torvalds). - Organize sealing types around their functionality, such as MM_SEAL_BASE, MM_SEAL_PROT_PKEY. - Extend the scope of sealing from calls originated in userspace to both kernel and userspace. (Suggested by Linus Torvalds) - Add seal type support in mmap(). (Suggested by Pedro Falcato) - Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent destructive operations of madvise. (Suggested by Jann Horn and Stephen Röttger) - Make sealed VMAs mergeable. (Suggested by Jann Horn) - Add MAP_SEALABLE to mmap() (Detail see new discussions) - Add documentation - mseal.rst Work in progress: ================= - update man page for mseal() and mmap() Open discussions: ================= Several open discussions from V1/V2 were not incorporated into V3. I believe these are worth more discussion for future versions of this patch series. 1> mseal() vs mimmutable() mseal(bitmasks for multiple seal types) BASE + PROT_PKEY+ MM_SEAL_DISCARD_RO_ANON Apply PROT_PKEY implies BASE, same for DISCARD_RO_ANON mimmutable() (openBSD) This is equal to SEAL_BASE + SEAL_PROT_PKEY in mseal() Plus allowing downgrade from W=>NW (OpenBSD) Doesn’t have MM_SEAL_DISCARD_RO_ANON mimmutable() is designed for memory sealing in libc, and mseal() is designed for both Chrome browser and libc. For the two memory ranges that Chrome browser wants to seal, as discussed previously, the allocator still needs to free or discard memory for the sealed memory. For performance reasons, we have explored two solutions in the past: first, using PKEY-based munmap() [7], and second, separating SEAL_MPROTECT (v1 of this patch set). Recently, we have experimented with an alternative approach that allows us to remove the separation of SEAL_MPROTECT. For those two memory ranges, Chrome browser will use BASE + PROT_PKEY + DISCARD_RO_ANON for all its sealing needs. In the case of libc, the .text segment can be sealed with the BASE and PROT_PKEY, and the RO data segments can be sealed with the BASE + PROT_PKEY + DISCARD_RO_ANON. From a flexibility standpoint, separating BASE out could be beneficial for future extensions of sealing features. For instance, applications might desire downgradable "prot" permissions (X=>NX, W=>NW, R=>NR), which would conflict with SEAL_PROT_PKEY. The more sealing features integrated into a single sealing type, the fewer applications can utilize these features. For example, some applications might programmatically require DISCARD_RO_ANON memory, which would render such VMA unsuitable for sealing. I'd like to get the community's input on this. From Chrome's perspective, the separation isn't as important anymore, at least in the short term. However, I prefer the multiple bits approach because it's more extensible in the long term. 2> mseal() vs mprotect() vs madvise() for setting the seal. mprotect(): Using prot field, but prot supports unset. It's workable, i.e. let applications carry the sealing type and set in all subsequent calls to mprotect(), but it feels like this is an extra thing to care about. madvise(): uses enum, multiple sealing types might require multiple roundtrips. IMO: sealing is a major departure from other memory syscalls because it takes away capabilities. The other memory APIs add behaviors or change attributes, but sealing does the opposite: it reduces capabilities. The name of the syscall, mseal(), can help emphasize the "taking away" part. My second choice would be madvise(). 3> Other: There is also a topic about ptrace/, /proc/self/mem, Userfaultfd, which I think can be followed up using v1 thread, where it has the most context. New discussions topics: ======================= During the development of V3, I had new questions and thoughts and wished to discuss. 1> shm/aio From reading the code, it seems to me that aio/shm can mmap/munmap maps on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime of those mapping are not tied to the lifetime of the process. If those memories are sealed from userspace, then unmap will fail. This isn’t a huge problem, since the memory will eventually be freed at exit or exec. However, it feels like the solution is not complete, because of the leaks in VMA address space during the lifetime of the process. There could be two solutions to address this, which I will discuss later. 2> Brk (heap/stack) Currently, userspace applications can seal parts of the heap by calling malloc() and mseal(). This raises the question of what the expected behavior is when sealing the heap is attempted. let's assume following calls from user space: ptr = malloc(size); mprotect(ptr, size, RO); mseal(ptr, size, SEAL_PROT_PKEY); free(ptr); Technically, before mseal() is added, the user can change the protection of the heap by calling mprotect(RO). As long as the user changes the protection back to RW before free(), the memory can be reused. Adding mseal() into picture, however, the heap is then sealed partially, user can still free it, but the memory remains to be RO, and the result of brk-shrink is nondeterministic, depending on if munmap() will try to free the sealed memory.(brk uses munmap to shrink the heap). 3> Above two cases led to the third topic: There are two options to address the problem mentioned above. Option 1: A “MAP_SEALABLE” flag in mmap(). If a map is created without this flag, the mseal() operation will fail. Applications that are not concerned with sealing will expect their behavior to be unchanged. For those that are concerned, adding a flag at mmap time to opt in is not difficult. For the short term, this solves problems 1 and 2 above. The memory in shm/aio/brk will not have the MAP_SEALABLE flag at mmap(), and the same is true for the heap. Option 2: Add MM_SEAL_SEAL during mmap() It is possible to defensively set MM_SEAL_SEAL for the selected mappings at creation time. Specifically, we can find all the mmaps that we do not want to seal, and add the MM_SEAL_SEAL flag in the mmap() call. The difference between MAP_SEALABLE and MM_SEAL_SEAL is that the first option starts from a small size and incrementally increases, while the second option is the opposite. In my opinion, MAP_SEALABLE is the preferred option. Only a limited set of mappings need to be sealed, and these are typically created by the runtime. For the few dedicated applications that manage their own mappings, such as Chrome, adding an extra flag at mmap() is not a difficult task. It is also a safer option in terms of regression risk. This is the option included in this version. 4> I think it might be possible to seal the stack or other special mappings created at runtime (vdso, vsyscall, vvar). This means we can enforce and seal W^X for certain types of application. For instance, the stack is typically used in read-write mode, but in some cases, it can become executable. To defend against unintented addition of executable bit to stack, we could let the application to seal it. Sealing the heap (for adding X) requires special handling, since the heap can shrink, and shrink is implemented through munmap(). Indeed, it might be possible that all virtual memory accessible to user space, regardless of its usage pattern, could be sealed. However, this would require additional research and development work. ------------------------------------------------------------------------ v2: Use _BITUL to define MM_SEAL_XX type. Use unsigned long for seal type in sys_mseal() and other functions. Remove internal VM_SEAL_XX type and convert_user_seal_type(). Remove MM_ACTION_XX type. Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask. Add more comments in code. Add a detailed commit message. https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/ v1: https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/ ---------------------------------------------------------------- [1] https://kernelnewbies.org/Linux_2_6_8 [2] https://v8.dev/blog/control-flow-integrity [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b… [4] https://man.openbsd.org/mimmutable.2 [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge… [6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf… [7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/ Jeff Xu (11): mseal: Add mseal syscall. mseal: Wire up mseal syscall mseal: add can_modify_mm and can_modify_vma mseal: add MM_SEAL_BASE mseal: add MM_SEAL_PROT_PKEY mseal: add sealing support for mmap mseal: make sealed VMA mergeable. mseal: add MM_SEAL_DISCARD_RO_ANON mseal: add MAP_SEALABLE to mmap() selftest mm/mseal memory sealing mseal:add documentation Documentation/userspace-api/mseal.rst | 189 ++ arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/mips/kernel/vdso.c | 10 +- arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/userfaultfd.c | 8 +- include/linux/mm.h | 178 +- include/linux/mm_types.h | 8 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/mman-common.h | 16 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/mman.h | 5 + kernel/sys_ni.c | 1 + mm/Kconfig | 9 + mm/Makefile | 1 + mm/madvise.c | 14 +- mm/mempolicy.c | 2 +- mm/mlock.c | 2 +- mm/mmap.c | 77 +- mm/mprotect.c | 12 +- mm/mremap.c | 44 +- mm/mseal.c | 376 ++++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/config | 1 + tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++ 41 files changed, 3091 insertions(+), 32 deletions(-) create mode 100644 Documentation/userspace-api/mseal.rst create mode 100644 mm/mseal.c create mode 100644 tools/testing/selftests/mm/mseal_test.c -- 2.43.0.472.g3155946c3a-goog

1 year, 5 months

9
27
0 0

[PATCH v3 4/4] selftest/bpf: Test a perf bpf program that suppresses side effects.

by Kyle Huey

The test sets a hardware breakpoint and uses a bpf program to suppress the side effects of a perf event sample, including I/O availability signals, SIGTRAPs, and decrementing the event counter limit, if the ip matches the expected value. Then the function with the breakpoint is executed multiple times to test that all effects behave as expected. Signed-off-by: Kyle Huey <khuey(a)kylehuey.com> --- .../selftests/bpf/prog_tests/perf_skip.c | 140 ++++++++++++++++++ .../selftests/bpf/progs/test_perf_skip.c | 15 ++ 2 files changed, 155 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c b/tools/testing/selftests/bpf/prog_tests/perf_skip.c new file mode 100644 index 000000000000..0200736a8baf --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE + +#include <test_progs.h> +#include "test_perf_skip.skel.h" +#include <linux/compiler.h> +#include <linux/hw_breakpoint.h> +#include <sys/mman.h> + +#ifndef TRAP_PERF +#define TRAP_PERF 6 +#endif + +int signals_unexpected = 1; +int sigio_count, sigtrap_count; + +static void handle_sigio(int sig __always_unused) +{ + ASSERT_OK(signals_unexpected, "perf event not skipped"); + ++sigio_count; +} + +static void handle_sigtrap(int signum __always_unused, + siginfo_t *info, + void *ucontext __always_unused) +{ + ASSERT_OK(signals_unexpected, "perf event not skipped"); + ASSERT_EQ(info->si_code, TRAP_PERF, "wrong si_code"); + ++sigtrap_count; +} + +static noinline int test_function(void) +{ + asm volatile (""); + return 0; +} + +void serial_test_perf_skip(void) +{ + struct sigaction action = {}; + struct sigaction previous_sigtrap; + sighandler_t previous_sigio; + struct test_perf_skip *skel = NULL; + struct perf_event_attr attr = {}; + int perf_fd = -1; + int err; + struct f_owner_ex owner; + struct bpf_link *prog_link = NULL; + + action.sa_flags = SA_SIGINFO | SA_NODEFER; + action.sa_sigaction = handle_sigtrap; + sigemptyset(&action.sa_mask); + if (!ASSERT_OK(sigaction(SIGTRAP, &action, &previous_sigtrap), "sigaction")) + return; + + previous_sigio = signal(SIGIO, handle_sigio); + + skel = test_perf_skip__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_load")) + goto cleanup; + + attr.type = PERF_TYPE_BREAKPOINT; + attr.size = sizeof(attr); + attr.bp_type = HW_BREAKPOINT_X; + attr.bp_addr = (uintptr_t)test_function; + attr.bp_len = sizeof(long); + attr.sample_period = 1; + attr.sample_type = PERF_SAMPLE_IP; + attr.pinned = 1; + attr.exclude_kernel = 1; + attr.exclude_hv = 1; + attr.precise_ip = 3; + attr.sigtrap = 1; + attr.remove_on_exec = 1; + + perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); + if (perf_fd < 0 && (errno == ENOENT || errno == EOPNOTSUPP)) { + printf("SKIP:no PERF_TYPE_BREAKPOINT/HW_BREAKPOINT_X\n"); + test__skip(); + goto cleanup; + } + if (!ASSERT_OK(perf_fd < 0, "perf_event_open")) + goto cleanup; + + /* Configure the perf event to signal on sample. */ + err = fcntl(perf_fd, F_SETFL, O_ASYNC); + if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)")) + goto cleanup; + + owner.type = F_OWNER_TID; + owner.pid = syscall(__NR_gettid); + err = fcntl(perf_fd, F_SETOWN_EX, &owner); + if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)")) + goto cleanup; + + /* + * Allow at most one sample. A sample rejected by bpf should + * not count against this. + */ + err = ioctl(perf_fd, PERF_EVENT_IOC_REFRESH, 1); + if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_REFRESH)")) + goto cleanup; + + prog_link = bpf_program__attach_perf_event(skel->progs.handler, perf_fd); + if (!ASSERT_OK_PTR(prog_link, "bpf_program__attach_perf_event")) + goto cleanup; + + /* Configure the bpf program to suppress the sample. */ + skel->bss->ip = (uintptr_t)test_function; + test_function(); + + ASSERT_EQ(sigio_count, 0, "sigio_count"); + ASSERT_EQ(sigtrap_count, 0, "sigtrap_count"); + + /* Configure the bpf program to allow the sample. */ + skel->bss->ip = 0; + signals_unexpected = 0; + test_function(); + + ASSERT_EQ(sigio_count, 1, "sigio_count"); + ASSERT_EQ(sigtrap_count, 1, "sigtrap_count"); + + /* + * Test that the sample above is the only one allowed (by perf, not + * by bpf) + */ + test_function(); + + ASSERT_EQ(sigio_count, 1, "sigio_count"); + ASSERT_EQ(sigtrap_count, 1, "sigtrap_count"); + +cleanup: + bpf_link__destroy(prog_link); + if (perf_fd >= 0) + close(perf_fd); + test_perf_skip__destroy(skel); + + signal(SIGIO, previous_sigio); + sigaction(SIGTRAP, &previous_sigtrap, NULL); +} diff --git a/tools/testing/selftests/bpf/progs/test_perf_skip.c b/tools/testing/selftests/bpf/progs/test_perf_skip.c new file mode 100644 index 000000000000..7eb8b6de7a57 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_perf_skip.c @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +uintptr_t ip; + +SEC("perf_event") +int handler(struct bpf_perf_event_data *data) +{ + /* Skip events that have the correct ip. */ + return ip != PT_REGS_IP(&data->regs); +} + +char _license[] SEC("license") = "GPL"; -- 2.34.1

1 year, 5 months

2
2
0 0

[PATCH v7 00/39] arm64/gcs: Provide support for GCS in userspace

by Mark Brown

The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling. When GCS is active a secondary stack called the Guarded Control Stack is maintained, protected with a memory attribute which means that it can only be written with specific GCS operations. The current GCS pointer can not be directly written to by userspace. When a BL is executed the value stored in LR is also pushed onto the GCS, and when a RET is executed the top of the GCS is popped and compared to LR with a fault being raised if the values do not match. GCS operations may only be performed on GCS pages, a data abort is generated if they are not. The combination of hardware enforcement and lack of extra instructions in the function entry and exit paths should result in something which has less overhead and is more difficult to attack than a purely software implementation like clang's shadow stacks. This series implements support for use of GCS by userspace, along with support for use of GCS within KVM guests. It does not enable use of GCS by either EL1 or EL2, this will be implemented separately. Executables are started without GCS and must use a prctl() to enable it, it is expected that this will be done very early in application execution by the dynamic linker or other startup code. For dynamic linking this will be done by checking that everything in the executable is marked as GCS compatible. x86 has an equivalent feature called shadow stacks, this series depends on the x86 patches for generic memory management support for the new guarded/shadow stack page type and shares APIs as much as possible. As there has been extensive discussion with the wider community around the ABI for shadow stacks I have as far as practical kept implementation decisions close to those for x86, anticipating that review would lead to similar conclusions in the absence of strong reasoning for divergence. The main divergence I am concious of is that x86 allows shadow stack to be enabled and disabled repeatedly, freeing the shadow stack for the thread whenever disabled, while this implementation keeps the GCS allocated after disable but refuses to reenable it. This is to avoid races with things actively walking the GCS during a disable, we do anticipate that some systems will wish to disable GCS at runtime but are not aware of any demand for subsequently reenabling it. x86 uses an arch_prctl() to manage enable and disable, since only x86 and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a patch set for the equivalent RISC-V Zicfiss feature which I initially adopted fairly directly but following review feedback has been revised quite a bit. We currently maintain the x86 pattern of implicitly allocating a shadow stack for threads started with shadow stack enabled, there has been some discussion of removing this support and requiring the use of clone3() with explicit allocation of shadow stacks instead. I have no strong feelings either way, implicit allocation is not really consistent with anything else we do and creates the potential for errors around thread exit but on the other hand it is existing ABI on x86 and minimises the changes needed in userspace code. There is an open issue with support for CRIU, on x86 this required the ability to set the GCS mode via ptrace. This series supports configuring mode bits other than enable/disable via ptrace but it needs to be confirmed if this is sufficient. The series depends on support for shadow stacks in clone3(), that series includes the addition of ARCH_HAS_USER_SHADOW_STACK. https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke… It also depends on the addition of more waitpid() flags to nolibc: https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@k… You can see a branch with the full set of dependencies against Linus' tree at: https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs [1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v7: - Rebase onto v6.7-rc2 via the clone3() patch series. - Change the token used to cap the stack during signal handling to be compatible with GCSPOPM. - Fix flags for new page types. - Fold in support for clone3(). - Replace copy_to_user_gcs() with put_user_gcs(). - Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org Changes in v6: - Rebase onto v6.6-rc3. - Add some more gcsb_dsync() barriers following spec clarifications. - Due to ongoing discussion around clone()/clone3() I've not updated anything there, the behaviour is the same as on previous versions. - Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org Changes in v5: - Don't map any permissions for user GCSs, we always use EL0 accessors or use a separate mapping of the page. - Reduce the standard size of the GCS to RLIMIT_STACK/2. - Enforce a PAGE_SIZE alignment requirement on map_shadow_stack(). - Clarifications and fixes to documentation. - More tests. - Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org Changes in v4: - Implement flags for map_shadow_stack() allowing the cap and end of stack marker to be enabled independently or not at all. - Relax size and alignment requirements for map_shadow_stack(). - Add more blurb explaining the advantages of hardware enforcement. - Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org Changes in v3: - Rebase onto v6.5-rc4. - Add a GCS barrier on context switch. - Add a GCS stress test. - Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org Changes in v2: - Rebase onto v6.5-rc3. - Rework prctl() interface to allow each bit to be locked independently. - map_shadow_stack() now places the cap token based on the size requested by the caller not the actual space allocated. - Mode changes other than enable via ptrace are now supported. - Expand test coverage. - Various smaller fixes and adjustments. - Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org --- Mark Brown (39): arm64/mm: Restructure arch_validate_flags() for extensibility prctl: arch-agnostic prctl for shadow stack mman: Add map_shadow_stack() flags arm64: Document boot requirements for Guarded Control Stacks arm64/gcs: Document the ABI for Guarded Control Stacks arm64/sysreg: Add new system registers for GCS arm64/sysreg: Add definitions for architected GCS caps arm64/gcs: Add manual encodings of GCS instructions arm64/gcs: Provide put_user_gcs() arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS) arm64/mm: Allocate PIE slots for EL0 guarded control stack mm: Define VM_SHADOW_STACK for arm64 when we support GCS arm64/mm: Map pages for guarded control stack KVM: arm64: Manage GCS registers for guests arm64/gcs: Allow GCS usage at EL0 and EL1 arm64/idreg: Add overrride for GCS arm64/hwcap: Add hwcap for GCS arm64/traps: Handle GCS exceptions arm64/mm: Handle GCS data aborts arm64/gcs: Context switch GCS state for EL0 arm64/gcs: Allocate a new GCS for threads with GCS enabled arm64/gcs: Implement shadow stack prctl() interface arm64/mm: Implement map_shadow_stack() arm64/signal: Set up and restore the GCS context for signal handlers arm64/signal: Expose GCS state in signal frames arm64/ptrace: Expose GCS via ptrace and core files arm64: Add Kconfig for Guarded Control Stack (GCS) kselftest/arm64: Verify the GCS hwcap kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Add very basic GCS test program kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add test coverage for GCS mode locking selftests/arm64: Add GCS signal tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Enable GCS for the FP stress tests kselftest/clone3: Enable GCS in the clone3 selftests Documentation/admin-guide/kernel-parameters.txt | 6 + Documentation/arch/arm64/booting.rst | 22 + Documentation/arch/arm64/elf_hwcaps.rst | 3 + Documentation/arch/arm64/gcs.rst | 233 +++++++ Documentation/arch/arm64/index.rst | 1 + Documentation/filesystems/proc.rst | 2 +- arch/arm64/Kconfig | 20 + arch/arm64/include/asm/cpufeature.h | 6 + arch/arm64/include/asm/el2_setup.h | 17 + arch/arm64/include/asm/esr.h | 28 +- arch/arm64/include/asm/exception.h | 2 + arch/arm64/include/asm/gcs.h | 107 +++ arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/asm/kvm_arm.h | 4 +- arch/arm64/include/asm/kvm_host.h | 12 + arch/arm64/include/asm/mman.h | 23 +- arch/arm64/include/asm/pgtable-prot.h | 14 +- arch/arm64/include/asm/processor.h | 7 + arch/arm64/include/asm/sysreg.h | 20 + arch/arm64/include/asm/uaccess.h | 40 ++ arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/ptrace.h | 8 + arch/arm64/include/uapi/asm/sigcontext.h | 9 + arch/arm64/kernel/cpufeature.c | 19 + arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kernel/entry-common.c | 23 + arch/arm64/kernel/idreg-override.c | 2 + arch/arm64/kernel/process.c | 81 +++ arch/arm64/kernel/ptrace.c | 59 ++ arch/arm64/kernel/signal.c | 236 ++++++- arch/arm64/kernel/traps.c | 11 + arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 + arch/arm64/kvm/sys_regs.c | 22 + arch/arm64/mm/Makefile | 1 + arch/arm64/mm/fault.c | 79 ++- arch/arm64/mm/gcs.c | 259 +++++++ arch/arm64/mm/mmap.c | 13 +- arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/sysreg | 55 ++ arch/x86/include/uapi/asm/mman.h | 3 - fs/proc/task_mmu.c | 3 + include/linux/mm.h | 16 +- include/uapi/asm-generic/mman.h | 4 + include/uapi/linux/elf.h | 1 + include/uapi/linux/prctl.h | 22 + kernel/sys.c | 30 + tools/testing/selftests/arm64/Makefile | 2 +- tools/testing/selftests/arm64/abi/hwcap.c | 19 + tools/testing/selftests/arm64/fp/assembler.h | 15 + tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 + tools/testing/selftests/arm64/fp/sve-test.S | 2 + tools/testing/selftests/arm64/fp/za-test.S | 2 + tools/testing/selftests/arm64/fp/zt-test.S | 2 + tools/testing/selftests/arm64/gcs/.gitignore | 5 + tools/testing/selftests/arm64/gcs/Makefile | 24 + tools/testing/selftests/arm64/gcs/asm-offsets.h | 0 tools/testing/selftests/arm64/gcs/basic-gcs.c | 428 ++++++++++++ tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++ .../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++ tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++ tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++ tools/testing/selftests/arm64/gcs/libc-gcs.c | 742 +++++++++++++++++++++ tools/testing/selftests/arm64/signal/.gitignore | 1 + .../testing/selftests/arm64/signal/test_signals.c | 17 +- .../testing/selftests/arm64/signal/test_signals.h | 6 + .../selftests/arm64/signal/test_signals_utils.c | 32 +- .../selftests/arm64/signal/test_signals_utils.h | 39 ++ .../arm64/signal/testcases/gcs_exception_fault.c | 59 ++ .../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++ .../arm64/signal/testcases/gcs_write_fault.c | 67 ++ .../selftests/arm64/signal/testcases/testcases.c | 7 + .../selftests/arm64/signal/testcases/testcases.h | 1 + tools/testing/selftests/clone3/clone3.c | 37 + 73 files changed, 4234 insertions(+), 40 deletions(-) --- base-commit: 3d0134d322380292c055454d9633738733992d61 change-id: 20230303-arm64-gcs-e311ab0d8729 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 year, 5 months

4
68
0 0

[PATCH 00/15] KVM RISC-V report more ISA extensions through ONE_REG

by Anup Patel

This extends the KVM RISC-V ONE_REG interface to report more ISA extensions namely: Zbz, scalar crypto, vector crypto, Zfh[min], Zihintntl, Zvfh[min], and Zfa. This series depends upon the "riscv: report more ISA extensions through hwprobe" series.from Clement. (Link: https://lore.kernel.org/lkml/20231114141256.126749-1-cleger@rivosinc.com/) To test these patches, use KVMTOOL from the riscv_more_exts_v1 branch at: https://github.com/avpatel/kvmtool.git These patches can also be found in the riscv_kvm_more_exts_v1 branch at: https://github.com/avpatel/linux.git Anup Patel (15): KVM: riscv: selftests: Generate ISA extension reg_list using macros RISC-V: KVM: Allow Zbc extension for Guest/VM KVM: riscv: selftests: Add Zbc extension to get-reg-list test RISC-V: KVM: Allow scalar crypto extensions for Guest/VM KVM: riscv: selftests: Add scaler crypto extensions to get-reg-list test RISC-V: KVM: Allow vector crypto extensions for Guest/VM KVM: riscv: selftests: Add vector crypto extensions to get-reg-list test RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test RISC-V: KVM: Allow Zihintntl extension for Guest/VM KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test RISC-V: KVM: Allow Zfa extension for Guest/VM KVM: riscv: selftests: Add Zfa extension to get-reg-list test arch/riscv/include/uapi/asm/kvm.h | 27 ++ arch/riscv/kvm/vcpu_onereg.c | 54 +++ .../selftests/kvm/riscv/get-reg-list.c | 439 ++++++++---------- 3 files changed, 265 insertions(+), 255 deletions(-) -- 2.34.1

1 year, 5 months

3
32
0 0

[PATCH 00/12] selftests/net: Add TCP-AO tests

by Dmitry Safonov

Hi, An essential part of any big kernel submissions is selftests. At the beginning of TCP-AO project, I made patches to fcnal-test.sh and nettest.c to have the benefits of easy refactoring, early noticing breakages, putting a moat around the code, documenting and designing uAPI. While tests based on fcnal-test.sh/nettest.c provided initial testing* and were very easy to add, the pile of TCP-AO quickly grew out of one-binary + shell-script testing. The design of the TCP-AO testing is a bit different than one-big selftest binary as I did previously in net/ipsec.c. I found it beneficial to avoid implementing a tests runner/scheduler and delegate it to the user or Makefile. The approach is very influenced by CRIU/ZDTM testing[1]: it provides a static library with helper functions and selftest binaries that create specific scenarios. I also tried to utilize kselftest.h. test_init() function does all needed preparations. To not leave any traces after a selftest exists, it creates a network namespace and if the test wants to establish a TCP connection, a child netns. The parent and child netns have veth pair with proper ip addresses and routes set up. Both peers, the client and server are different pthreads. The treading model was chosen over forking mostly by easiness of cleanup on a failure: no need to search for children, handle SIGCHLD, make sure not to wait for a dead peer to perform anything, etc. Any thread that does exit() naturally kills the tests, sweet! The selftests are compiled currently in two variants: ipv4 and ipv6. Ipv4-mapped-ipv6 addresses might be a third variant to add, but it's not there in this version. As pretty much all tests are shared between two address families, most of the code can be shared, too. To differ in code what kind of test is running, Makefile supplies -DIPV6_TEST to compiler and ifdeffery in tests can do things that have to be different between address families. This is similar to TARGETS_C_BOTHBITS in x86 selftests and also to tests code sharing in CRIU/ZDTM. The total number of tests is 832. From them rst_ipv{4,6} has currently one flaky subtest, that may fail: > not ok 9 client connection was not reset: 0 I'll investigate what happens there. Also, unsigned-md5_ipv{4,6} are flaky because of netns counter checks: it doesn't expect that there may be retransmitted TCP segments from a previous sub-selftest. That will be fixed. Besides, key-management_ipv{4,6} has 3 sub-tests passing with XFAIL: > ok 15 # XFAIL listen() after current/rnext keys set: the socket has current/rnext keys: 100:200 > ok 16 # XFAIL listen socket, delete current key from before listen(): failed to delete the key 100:100 -16 > ok 17 # XFAIL listen socket, delete rnext key from before listen(): failed to delete the key 200:200 -16 ... > # Totals: pass:117 fail:0 xfail:3 xpass:0 skip:0 error:0 Those need some more kernel work to pass instead of xfail. The overview of selftests (see the diffstat at the bottom): ├── lib │ ├── aolib.h │ │ The header for all selftests to include. │ ├── kconfig.c │ │ Kernel kconfig detector to SKIP tests that depend on something. │ ├── netlink.c │ │ Netlink helper to add/modify/delete VETH/IPs/routes/VRFs │ │ I considered just using libmnl, but this is around 400 lines │ │ and avoids selftests dependency on out-of-tree sources/packets. │ ├── proc.c │ │ SNMP/netstat procfs parser and the counters comparator. │ ├── repair.c │ │ Heavily influenced by libsoccr and reduced to minimum TCP │ │ socket checkpoint/repair. Shouldn't be used out of selftests, │ │ though. │ ├── setup.c │ │ All the needed netns/veth/ips/etc preparations for test init. │ ├── sock.c │ │ Socket helpers: {s,g}etsockopt()s/connect()/listen()/etc. │ └── utils.c │ Random stuff (a pun intended). ├── bench-lookups.c │ The only benchmark in selftests currently: checks how well TCP-AO │ setsockopt()s perform, depending on the amount of keys on a socket. ├── connect.c │ Trivial sample, can be used as a boilerplate to write a new test. ├── connect-deny.c │ More-or-less what could be expected for TCP-AO in fcnal-test.sh ├── icmps-accept.c -> icmps-discard.c ├── icmps-discard.c │ Verifies RFC5925 (7.8) by checking that TCP-AO connection can be │ broken if ICMPs are accepted and survives when ::accept_icmps = 0 ├── key-management.c │ Key manipulations, rotations between randomized hashing algorithms │ and counter checks for those scenarios. ├── restore.c │ TCP_AO_REPAIR: verifies that a socket can be re-created without │ TCP-AO connection being interrupted. ├── rst.c │ As RST segments are signed on a separate code-path in kernel, │ verifies passive/active TCP send_reset(). ├── self-connect.c │ Verifies that TCP self-connect and also simultaneous open work. ├── seq-ext.c │ Utilizes TCP_AO_REPAIR to check that on SEQ roll-over SNE │ increment is performed and segments with different SNEs fail to │ pass verification. ├── setsockopt-closed.c │ Checks that {s,g}etsockopt()s are extendable syscalls and common │ error-paths for them. └── unsigned-md5.c Checks listen() socket for (non-)matching peers with: AO/MD5/none keys. As well as their interaction with VRFs and AO_REQUIRED flag. There are certainly more test scenarios that can be added, but even so, I'm pretty happy that this much of TCP-AO functionality and uAPIs got covered. These selftests were iteratively developed by me during TCP-AO kernel upstreaming and the resulting kernel patches would have been worse without having these tests. They provided the user-side perspective but also allowed safer refactoring with less possibility of introducing a regression. Now it's time to use them to dig a moat around the TCP-AO code! There are also people from other network companies that work on TCP-AO (+testing), so sharing these selftests will allow them to contribute and may benefit from their efforts. The following changes since commit c7402612e2e61b76177f22e6e7f705adcbecc6fe: Merge tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-12-14 13:11:49 -0800) are available in the Git repository at: git@github.com:0x7f454c46/linux.git tcp-ao-selftests-v1 for you to fetch changes up to 85dc9bc676985d81f9043fd9c3a506f30851597b: selftests/net: Add TCP-AO key-management test (2023-12-15 00:44:49 +0000) ---------------------------------------------------------------- * Planning to submit basic TCP-AO tests to fcnal-test.sh/nettest.c separately. [1]: https://github.com/checkpoint-restore/criu/tree/criu-dev/test/zdtm/static Signed-off-by: Dmitry Safonov <dima(a)arista.com> --- Dmitry Safonov (12): selftests/net: Add TCP-AO library selftests/net: Verify that TCP-AO complies with ignoring ICMPs selftests/net: Add TCP-AO ICMPs accept test selftests/net: Add a test for TCP-AO keys matching selftests/net: Add test for TCP-AO add setsockopt() command selftests/net: Add TCP-AO + TCP-MD5 + no sign listen socket tests selftests/net: Add test/benchmark for removing MKTs selftests/net: Add TCP_REPAIR TCP-AO tests selftests/net: Add SEQ number extension test selftests/net: Add TCP-AO RST test selftests/net: Add TCP-AO selfconnect/simultaneous connect test selftests/net: Add TCP-AO key-management test tools/testing/selftests/Makefile | 1 + tools/testing/selftests/net/tcp_ao/.gitignore | 2 + tools/testing/selftests/net/tcp_ao/Makefile | 59 + tools/testing/selftests/net/tcp_ao/bench-lookups.c | 358 ++++++ tools/testing/selftests/net/tcp_ao/connect-deny.c | 264 +++++ tools/testing/selftests/net/tcp_ao/connect.c | 90 ++ tools/testing/selftests/net/tcp_ao/icmps-accept.c | 1 + tools/testing/selftests/net/tcp_ao/icmps-discard.c | 449 ++++++++ .../testing/selftests/net/tcp_ao/key-management.c | 1180 ++++++++++++++++++++ tools/testing/selftests/net/tcp_ao/lib/aolib.h | 605 ++++++++++ tools/testing/selftests/net/tcp_ao/lib/kconfig.c | 148 +++ tools/testing/selftests/net/tcp_ao/lib/netlink.c | 415 +++++++ tools/testing/selftests/net/tcp_ao/lib/proc.c | 273 +++++ tools/testing/selftests/net/tcp_ao/lib/repair.c | 254 +++++ tools/testing/selftests/net/tcp_ao/lib/setup.c | 342 ++++++ tools/testing/selftests/net/tcp_ao/lib/sock.c | 592 ++++++++++ tools/testing/selftests/net/tcp_ao/lib/utils.c | 30 + tools/testing/selftests/net/tcp_ao/restore.c | 236 ++++ tools/testing/selftests/net/tcp_ao/rst.c | 415 +++++++ tools/testing/selftests/net/tcp_ao/self-connect.c | 197 ++++ tools/testing/selftests/net/tcp_ao/seq-ext.c | 245 ++++ .../selftests/net/tcp_ao/setsockopt-closed.c | 835 ++++++++++++++ tools/testing/selftests/net/tcp_ao/unsigned-md5.c | 742 ++++++++++++ 23 files changed, 7733 insertions(+) --- base-commit: c7402612e2e61b76177f22e6e7f705adcbecc6fe change-id: 20231213-tcp-ao-selftests-d0f323006667 Best regards, -- Dmitry Safonov <dima(a)arista.com>

1 year, 5 months

3
27
0 0

[PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

by Lu Baolu

Hi folks, This series implements the functionality of delivering IO page faults to user space through the IOMMUFD framework for nested translation. Nested translation is a hardware feature that supports two-stage translation tables for IOMMU. The second-stage translation table is managed by the host VMM, while the first-stage translation table is owned by user space. This allows user space to control the IOMMU mappings for its devices. When an IO page fault occurs on the first-stage translation table, the IOMMU hardware can deliver the page fault to user space through the IOMMUFD framework. User space can then handle the page fault and respond to the device top-down through the IOMMUFD. This allows user space to implement its own IO page fault handling policies. User space indicates its capability of handling IO page faults by setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a hardware page table (HWPT). IOMMUFD will then set up its infrastructure for page fault delivery. On a successful return of HWPT allocation, the user can retrieve and respond to page faults by reading and writing to the file descriptor (FD) returned in out_fault_fd. The iommu selftest framework has been updated to test the IO page fault delivery and response functionality. This series is based on the latest implementation of nested translation under discussion [1] and the page fault handling framework refactoring in the IOMMU core [2]. The series and related patches are available on GitHub: [3] [1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.… [2] https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.i… [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-… Best regards, baolu Change log: v2: - Move all iommu refactoring patches into a sparated series and discuss it in a different thread. The latest patch series [v6] is available at https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.i… - We discussed the timeout of the pending page fault messages. We agreed that we shouldn't apply any timeout policy for the page fault handling in user space. https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/ - Jason suggested that we adopt a simple file descriptor interface for reading and responding to I/O page requests, so that user space applications can improve performance using io_uring. https://lore.kernel.org/linux-iommu/ZJWjD1ajeem6pK3I@ziepe.ca/ v1: https://lore.kernel.org/linux-iommu/20230530053724.232765-1-baolu.lu@linux.… Lu Baolu (6): iommu: Add iommu page fault cookie helpers iommufd: Add iommu page fault uapi data iommufd: Initializing and releasing IO page fault data iommufd: Deliver fault messages to user space iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF include/linux/iommu.h | 9 + drivers/iommu/iommu-priv.h | 15 + drivers/iommu/iommufd/iommufd_private.h | 12 + drivers/iommu/iommufd/iommufd_test.h | 8 + include/uapi/linux/iommufd.h | 65 +++++ tools/testing/selftests/iommu/iommufd_utils.h | 66 ++++- drivers/iommu/io-pgfault.c | 50 ++++ drivers/iommu/iommufd/device.c | 69 ++++- drivers/iommu/iommufd/hw_pagetable.c | 260 +++++++++++++++++- drivers/iommu/iommufd/selftest.c | 56 ++++ tools/testing/selftests/iommu/iommufd.c | 24 +- .../selftests/iommu/iommufd_fail_nth.c | 2 +- 12 files changed, 620 insertions(+), 16 deletions(-) -- 2.34.1

1 year, 5 months

6
42
0 0

[PATCH 0/3] vfio-pci support pasid attach/detach

by Yi Liu

This adds the pasid attach/detach uAPIs for userspace to attach/detach a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is enabled in this series. After this series, PASID-capable devices bound with vfio-pci can report PASID capability to userspace and VM to enable PASID usages like Shared Virtual Addressing (SVA). This series first adds the helpers for pasid attach in vfio core and then add the device cdev ioctls for pasid attach/detach, finally exposes the device PASID capability to user. It depends on iommufd pasid attach/detach series [1]. Complete code can be found at [2], tested with a draft Qemu branch[3] [1] https://lore.kernel.org/linux-iommu/20231127063428.127436-1-yi.l.liu@intel.… [2] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1%… Change log: v1: - Report PASID capability via VFIO_DEVICE_FEATURE (Alex) rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.c… Regards, Yi Liu Kevin Tian (1): vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu (2): vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl drivers/vfio/device_cdev.c | 45 +++++++++++++++++++++ drivers/vfio/iommufd.c | 48 ++++++++++++++++++++++ drivers/vfio/pci/vfio_pci.c | 2 + drivers/vfio/pci/vfio_pci_core.c | 47 ++++++++++++++++++++++ drivers/vfio/vfio.h | 4 ++ drivers/vfio/vfio_main.c | 8 ++++ include/linux/vfio.h | 11 ++++++ include/uapi/linux/vfio.h | 68 ++++++++++++++++++++++++++++++++ 8 files changed, 233 insertions(+) -- 2.34.1

1 year, 5 months

5
30
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror December 2023