August 2025 - Linux-kselftest-mirror

[PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach

by Bernd Edlinger

This introduces signal->exec_bprm, which is used to fix the case when at least one of the sibling threads is traced, and therefore the tracer process may dead-lock in ptrace_attach, but de_thread will need to wait for the tracer to continue execution.

The solution is to detect this situation and allow ptrace_attach to continue by temporarily releasing the cred_guard_mutex, while de_thread() is still waiting for traced zombies to be eventually released by the tracer. In the case of the thread group leader we only have to wait for the thread to become a zombie, which may also need co-operation from the tracer due to PTRACE_O_TRACEEXIT.

When a tracer wants to ptrace_attach a task that already is in execve, we simply retry the ptrace_may_access check while temporarily installing the new credentials and dumpability which are about to be used after execve completes. If the ptrace_attach happens on a thread that is a sibling-thread of the thread doing execve, it is sufficient to check against the old credentials, as this thread will be waited for, before the new credentials are installed.

Other threads die quickly since the cred_guard_mutex is released, but a fatal signal is already pending. In case the mutex_lock_killable misses the signal, the non-zero current->signal->exec_bprm makes sure they release the mutex immediately and return with -ERESTARTNOINTR.

This means there is no API change, unlike the previous version of this patch which was discussed here:
https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d…

See tools/testing/selftests/ptrace/vmaccess.c for a test case that gets fixed by this change.

Note that since the test case was originally designed to test the ptrace_attach returning an error in this situation, the test expectation needed to be adjusted, to allow the API to succeed at the first attempt.
Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
---
 fs/exec.c                                 | 69 ++++++++++++++++-------
 fs/proc/base.c                            |  6 ++
 include/linux/cred.h                      |  1 +
 include/linux/sched/signal.h              | 18 ++++++
 kernel/cred.c                             | 28 +++++++--
 kernel/ptrace.c                           | 32 +++++++++++
 kernel/seccomp.c                          | 12 +++-
 tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
 8 files changed, 155 insertions(+), 34 deletions(-)

v10: Changes to previous version, make the PTRACE_ATTACH return -EAGAIN, instead of execve return -ERESTARTSYS. Added some lessons learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without changing the API.

Note: I got actually one response from an automatic checker to the v11 patch,
https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
which is complaining about:

>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@

   417		struct linux_binprm *bprm = task->signal->exec_bprm;
   418		const struct cred *old_cred;
   419		struct mm_struct *old_mm;
   420
   421		retval = down_write_killable(&task->signal->exec_update_lock);
   422		if (retval)
   423			goto unlock_creds;
   424		task_lock(task);
 > 425		old_cred = task->real_cred;

v12: Essentially identical to v11.
- Fixed a minor merge conflict in linux v5.17, and fixed the above mentioned nit by adding __rcu to the declaration.
- re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change. But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned before de_thread starts, hope you like this version. If not, the v10 is of course also acceptable.

Thanks
Bernd.
diff --git a/fs/exec.c b/fs/exec.c
index 2f2b0acec4f0..902d3b230485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t = tsk;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	while_each_thread(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		sig->exec_bprm = bprm;
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		return retval;
 
+	/* If the binary is not readable then enforce mm->dumpable=0 */
+	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
+
+	/*
+	 * Figure out dumpability. Note that this checking only of current
+	 * is wrong, but userspace depends on it. This should be testing
+	 * bprm->secureexec instead.
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+	    is_dumpability_changed(current_cred(), bprm->cred) ||
+	    !(uid_eq(current_euid(), current_uid()) &&
+	      gid_eq(current_egid(), current_gid())))
+		set_dumpable(bprm->mm, suid_dumpable);
+	else
+		set_dumpable(bprm->mm, SUID_DUMP_USER);
+
 	/*
 	 * Ensure all future errors are fatal.
 	 */
@@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/*
 	 * Make this the only thread in the thread group.
 	 */
-	retval = de_thread(me);
+	retval = de_thread(me, bprm);
 	if (retval)
 		goto out;
 
@@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
@@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	me->sas_ss_sp = me->sas_ss_size = 0;
 
-	/*
-	 * Figure out dumpability. Note that this checking only of current
-	 * is wrong, but userspace depends on it. This should be testing
-	 * bprm->secureexec instead.
-	 */
-	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
-	    !(uid_eq(current_euid(), current_uid()) &&
-	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
-	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
-
 	perf_event_exec();
 	__set_task_comm(me, kbasename(bprm->filename), true);
 
@@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..0da9adfadb48 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsm,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f923528d5cc4..b01e309f5686 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
 extern int commit_creds(struct cred *);
 extern void abort_creds(struct cred *);
 extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..14df7073a0a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
+	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
+					 * against new credentials while
+					 * de_thread is waiting for other
+					 * traced threads to terminate.
+					 * Set while de_thread is executing.
+					 * The cred_guard_mutex is released
+					 * after de_thread() has called
+					 * zap_other_threads(), therefore
+					 * a fatal signal is guaranteed to be
+					 * already pending in the unlikely
+					 * event, that
+					 * current->signal->exec_bprm happens
+					 * to be non-zero after the
+					 * cred_guard_mutex was acquired.
+					 */
+
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace)
+					 * Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 * Deprecated do not use in new code.
 					 * Use exec_update_lock instead.
 					 */
diff --git a/kernel/cred.c b/kernel/cred.c
index 98cb4eca23fb..586cb6c7cf6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
 	return false;
 }
 
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ * Return: true  - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+	if (!uid_eq(old->euid, new->euid) ||
+	    !gid_eq(old->egid, new->egid) ||
+	    !uid_eq(old->fsuid, new->fsuid) ||
+	    !gid_eq(old->fsgid, new->fsgid) ||
+	    !cred_cap_issubset(old, new))
+		return true;
+
+	return false;
+}
+
 /**
  * commit_creds - Install new credentials upon the current task
  * @new: The credentials to be assigned
@@ -467,11 +489,7 @@ int commit_creds(struct cred *new)
 	get_cred(new); /* we will require a ref for the subj creds too */
 
 	/* dumpability changes */
-	if (!uid_eq(old->euid, new->euid) ||
-	    !gid_eq(old->egid, new->egid) ||
-	    !uid_eq(old->fsuid, new->fsuid) ||
-	    !gid_eq(old->fsgid, new->fsgid) ||
-	    !cred_cap_issubset(old, new)) {
+	if (is_dumpability_changed(old, new)) {
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..eb1c450bb7d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
+#include <linux/binfmts.h>
 #include <linux/signal.h>
 #include <linux/uio.h>
 #include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (retval)
 		goto unlock_creds;
 
+	if (unlikely(task->in_execve)) {
+		struct linux_binprm *bprm = task->signal->exec_bprm;
+		const struct cred __rcu *old_cred;
+		struct mm_struct *old_mm;
+
+		retval = down_write_killable(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+		task_lock(task);
+		old_cred = task->real_cred;
+		old_mm = task->mm;
+		rcu_assign_pointer(task->real_cred, bprm->cred);
+		task->mm = bprm->mm;
+		retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+		rcu_assign_pointer(task->real_cred, old_cred);
+		task->mm = old_mm;
+		task_unlock(task);
+		up_write(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	retval = -EPERM;
 	if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->exec_bprm)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
 TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
 
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
-	ASSERT_EQ(k, -1);
+	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
-	sleep(1);
-	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
-- 
2.39.2
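
The patch repeats one pattern in prepare_bprm_creds(), proc_pid_attr_write(), ptrace_traceme() and seccomp_set_mode_filter(): take cred_guard_mutex, then back off if signal->exec_bprm is set. The following is only a user-space model of that pattern, not kernel code. The names model_signal, model_lock_cred_guard and MODEL_ERESTARTNOINTR are made up for illustration, and a pthread mutex stands in for the kernel mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the kernel-internal restart code (513 in include/linux/errno.h). */
#define MODEL_ERESTARTNOINTR 513

/* Hypothetical stand-in for the two signal_struct fields used by the pattern. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;
	void *exec_bprm;	/* non-NULL while the exec'ing thread has dropped the mutex */
};

/*
 * Take the mutex, then re-check exec_bprm.
 * Returns 0 with the mutex held, or -MODEL_ERESTARTNOINTR with it released,
 * so the caller never proceeds while an execve is in its unsafe window.
 */
int model_lock_cred_guard(struct model_signal *sig)
{
	if (pthread_mutex_lock(&sig->cred_guard_mutex) != 0)
		return -MODEL_ERESTARTNOINTR;

	if (sig->exec_bprm != NULL) {
		pthread_mutex_unlock(&sig->cred_guard_mutex);
		return -MODEL_ERESTARTNOINTR;
	}
	return 0;
}
```

In the model, as in the patch, the caller that loses the race does not wait for the exec to finish; it drops the mutex at once and lets the restart (or the pending fatal signal) take over.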


[PATCH RFC 0/4] landlock: add LANDLOCK_SCOPE_MEMFD_EXEC execution

by Abhinav Saxena

This patch series introduces LANDLOCK_SCOPE_MEMFD_EXEC, a new Landlock scoping mechanism that restricts execution of anonymous memory file descriptors (memfd) created via memfd_create(2). This addresses security gaps where processes can bypass W^X policies and execute arbitrary code through anonymous memory objects.

Fixes: https://github.com/landlock-lsm/linux/issues/37

SECURITY PROBLEM
================

Current Landlock filesystem restrictions do not cover memfd objects, allowing processes to:

1. Read-to-execute bypass: Create writable memfd, inject code, then execute via mmap(PROT_EXEC) or direct execve()
2. Anonymous execution: Execute code without touching the filesystem via execve("/proc/self/fd/N") where N is a memfd descriptor
3. Cross-domain access violations: Pass memfd between processes to bypass domain restrictions

These scenarios can occur in sandboxed environments where filesystem access is restricted but memfd creation remains possible.

IMPLEMENTATION
==============

The implementation adds hierarchical execution control through domain scoping:

Core Components:
- is_memfd_file(): Reliable memfd detection via "memfd:" dentry prefix
- domain_is_scoped(): Cross-domain hierarchy checking (moved to domain.c)
- LSM hooks: mmap_file, file_mprotect, bprm_creds_for_exec
- Creation-time restrictions: hook_file_alloc_security

Security Matrix:
Execution decisions follow domain hierarchy rules preventing both same-domain bypass attempts and cross-domain access violations while preserving legitimate hierarchical access patterns.

Domain Hierarchy with LANDLOCK_SCOPE_MEMFD_EXEC:
===============================================

Root (no domain) - No restrictions
|
+-- Domain A [SCOPE_MEMFD_EXEC] Layer 1
|   +-- memfd_A (tagged with Domain A as creator)
|   |
|   +-- Domain A1 (child) [NO SCOPE] Layer 2
|   |   +-- Inherits Layer 1 restrictions from parent
|   |   +-- memfd_A1 (can create, inherits restrictions)
|   |   +-- Domain A1a [SCOPE_MEMFD_EXEC] Layer 3
|   |       +-- memfd_A1a (tagged with Domain A1a)
|   |
|   +-- Domain A2 (child) [SCOPE_MEMFD_EXEC] Layer 2
|       +-- memfd_A2 (tagged with Domain A2 as creator)
|       +-- CANNOT access memfd_A1 (different subtree)
|
+-- Domain B [SCOPE_MEMFD_EXEC] Layer 1
    +-- memfd_B (tagged with Domain B as creator)
    +-- CANNOT access ANY memfd from Domain A subtree

Execution Decision Matrix:
========================

Executor->  |  A  | A1 | A1a | A2 | B  | Root
Creator     |     |    |     |    |    |
------------|-----|----|-----|----|----|-----
Domain A    |  X  | X  |  X  | X  | X  |  Y
Domain A1   |  Y  | X  |  X  | X  | X  |  Y
Domain A1a  |  Y  | Y  |  X  | X  | X  |  Y
Domain A2   |  Y  | X  |  X  | X  | X  |  Y
Domain B    |  X  | X  |  X  | X  | X  |  Y
Root        |  Y  | Y  |  Y  | Y  | Y  |  Y

Legend: Y = Execution allowed, X = Execution denied

Scenarios Covered:
- Direct mmap(PROT_EXEC) on memfd files
- Two-stage mmap(PROT_READ) + mprotect(PROT_EXEC) bypass attempts
- execve("/proc/self/fd/N") anonymous execution
- execveat() and fexecve() file descriptor execution
- Cross-process memfd inheritance and IPC passing

TESTING
=======

All patches have been validated with:
- scripts/checkpatch.pl --strict (clean)
- Selftests covering same-domain restrictions, cross-domain hierarchy enforcement, and regular file isolation
- KUnit tests for memfd detection edge cases

DISCLAIMER
==========

My understanding of Landlock scoping semantics may be limited, but this implementation reflects my current understanding based on available documentation and code analysis. I welcome feedback and corrections regarding the scoping logic and domain hierarchy enforcement.

Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (4):
      landlock: add LANDLOCK_SCOPE_MEMFD_EXEC scope
      landlock: implement memfd detection
      landlock: add memfd exec LSM hooks and scoping
      selftests/landlock: add memfd execution tests

 include/uapi/linux/landlock.h                      |   5 +
 security/landlock/.kunitconfig                     |   1 +
 security/landlock/audit.c                          |   4 +
 security/landlock/audit.h                          |   1 +
 security/landlock/cred.c                           |  14 -
 security/landlock/domain.c                         |  67 ++++
 security/landlock/domain.h                         |   4 +
 security/landlock/fs.c                             | 405 ++++++++++++++++++++-
 security/landlock/limits.h                         |   2 +-
 security/landlock/task.c                           |  67 ----
 .../selftests/landlock/scoped_memfd_exec_test.c    | 325 +++++++++++++++++
 11 files changed, 812 insertions(+), 83 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250716-memfd-exec-ac0d582018c3

Best regards,
-- 
Abhinav Saxena <xandfury(a)gmail.com>
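
To make the decision matrix and the "memfd:" prefix check easier to reason about, here is a small user-space sketch. It only transcribes the table from the cover letter into a lookup array and mimics the prefix test on a plain string. The enum labels and the functions memfd_exec_allowed() and name_looks_like_memfd() are hypothetical and are not the series' kernel implementation.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical labels for the domains named in the cover letter. */
enum domain { DOM_A, DOM_A1, DOM_A1A, DOM_A2, DOM_B, DOM_ROOT, DOM_COUNT };

/* allowed[creator][executor], copied from the matrix: Y = true, X = false. */
static const bool allowed[DOM_COUNT][DOM_COUNT] = {
	/* executor:    A      A1     A1a    A2     B      Root */
	[DOM_A]    = { false, false, false, false, false, true },
	[DOM_A1]   = { true,  false, false, false, false, true },
	[DOM_A1A]  = { true,  true,  false, false, false, true },
	[DOM_A2]   = { true,  false, false, false, false, true },
	[DOM_B]    = { false, false, false, false, false, true },
	[DOM_ROOT] = { true,  true,  true,  true,  true,  true },
};

/* Look up whether `executor` may execute a memfd created by `creator`. */
bool memfd_exec_allowed(enum domain creator, enum domain executor)
{
	return allowed[creator][executor];
}

/* String-level analogue of detecting a memfd by its "memfd:" name prefix. */
bool name_looks_like_memfd(const char *dentry_name)
{
	return strncmp(dentry_name, "memfd:", 6) == 0;
}
```

Writing the table out this way makes it easy to spot-check individual cells, for example that a memfd created in Domain A1a is executable from A and A1 but not from A1a itself.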


[PATCH] selftests/seccomp: improve backwards compatibility for older kernels

by Wake Liu

This commit introduces checks for kernel version and seccomp filter flag support to the seccomp selftests. It also includes conditional header inclusions using __GLIBC_PREREQ.

Some tests were gated by kernel version, and adjustments were made for flags introduced after kernel 5.4. This ensures the selftests can run and pass correctly on kernel versions 5.4 and later, preventing failures due to features not present in older kernels.

The use of __GLIBC_PREREQ ensures proper compilation and functionality across different glibc versions in a mainline Linux kernel context. While it might appear redundant in specific build environments due to global overrides, it is crucial for upstream correctness and portability.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 108 ++++++++++++++++--
 1 file changed, 99 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..9b660cff5a4a 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -13,12 +13,14 @@
  * we need to use the kernel's siginfo.h file and trick glibc
  * into accepting it.
  */
+#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
 #if !__GLIBC_PREREQ(2, 26)
 # include <asm/siginfo.h>
 # define __have_siginfo_t 1
 # define __have_sigval_t 1
 # define __have_sigevent_t 1
 #endif
+#endif
 
 #include <errno.h>
 #include <linux/filter.h>
@@ -300,6 +302,26 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+int seccomp_flag_supported(int flag)
+{
+	/*
+	 * Probes if a seccomp filter flag is supported by the kernel.
+	 *
+	 * When an unsupported flag is passed to seccomp(SECCOMP_SET_MODE_FILTER, ...),
+	 * the kernel returns EINVAL.
+	 *
+	 * When a supported flag is passed, the kernel proceeds to validate the
+	 * filter program pointer. By passing NULL for the filter program,
+	 * the kernel attempts to dereference a bad address, resulting in EFAULT.
+	 *
+	 * Therefore, checking for EFAULT indicates that the flag itself was
+	 * recognized and supported by the kernel.
+	 */
+	if (seccomp(SECCOMP_SET_MODE_FILTER, flag, NULL) == -1 && errno == EFAULT)
+		return 1;
+	return 0;
+}
+
 #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
 #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
 #elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
@@ -2436,13 +2458,12 @@ TEST(detect_seccomp_filter_flags)
 		ASSERT_NE(ENOSYS, errno) {
 			TH_LOG("Kernel does not support seccomp syscall!");
 		}
-		EXPECT_EQ(-1, ret);
-		EXPECT_EQ(EFAULT, errno) {
-			TH_LOG("Failed to detect that a known-good filter flag (0x%X) is supported!",
-			       flag);
-		}
 
-		all_flags |= flag;
+		if (seccomp_flag_supported(flag))
+			all_flags |= flag;
+		else
+			TH_LOG("Filter flag (0x%X) is not found to be supported!",
+			       flag);
 	}
 
 	/*
@@ -2870,6 +2891,12 @@ TEST_F(TSYNC, two_siblings_with_one_divergence)
 
 TEST_F(TSYNC, two_siblings_with_one_divergence_no_tid_in_err)
 {
+	/* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+	if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+		SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+		return;
+	}
+
 	long ret, flags;
 	void *status;
 
@@ -3475,6 +3502,11 @@ TEST(user_notification_basic)
 
 TEST(user_notification_with_tsync)
 {
+	/* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+	if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+		SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+		return;
+	}
 	int ret;
 	unsigned int flags;
 
@@ -3966,6 +3998,13 @@ TEST(user_notification_filter_empty)
 
 TEST(user_ioctl_notification_filter_empty)
 {
+	/* Depends on 95036a7 (seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV
+	 * when all users have exited) */
+	if (!ksft_min_kernel_version(6, 11)) {
+		SKIP(return, "Kernel version < 6.11");
+		return;
+	}
+
 	pid_t pid;
 	long ret;
 	int status, p[2];
@@ -4119,6 +4158,12 @@ int get_next_fd(int prev_fd)
 
 TEST(user_notification_addfd)
 {
+	/* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+	if (!ksft_min_kernel_version(5, 14)) {
+		SKIP(return, "Kernel version < 5.14");
+		return;
+	}
+
 	pid_t pid;
 	long ret;
 	int status, listener, memfd, fd, nextfd;
@@ -4281,6 +4326,12 @@ TEST(user_notification_addfd)
 
 TEST(user_notification_addfd_rlimit)
 {
+	/* Depends on 7cf97b1 (seccomp: Introduce addfd ioctl to seccomp user notifier) */
+	if (!ksft_min_kernel_version(5, 9)) {
+		SKIP(return, "Kernel version < 5.9");
+		return;
+	}
+
 	pid_t pid;
 	long ret;
 	int status, listener, memfd;
@@ -4326,9 +4377,12 @@ TEST(user_notification_addfd_rlimit)
 	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
 	EXPECT_EQ(errno, EMFILE);
 
-	addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
-	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
-	EXPECT_EQ(errno, EMFILE);
+	/* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+	if (ksft_min_kernel_version(5, 14)) {
+		addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+		EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+		EXPECT_EQ(errno, EMFILE);
+	}
 
 	addfd.newfd = 100;
 	addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
@@ -4356,6 +4410,12 @@ TEST(user_notification_addfd_rlimit)
 
 TEST(user_notification_sync)
 {
+	/* Depends on 48a1084 (seccomp: add the synchronous mode for seccomp_unotify) */
+	if (!ksft_min_kernel_version(6, 6)) {
+		SKIP(return, "Kernel version < 6.6");
+		return;
+	}
+
 	struct seccomp_notif req = {};
 	struct seccomp_notif_resp resp = {};
 	int status, listener;
@@ -4520,6 +4580,12 @@ static char get_proc_stat(struct __test_metadata *_metadata, pid_t pid)
 
 TEST(user_notification_fifo)
 {
+	/* Depends on 4cbf6f6 (seccomp: Use FIFO semantics to order notifications) */
+	if (!ksft_min_kernel_version(5, 19)) {
+		SKIP(return, "Kernel version < 5.19");
+		return;
+	}
+
 	struct seccomp_notif_resp resp = {};
 	struct seccomp_notif req = {};
 	int i, status, listener;
@@ -4623,6 +4689,12 @@ static long get_proc_syscall(struct __test_metadata *_metadata, int pid)
 /* Ensure non-fatal signals prior to receive are unmodified */
 TEST(user_notification_wait_killable_pre_notification)
 {
+	/* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+	if (!ksft_min_kernel_version(5, 19)) {
+		SKIP(return, "Kernel version < 5.19");
+		return;
+	}
+
 	struct sigaction new_action = {
 		.sa_handler = signal_handler,
 	};
@@ -4693,6 +4765,12 @@ TEST(user_notification_wait_killable_pre_notification)
 /* Ensure non-fatal signals after receive are blocked */
 TEST(user_notification_wait_killable)
 {
+	/* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+	if (!ksft_min_kernel_version(5, 19)) {
+		SKIP(return, "Kernel version < 5.19");
+		return;
+	}
+
 	struct sigaction new_action = {
 		.sa_handler = signal_handler,
 	};
@@ -4772,6 +4850,12 @@ TEST(user_notification_wait_killable)
 /* Ensure fatal signals after receive are not blocked */
 TEST(user_notification_wait_killable_fatal)
 {
+	/* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+	if (!ksft_min_kernel_version(5, 19)) {
+		SKIP(return, "Kernel version < 5.19");
+		return;
+	}
+
 	struct seccomp_notif req = {};
 	int listener, status;
 	pid_t pid;
@@ -4854,6 +4938,12 @@ static void *tsync_vs_dead_thread_leader_sibling(void *_args)
  */
 TEST(tsync_vs_dead_thread_leader)
 {
+	/* Depends on bfafe5e (seccomp: release task filters when the task exits) */
+	if (!ksft_min_kernel_version(6, 11)) {
+		SKIP(return, "Kernel version < 6.11");
+		return;
+	}
+
 	int status;
 	pid_t pid;
 	long ret;
-- 
2.50.1.703.g449372360f-goog
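
The probe in seccomp_flag_supported() folds every non-EFAULT outcome into "not supported". As a hedged illustration of the reasoning in its comment, the sketch below separates the three cases that comment distinguishes. The enum and classify_flag_probe() are hypothetical names and not part of the patch; the function only interprets a return value and errno that the caller obtained from seccomp(SECCOMP_SET_MODE_FILTER, flag, NULL).

```c
#include <errno.h>

/* Hypothetical three-way outcome of probing one filter flag with a NULL filter. */
enum probe_result { FLAG_SUPPORTED, FLAG_UNSUPPORTED, PROBE_INCONCLUSIVE };

/*
 * ret and err are the return value and errno of the probing call.
 * EFAULT: the flag passed validation and the kernel went on to the NULL filter.
 * EINVAL: the flag itself was rejected.
 * Anything else (for example ENOSYS) says nothing about the flag.
 */
enum probe_result classify_flag_probe(long ret, int err)
{
	if (ret == -1 && err == EFAULT)
		return FLAG_SUPPORTED;
	if (ret == -1 && err == EINVAL)
		return FLAG_UNSUPPORTED;
	return PROBE_INCONCLUSIVE;
}
```

Keeping the inconclusive case apart would let a test skip with a different message when the seccomp syscall is missing altogether, which the patch instead checks through a separate ASSERT_NE(ENOSYS, errno).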


[PATCH] selftests/timers: Skip some posix_timers tests on kernels < 6.13

by Wake Liu

Several tests in the posix_timers selftest fail on kernels older than 6.13. These tests check for timer behavior related to SIG_IGN, which was refactored in the 6.13 kernel cycle, notably by commit caf77435dd8a ("signal: Handle ignored signals in do_sigaction(action != SIG_IGN)").

To ensure the selftests pass on older, stable kernels, gate the affected tests with a ksft_min_kernel_version(6, 13) check.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 tools/testing/selftests/timers/posix_timers.c | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/tools/testing/selftests/timers/posix_timers.c b/tools/testing/selftests/timers/posix_timers.c
index f0eceb0faf34..f228e51f8b58 100644
--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -256,6 +256,11 @@ static void *ignore_thread(void *arg)
 
 static void check_sig_ign(int thread)
 {
+	if (!ksft_min_kernel_version(6, 13)) {
+		// see caf77435dd8a
+		ksft_test_result_skip("Depends on refactor of posix timers in 6.13\n");
+		return;
+	}
 	struct tmrsig tsig = { };
 	struct itimerspec its;
 	unsigned int tid = 0;
@@ -342,6 +347,10 @@ static void check_sig_ign(int thread)
 
 static void check_rearm(void)
 {
+	if (!ksft_min_kernel_version(6, 13)) {
+		ksft_test_result_skip("Depends on refactor of posix timers in 6.13\n");
+		return;
+	}
 	struct tmrsig tsig = { };
 	struct itimerspec its;
 	struct sigaction sa;
@@ -398,6 +407,10 @@ static void check_rearm(void)
 
 static void check_delete(void)
 {
+	if (!ksft_min_kernel_version(6, 13)) {
+		ksft_test_result_skip("Depends on refactor of posix timers in 6.13\n");
+		return;
+	}
 	struct tmrsig tsig = { };
 	struct itimerspec its;
 	struct sigaction sa;
@@ -455,6 +468,10 @@ static inline int64_t calcdiff_ns(struct timespec t1, struct timespec t2)
 
 static void check_sigev_none(int which, const char *name)
 {
+	if (!ksft_min_kernel_version(6, 13)) {
+		ksft_test_result_skip("Depends on refactor of posix timers in 6.13\n");
+		return;
+	}
 	struct timespec start, now;
 	struct itimerspec its;
 	struct sigevent sev;
@@ -493,6 +510,10 @@ static void check_sigev_none(int which, const char *name)
 
 static void check_gettime(int which, const char *name)
 {
+	if (!ksft_min_kernel_version(6, 13)) {
+		ksft_test_result_skip("Depends on refactor of posix timers in 6.13\n");
+		return;
+	}
 	struct itimerspec its, prev;
 	struct timespec start, now;
 	struct sigevent sev;
-- 
2.50.1.703.g449372360f-goog
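
The gating above relies on comparing the running kernel's version against a minimum. As a rough sketch of how such a comparison can work, the function below parses a release string of the form "major.minor..." and compares it with a threshold. release_at_least() is a hypothetical helper written for this note; it is not the kselftest implementation of ksft_min_kernel_version(), which may differ in parsing and error handling.

```c
#include <stdio.h>

/*
 * Compare a kernel release string such as "6.13.0-rc1" with a minimum version.
 * Returns 1 if release >= min_major.min_minor, 0 if it is older,
 * and -1 if the string does not start with "<number>.<number>".
 */
int release_at_least(const char *release, unsigned int min_major,
		     unsigned int min_minor)
{
	unsigned int major, minor;

	if (sscanf(release, "%u.%u", &major, &minor) != 2)
		return -1;

	return major > min_major ||
	       (major == min_major && minor >= min_minor);
}
```

A caller would typically pass the release field filled in by uname(2). Returning -1 for an unparsable string leaves it to the caller to decide whether that should skip or fail the test.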


[PATCH v4 00/23] ARM64 PMU Partitioning

by Colton Lewis

This series creates a new PMU scheme on ARM, a partitioned PMU that allows reserving a subset of counters for more direct guest access, significantly reducing overhead. More details, including performance benchmarks, can be read in the v1 cover letter linked below.

v4:
* Apply Mark Brown's non-UNDEF FGT control commit to the PMU FGT controls and calculate those controls with the others in kvm_calculate_traps()
* Introduce lazy context swaps, which only turn on for guests that have enabled partitioning and accessed PMU registers.
* Rename pmu-part.c to pmu-direct.c because future features might achieve direct PMU access without partitioning.
* Better explain certain commits, such as why the untrapped registers are safe to untrap.
* Reduce the PMU include cleanup down to only what is still necessary and explain why.

v3: https://lore.kernel.org/kvm/20250626200459.1153955-1-coltonlewis@google.com/
v2: https://lore.kernel.org/kvm/20250620221326.1261128-1-coltonlewis@google.com/
v1: https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/

Colton Lewis (21):
  arm64: cpufeature: Add cpucap for HPMN0
  KVM: arm64: Reorganize PMU functions
  perf: arm_pmuv3: Introduce method to partition the PMU
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Keep out of guest counter partition
  KVM: arm64: Account for partitioning in kvm_pmu_get_max_counters()
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Writethrough trapped PMEVTYPER register
  KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
  KVM: arm64: Writethrough trapped PMOVS register
  KVM: arm64: Write fast path PMU register handlers
  KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU
  KVM: arm64: Account for partitioning in PMCR_EL0 access
  KVM: arm64: Context swap Partitioned PMU guest registers
  KVM: arm64: Enforce PMU event filter at vcpu_load()
  KVM: arm64: Extract enum debug_owner to enum vcpu_register_owner
  KVM: arm64: Implement lazy PMU context swaps
  perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Inject recorded guest interrupts
  KVM: arm64: Add ioctl to partition the PMU when supported
  KVM: arm64: selftests: Add test case for partitioned PMU

Marc Zyngier (1):
  KVM: arm64: Reorganize PMU includes

Mark Brown (1):
  KVM: arm64: Introduce non-UNDEF FGT control

 Documentation/virt/kvm/api.rst | 21 +
 arch/arm/include/asm/arm_pmuv3.h | 38 +
 arch/arm64/include/asm/arm_pmuv3.h | 61 +-
 arch/arm64/include/asm/kvm_host.h | 34 +-
 arch/arm64/include/asm/kvm_pmu.h | 123 +++
 arch/arm64/include/asm/kvm_types.h | 7 +-
 arch/arm64/kernel/cpufeature.c | 8 +
 arch/arm64/kvm/Makefile | 2 +-
 arch/arm64/kvm/arm.c | 22 +
 arch/arm64/kvm/debug.c | 33 +-
 arch/arm64/kvm/hyp/include/hyp/debug-sr.h | 6 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h | 181 ++++-
 arch/arm64/kvm/pmu-direct.c | 395 ++++++++++
 arch/arm64/kvm/pmu-emul.c | 674 +---------------
 arch/arm64/kvm/pmu.c | 725 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c | 137 +++-
 arch/arm64/tools/cpucaps | 1 +
 arch/arm64/tools/sysreg | 6 +-
 drivers/perf/arm_pmuv3.c | 128 +++-
 include/linux/perf/arm_pmu.h | 1 +
 include/linux/perf/arm_pmuv3.h | 14 +-
 include/uapi/linux/kvm.h | 4 +
 tools/include/uapi/linux/kvm.h | 2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c | 62 +-
 24 files changed, 1910 insertions(+), 775 deletions(-)
 create mode 100644 arch/arm64/kvm/pmu-direct.c

base-commit: 79150772457f4d45e38b842d786240c36bb1f97f
--
2.50.0.727.gbf7dc18ff4-goog
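
To make the idea of "reserving a subset of counters" concrete, the following is a minimal user-space sketch, not code from the series. It assumes the usual MDCR_EL2.HPMN convention, where event counters below HPMN belong to the guest and the remaining ones stay with the host. The names `pmu_partition`, `counter_range_mask()` and `partition_counters()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: one bit per event counter. */
struct pmu_partition {
	uint32_t guest_mask;	/* counters the guest may access directly */
	uint32_t host_mask;	/* counters kept for the host */
};

/* Bitmask covering counter indices [lo, hi), with lo <= hi <= 32. */
uint32_t counter_range_mask(unsigned int lo, unsigned int hi)
{
	assert(lo <= hi && hi <= 32);
	uint64_t upper = (1ull << hi) - 1;
	uint64_t lower = (1ull << lo) - 1;
	return (uint32_t)(upper & ~lower);
}

/*
 * Split nr_counters event counters at index hpmn. Assumption: counters
 * [0, hpmn) go to the guest and [hpmn, nr_counters) stay with the host.
 */
struct pmu_partition partition_counters(unsigned int nr_counters,
					unsigned int hpmn)
{
	struct pmu_partition p;

	if (nr_counters > 31)	/* PMUv3 has at most 31 event counters */
		nr_counters = 31;
	if (hpmn > nr_counters)
		hpmn = nr_counters;

	p.guest_mask = counter_range_mask(0, hpmn);
	p.host_mask = counter_range_mask(hpmn, nr_counters);
	return p;
}
```

With six counters and a split at 4, the sketch gives the guest counters 0-3 and leaves counters 4-5 to the host. A split at 0 gives the guest nothing, which is the case the HPMN0 cpucap in the patch list appears to address.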

1 month

3
31

[PATCH nf-next v6 0/2] Add IPIP flowtable SW acceleration

by Lorenzo Bianconi

Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure.

---
Changes in v6:
- Rebase on top of nf-next main branch
- Link to v5: https://lore.kernel.org/r/20250721-nf-flowtable-ipip-v5-0-0865af9e58c6@kern…

Changes in v5:
- Rely on __ipv4_addr_hash() to compute the hash used as encap ID
- Remove unnecessary pskb_may_pull() in nf_flow_tuple_encap()
- Add nf_flow_ip4_ecanp_pop utility routine
- Link to v4: https://lore.kernel.org/r/20250718-nf-flowtable-ipip-v4-0-f8bb1c18b986@kern…

Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…

Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…

Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…

---
Lorenzo Bianconi (2):
  net: netfilter: Add IPIP flowtable SW acceleration
  selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest

 include/linux/netdevice.h | 1 +
 net/ipv4/ipip.c | 28 +++++++++++
 net/netfilter/nf_flow_table_ip.c | 56 +++++++++++++++++++++-
 net/netfilter/nft_flow_offload.c | 1 +
 .../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++
 5 files changed, 124 insertions(+), 2 deletions(-)
---
base-commit: bab3ce404553de56242d7b09ad7ea5b70441ea41
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067

Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>

1 month

2
5

[PATCH kvm-next V11 0/7] Add NUMA mempolicy support for KVM guest-memfd

by Shivank Garg

This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17) that enabled host-mapping for guest_memfd memory [1] and can be applied directly on the KVM tree [2] (branch kvm-next, base commit: a6ad5413, Merge branch 'guest-memfd-mmap' into HEAD).

== Background ==

KVM's guest-memfd memory backend currently lacks support for NUMA policy enforcement, causing guest memory allocations to be distributed across host nodes according to the kernel's default behavior, irrespective of any policy specified by the VMM. This limitation arises because conventional userspace NUMA control mechanisms like mbind(2) don't work, since the memory isn't directly mapped to userspace when allocations occur. Fuad's work [1] provides the necessary mmap capability, and this series leverages it to enable mbind(2).

== Implementation ==

This series implements proper NUMA policy support for guest-memfd by:

1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache, kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA policy.

With these changes, VMMs can now control guest memory placement by mapping the guest_memfd file descriptor and using mbind(2) to specify:

- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation

These policies affect only future allocations and do not migrate existing memory. This matches mbind(2)'s default behavior, which affects only new allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (not supported for guest_memfd, as it is unmovable by design).

== Upstream Plan ==

Phased approach as per David's guest_memfd extension overview [3] and community calls [4]:

Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work [1].

Phase 2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [5].

This series provides a clean integration path for NUMA-aware memory management for guest_memfd and lays the groundwork for future confidential computing NUMA capabilities.

Thanks,
Shivank

== Changelog ==

- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on suggestions from David [6] and guest_memfd bi-weekly upstream call discussion [7]).
- v7: Use inodes to store NUMA policy instead of file [8].
- v8: Rebase on top of Fuad's V12: Host mmaping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments.
- V10: Rebase on top of Fuad's V17. Use latest guest_memfd inode patch from Ackerley (with David's review comments). Use newer kmem_cache_create() API variant with arg parameter (Vlastimil).
- V11: Rebase on kvm-next, remove RFC tag, use Ackerley's latest patch and fix an RCU race bug during kvm module unload.
[1] https://lore.kernel.org/all/20250729225455.670324-1-seanjc@google.com
[2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[4] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[5] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[6] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[7] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[8] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com

Ackerley Tng (1):
  KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes

Matthew Wilcox (Oracle) (2):
  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
  mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies

Shivank Garg (4):
  mm/mempolicy: Export memory policy symbols
  KVM: guest_memfd: Add slab-allocated inode cache
  KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
  KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support

 fs/bcachefs/fs-io-buffered.c | 2 +-
 fs/btrfs/compression.c | 4 +-
 fs/btrfs/verity.c | 2 +-
 fs/erofs/zdata.c | 2 +-
 fs/f2fs/compress.c | 2 +-
 include/linux/pagemap.h | 18 +-
 include/uapi/linux/magic.h | 1 +
 mm/filemap.c | 23 +-
 mm/mempolicy.c | 6 +
 mm/readahead.c | 2 +-
 tools/testing/selftests/kvm/Makefile.kvm | 1 +
 .../testing/selftests/kvm/guest_memfd_test.c | 121 ++++++++
 virt/kvm/guest_memfd.c | 262 ++++++++++++++++--
 virt/kvm/kvm_main.c | 7 +-
 virt/kvm/kvm_mm.h | 9 +-
 15 files changed, 412 insertions(+), 50 deletions(-)

--
2.43.0

---
== Earlier Postings ==

v10: https://lore.kernel.org/all/20250811090605.16057-2-shivankg@amd.com
v9: https://lore.kernel.org/all/20250713174339.13981-2-shivankg@amd.com
v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
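
The cover letter above describes a VMM mapping the guest_memfd file descriptor and calling mbind(2) on the mapping. The sketch below shows what the mbind(2) side of that could look like. It is an illustration under assumptions, not code from the series: the helper names `node_list_to_mask()` and `bind_range_to_nodes()` are hypothetical, `SKETCH_MPOL_BIND` mirrors the value of `MPOL_BIND` from `<linux/mempolicy.h>`, and the raw syscall is used so that no libnuma dependency is needed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same numeric value as MPOL_BIND in <linux/mempolicy.h>. */
#define SKETCH_MPOL_BIND 2

/* Build a single-word node bitmask from a list of host NUMA node ids. */
unsigned long node_list_to_mask(const int *nodes, size_t count)
{
	unsigned long mask = 0;

	assert(nodes != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		/* Ignore ids that do not fit into one unsigned long. */
		if (nodes[i] >= 0 && (size_t)nodes[i] < 8 * sizeof(mask))
			mask |= 1UL << nodes[i];
	}
	return mask;
}

/*
 * Ask the kernel to satisfy future allocations in [addr, addr + len)
 * from the nodes in nodemask. In a VMM, addr would be the result of
 * mmap() on the guest_memfd file descriptor (assumption: the mapping
 * exists already). Returns 0 on success, -1 with errno set on failure.
 */
long bind_range_to_nodes(void *addr, size_t len, unsigned long nodemask)
{
	return syscall(SYS_mbind, addr, len, SKETCH_MPOL_BIND,
		       &nodemask, 8 * sizeof(nodemask), 0U);
}
```

Passing 0 as the flags argument matches the behavior described above: only future allocations are affected and nothing is migrated. The call can fail on kernels without NUMA support or in sandboxes that filter mbind(2), so callers should check the return value.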

1 month

9
37

[PATCH 0/2] selftests: Centralize kselftest headers to avoid relative includes

by Bala-Vignesh-Reddy

This series centralizes the handling of kselftest.h and kselftest_harness.h includes in selftests, replacing relative paths with a non-relative approach using a shared -I path.

Patch 1 updates the build files (Makefile and lib.mk) and includes CFLAGS in sync/Makefile to resolve a "not found" error.
Patch 2 applies the bulk source change (patch 2 is large, but the replacement was done automatically).

Checked the changes with gcc-13.32 and clang 18.1

Suggested-by: Andrew Morton <akpm(a)linux-foundation.org>
Link: https://lore.kernel.org/lkml/20250820143954.33d95635e504e94df01930d0@linux-…

Bala-Vignesh-Reddy (2):
  selftests: Centralize include path for kselftest.h and kselftest_harness.h
  selftests: Replace relative includes with non-relative for kselftest.h and kselftest_harness.h

Signed-off-by: Bala-Vignesh-Reddy <reddybalavignesh9979(a)gmail.com>

tools/testing/selftests/Makefile | 4 ++++ tools/testing/selftests/acct/acct_syscall.c | 2 +- tools/testing/selftests/alsa/conf.c | 2 +- tools/testing/selftests/alsa/mixer-test.c | 2 +- tools/testing/selftests/alsa/pcm-test.c | 2 +- tools/testing/selftests/alsa/test-pcmtest-driver.c | 2 +- tools/testing/selftests/alsa/utimer-test.c | 2 +- tools/testing/selftests/arm64/abi/hwcap.c | 2 +- tools/testing/selftests/arm64/abi/ptrace.c | 2 +- tools/testing/selftests/arm64/abi/syscall-abi.c | 2 +- tools/testing/selftests/arm64/fp/fp-ptrace.c | 2 +- tools/testing/selftests/arm64/fp/fp-stress.c | 2 +- tools/testing/selftests/arm64/fp/sve-probe-vls.c | 2 +- tools/testing/selftests/arm64/fp/sve-ptrace.c | 2 +- tools/testing/selftests/arm64/fp/vec-syscfg.c | 2 +- tools/testing/selftests/arm64/fp/za-ptrace.c | 2 +- tools/testing/selftests/arm64/fp/zt-ptrace.c | 2 +- tools/testing/selftests/arm64/gcs/gcs-stress.c | 2 +- tools/testing/selftests/arm64/pauth/pac.c | 2 +- tools/testing/selftests/arm64/tags/tags_test.c | 2 +- tools/testing/selftests/bpf/xskxceiver.c | 2 +- tools/testing/selftests/breakpoints/breakpoint_test.c | 2 +- 
tools/testing/selftests/breakpoints/breakpoint_test_arm64.c | 2 +- tools/testing/selftests/breakpoints/step_after_suspend_test.c | 2 +- tools/testing/selftests/cachestat/test_cachestat.c | 2 +- tools/testing/selftests/capabilities/test_execve.c | 2 +- tools/testing/selftests/capabilities/validate_cap.c | 2 +- tools/testing/selftests/cgroup/test_core.c | 2 +- tools/testing/selftests/cgroup/test_cpu.c | 2 +- tools/testing/selftests/cgroup/test_cpuset.c | 2 +- tools/testing/selftests/cgroup/test_freezer.c | 2 +- tools/testing/selftests/cgroup/test_hugetlb_memcg.c | 2 +- tools/testing/selftests/cgroup/test_kill.c | 2 +- tools/testing/selftests/cgroup/test_kmem.c | 2 +- tools/testing/selftests/cgroup/test_memcontrol.c | 2 +- tools/testing/selftests/cgroup/test_pids.c | 2 +- tools/testing/selftests/cgroup/test_zswap.c | 2 +- tools/testing/selftests/clone3/clone3.c | 2 +- .../testing/selftests/clone3/clone3_cap_checkpoint_restore.c | 2 +- tools/testing/selftests/clone3/clone3_clear_sighand.c | 2 +- tools/testing/selftests/clone3/clone3_selftests.h | 2 +- tools/testing/selftests/clone3/clone3_set_tid.c | 2 +- tools/testing/selftests/connector/proc_filter.c | 2 +- tools/testing/selftests/core/close_range_test.c | 2 +- tools/testing/selftests/core/unshare_test.c | 2 +- tools/testing/selftests/coredump/stackdump_test.c | 2 +- tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 2 +- tools/testing/selftests/drivers/dma-buf/udmabuf.c | 2 +- tools/testing/selftests/drivers/ntsync/ntsync.c | 2 +- .../testing/selftests/drivers/s390x/uvdevice/test_uvdevice.c | 2 +- tools/testing/selftests/exec/check-exec.c | 2 +- tools/testing/selftests/exec/execveat.c | 2 +- tools/testing/selftests/exec/load_address.c | 2 +- tools/testing/selftests/exec/non-regular.c | 2 +- tools/testing/selftests/exec/null-argv.c | 2 +- tools/testing/selftests/exec/recursion-depth.c | 2 +- tools/testing/selftests/fchmodat2/fchmodat2_test.c | 2 +- tools/testing/selftests/filelock/ofdlocks.c | 2 +- 
tools/testing/selftests/filesystems/anon_inode_test.c | 2 +- tools/testing/selftests/filesystems/binderfs/binderfs_test.c | 2 +- tools/testing/selftests/filesystems/devpts_pts.c | 2 +- tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c | 2 +- tools/testing/selftests/filesystems/eventfd/eventfd_test.c | 2 +- tools/testing/selftests/filesystems/file_stressor.c | 2 +- tools/testing/selftests/filesystems/kernfs_test.c | 2 +- .../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +- .../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +- tools/testing/selftests/filesystems/nsfs/iterate_mntns.c | 2 +- tools/testing/selftests/filesystems/overlayfs/dev_in_maps.c | 2 +- .../selftests/filesystems/overlayfs/set_layers_via_fds.c | 2 +- .../testing/selftests/filesystems/statmount/listmount_test.c | 2 +- .../testing/selftests/filesystems/statmount/statmount_test.c | 2 +- .../selftests/filesystems/statmount/statmount_test_ns.c | 2 +- tools/testing/selftests/filesystems/utils.c | 2 +- tools/testing/selftests/hid/hid_common.h | 2 +- tools/testing/selftests/intel_pstate/aperf.c | 2 +- tools/testing/selftests/iommu/iommufd_utils.h | 2 +- tools/testing/selftests/ipc/msgque.c | 2 +- tools/testing/selftests/ir/ir_loopback.c | 2 +- tools/testing/selftests/kcmp/kcmp_test.c | 2 +- tools/testing/selftests/kselftest_harness.h | 2 +- tools/testing/selftests/kselftest_harness/harness-selftest.c | 2 +- tools/testing/selftests/landlock/audit.h | 2 +- tools/testing/selftests/landlock/common.h | 2 +- tools/testing/selftests/lib.mk | 2 ++ tools/testing/selftests/lsm/lsm_get_self_attr_test.c | 2 +- tools/testing/selftests/lsm/lsm_list_modules_test.c | 2 +- tools/testing/selftests/lsm/lsm_set_self_attr_test.c | 2 +- tools/testing/selftests/media_tests/media_device_open.c | 2 +- tools/testing/selftests/media_tests/media_device_test.c | 2 +- tools/testing/selftests/membarrier/membarrier_test_impl.h | 2 +- tools/testing/selftests/mincore/mincore_selftest.c | 4 ++-- 
tools/testing/selftests/mm/compaction_test.c | 2 +- tools/testing/selftests/mm/cow.c | 2 +- tools/testing/selftests/mm/droppable.c | 2 +- tools/testing/selftests/mm/guard-regions.c | 2 +- tools/testing/selftests/mm/gup_longterm.c | 2 +- tools/testing/selftests/mm/gup_test.c | 2 +- tools/testing/selftests/mm/hmm-tests.c | 2 +- tools/testing/selftests/mm/hugepage-mmap.c | 2 +- tools/testing/selftests/mm/hugepage-mremap.c | 2 +- tools/testing/selftests/mm/hugetlb-madvise.c | 2 +- tools/testing/selftests/mm/hugetlb-read-hwpoison.c | 2 +- tools/testing/selftests/mm/hugetlb-soft-offline.c | 2 +- tools/testing/selftests/mm/hugetlb_dio.c | 2 +- tools/testing/selftests/mm/hugetlb_fault_after_madv.c | 2 +- tools/testing/selftests/mm/hugetlb_madv_vs_map.c | 2 +- tools/testing/selftests/mm/ksm_functional_tests.c | 2 +- tools/testing/selftests/mm/ksm_tests.c | 2 +- tools/testing/selftests/mm/madv_populate.c | 2 +- tools/testing/selftests/mm/map_fixed_noreplace.c | 2 +- tools/testing/selftests/mm/map_hugetlb.c | 2 +- tools/testing/selftests/mm/map_populate.c | 2 +- tools/testing/selftests/mm/mdwe_test.c | 2 +- tools/testing/selftests/mm/memfd_secret.c | 2 +- tools/testing/selftests/mm/merge.c | 2 +- tools/testing/selftests/mm/migration.c | 2 +- tools/testing/selftests/mm/mkdirty.c | 2 +- tools/testing/selftests/mm/mlock-random-test.c | 2 +- tools/testing/selftests/mm/mlock2-tests.c | 2 +- tools/testing/selftests/mm/mrelease_test.c | 2 +- tools/testing/selftests/mm/mremap_dontunmap.c | 2 +- tools/testing/selftests/mm/mremap_test.c | 2 +- tools/testing/selftests/mm/mseal_test.c | 2 +- tools/testing/selftests/mm/on-fault-limit.c | 2 +- tools/testing/selftests/mm/pagemap_ioctl.c | 2 +- tools/testing/selftests/mm/pfnmap.c | 2 +- tools/testing/selftests/mm/pkey-helpers.h | 2 +- tools/testing/selftests/mm/process_madv.c | 2 +- tools/testing/selftests/mm/soft-dirty.c | 2 +- tools/testing/selftests/mm/split_huge_page_test.c | 2 +- tools/testing/selftests/mm/thuge-gen.c | 2 +- 
tools/testing/selftests/mm/transhuge-stress.c | 2 +- tools/testing/selftests/mm/uffd-common.h | 2 +- tools/testing/selftests/mm/uffd-wp-mremap.c | 2 +- tools/testing/selftests/mm/va_high_addr_switch.c | 2 +- tools/testing/selftests/mm/virtual_address_range.c | 2 +- tools/testing/selftests/mm/vm_util.c | 2 +- tools/testing/selftests/mm/vm_util.h | 2 +- tools/testing/selftests/mount_setattr/mount_setattr_test.c | 2 +- .../move_mount_set_group/move_mount_set_group_test.c | 2 +- tools/testing/selftests/mqueue/mq_open_tests.c | 2 +- tools/testing/selftests/mqueue/mq_perf_tests.c | 2 +- .../selftests/mseal_system_mappings/sysmap_is_sealed.c | 4 ++-- tools/testing/selftests/nci/nci_dev.c | 2 +- tools/testing/selftests/net/af_unix/diag_uid.c | 2 +- tools/testing/selftests/net/af_unix/msg_oob.c | 2 +- tools/testing/selftests/net/af_unix/scm_inq.c | 2 +- tools/testing/selftests/net/af_unix/scm_pidfd.c | 2 +- tools/testing/selftests/net/af_unix/scm_rights.c | 2 +- tools/testing/selftests/net/af_unix/unix_connect.c | 2 +- tools/testing/selftests/net/bind_timewait.c | 2 +- tools/testing/selftests/net/bind_wildcard.c | 2 +- tools/testing/selftests/net/can/test_raw_filter.c | 2 +- tools/testing/selftests/net/cmsg_sender.c | 2 +- tools/testing/selftests/net/epoll_busy_poll.c | 2 +- tools/testing/selftests/net/gro.c | 2 +- tools/testing/selftests/net/ip_local_port_range.c | 2 +- tools/testing/selftests/net/ipsec.c | 2 +- tools/testing/selftests/net/netfilter/conntrack_dump_flush.c | 2 +- tools/testing/selftests/net/netlink-dumps.c | 2 +- tools/testing/selftests/net/proc_net_pktgen.c | 2 +- tools/testing/selftests/net/psock_fanout.c | 2 +- tools/testing/selftests/net/psock_tpacket.c | 2 +- tools/testing/selftests/net/reuseaddr_ports_exhausted.c | 2 +- tools/testing/selftests/net/reuseport_bpf.c | 2 +- tools/testing/selftests/net/reuseport_bpf_numa.c | 2 +- tools/testing/selftests/net/rxtimestamp.c | 2 +- tools/testing/selftests/net/sk_so_peek_off.c | 2 +- 
tools/testing/selftests/net/so_incoming_cpu.c | 2 +- tools/testing/selftests/net/socket.c | 2 +- tools/testing/selftests/net/tap.c | 2 +- tools/testing/selftests/net/tcp_ao/lib/setup.c | 2 +- tools/testing/selftests/net/tcp_fastopen_backup_key.c | 2 +- tools/testing/selftests/net/tls.c | 2 +- tools/testing/selftests/net/toeplitz.c | 2 +- tools/testing/selftests/net/tun.c | 2 +- tools/testing/selftests/net/udpgso_bench_tx.c | 2 +- tools/testing/selftests/openat2/helpers.h | 2 +- tools/testing/selftests/openat2/openat2_test.c | 2 +- tools/testing/selftests/openat2/rename_attack_test.c | 2 +- tools/testing/selftests/openat2/resolve_test.c | 2 +- tools/testing/selftests/pci_endpoint/pci_endpoint_test.c | 2 +- tools/testing/selftests/perf_events/mmap.c | 2 +- tools/testing/selftests/perf_events/remove_on_exec.c | 2 +- tools/testing/selftests/perf_events/sigtrap_threads.c | 2 +- tools/testing/selftests/perf_events/watermark_signal.c | 2 +- tools/testing/selftests/pid_namespace/pid_max.c | 2 +- tools/testing/selftests/pid_namespace/regression_enomem.c | 2 +- tools/testing/selftests/pidfd/pidfd.h | 2 +- tools/testing/selftests/pidfd/pidfd_bind_mount.c | 2 +- tools/testing/selftests/pidfd/pidfd_fdinfo_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_file_handle_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_getfd_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_info_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_open_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_poll_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_setattr_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_setns_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_wait.c | 2 +- tools/testing/selftests/pidfd/pidfd_xattr_test.c | 2 +- tools/testing/selftests/prctl/set-anon-vma-name-test.c | 2 +- tools/testing/selftests/prctl/set-process-name.c | 2 +- tools/testing/selftests/proc/proc-maps-race.c | 2 +- tools/testing/selftests/proc/proc-pid-vm.c | 2 +- 
tools/testing/selftests/ptrace/get_set_sud.c | 2 +- tools/testing/selftests/ptrace/get_syscall_info.c | 2 +- tools/testing/selftests/ptrace/set_syscall_info.c | 2 +- tools/testing/selftests/ptrace/vmaccess.c | 2 +- tools/testing/selftests/resctrl/resctrl.h | 2 +- tools/testing/selftests/ring-buffer/map_test.c | 2 +- tools/testing/selftests/riscv/abi/pointer_masking.c | 2 +- tools/testing/selftests/riscv/hwprobe/cbo.c | 2 +- tools/testing/selftests/riscv/hwprobe/hwprobe.c | 2 +- tools/testing/selftests/riscv/hwprobe/which-cpus.c | 2 +- tools/testing/selftests/riscv/mm/mmap_bottomup.c | 2 +- tools/testing/selftests/riscv/mm/mmap_default.c | 2 +- tools/testing/selftests/riscv/mm/mmap_test.h | 2 +- tools/testing/selftests/riscv/sigreturn/sigreturn.c | 2 +- tools/testing/selftests/riscv/vector/v_initval.c | 2 +- tools/testing/selftests/riscv/vector/vstate_prctl.c | 2 +- tools/testing/selftests/rseq/basic_percpu_ops_test.c | 2 +- tools/testing/selftests/rseq/rseq.c | 2 +- tools/testing/selftests/rtc/rtctest.c | 2 +- tools/testing/selftests/seccomp/seccomp_benchmark.c | 2 +- tools/testing/selftests/seccomp/seccomp_bpf.c | 2 +- tools/testing/selftests/sgx/main.c | 2 +- tools/testing/selftests/signal/mangle_uc_sigmask.c | 2 +- tools/testing/selftests/signal/sas.c | 2 +- tools/testing/selftests/sparc64/drivers/adi-test.c | 2 +- tools/testing/selftests/sync/Makefile | 2 +- tools/testing/selftests/sync/sync_test.c | 2 +- tools/testing/selftests/sync/synctest.h | 2 +- tools/testing/selftests/syscall_user_dispatch/sud_test.c | 2 +- tools/testing/selftests/tdx/tdx_guest_test.c | 2 +- tools/testing/selftests/timens/timens.h | 2 +- tools/testing/selftests/timers/adjtick.c | 2 +- tools/testing/selftests/timers/alarmtimer-suspend.c | 2 +- tools/testing/selftests/timers/change_skew.c | 2 +- tools/testing/selftests/timers/clocksource-switch.c | 2 +- tools/testing/selftests/timers/freq-step.c | 2 +- tools/testing/selftests/timers/inconsistency-check.c | 2 +- 
tools/testing/selftests/timers/leap-a-day.c | 2 +- tools/testing/selftests/timers/leapcrash.c | 2 +- tools/testing/selftests/timers/mqueue-lat.c | 2 +- tools/testing/selftests/timers/nanosleep.c | 2 +- tools/testing/selftests/timers/nsleep-lat.c | 2 +- tools/testing/selftests/timers/posix_timers.c | 2 +- tools/testing/selftests/timers/raw_skew.c | 2 +- tools/testing/selftests/timers/rtcpie.c | 2 +- tools/testing/selftests/timers/set-2038.c | 2 +- tools/testing/selftests/timers/set-tai.c | 2 +- tools/testing/selftests/timers/set-timer-lat.c | 2 +- tools/testing/selftests/timers/set-tz.c | 2 +- tools/testing/selftests/timers/skew_consistency.c | 2 +- tools/testing/selftests/timers/threadtest.c | 2 +- tools/testing/selftests/timers/valid-adjtimex.c | 2 +- tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c | 2 +- tools/testing/selftests/tty/tty_tstamp_update.c | 2 +- tools/testing/selftests/uevent/uevent_filtering.c | 2 +- tools/testing/selftests/user_events/abi_test.c | 2 +- tools/testing/selftests/user_events/dyn_test.c | 2 +- tools/testing/selftests/user_events/ftrace_test.c | 2 +- tools/testing/selftests/user_events/perf_test.c | 2 +- tools/testing/selftests/user_events/user_events_selftests.h | 2 +- tools/testing/selftests/vDSO/vdso_test_abi.c | 2 +- tools/testing/selftests/vDSO/vdso_test_chacha.c | 2 +- tools/testing/selftests/vDSO/vdso_test_clock_getres.c | 2 +- tools/testing/selftests/vDSO/vdso_test_correctness.c | 2 +- tools/testing/selftests/vDSO/vdso_test_getcpu.c | 2 +- tools/testing/selftests/vDSO/vdso_test_getrandom.c | 2 +- tools/testing/selftests/vDSO/vdso_test_gettimeofday.c | 2 +- tools/testing/selftests/x86/corrupt_xstate_header.c | 2 +- tools/testing/selftests/x86/helpers.h | 2 +- tools/testing/selftests/x86/lam.c | 2 +- tools/testing/selftests/x86/syscall_numbering.c | 2 +- tools/testing/selftests/x86/test_mremap_vdso.c | 2 +- tools/testing/selftests/x86/test_vsyscall.c | 2 +- tools/testing/selftests/x86/xstate.h | 2 +- 280 files changed, 286 
insertions(+), 280 deletions(-) -- 2.43.0

1 month

3
17

[PATCH v3 0/4] selftests/resctrl: Enable MBM and MBA tests on AMD

by Babu Moger

The MBM (Memory Bandwidth Monitoring) and MBA (Memory Bandwidth Allocation) features are not enabled for AMD systems. The reason was the lack of perf counters to compare the resctrl test results against.

Starting with commit 25e56847821f ("perf/x86/amd/uncore: Add memory controller support"), AMD now supports the UMC (Unified Memory Controller) perf events. These events can be used to compare the test results.

This series adds support to detect the UMC events and enables the MBM/MBA tests for AMD systems.

v3:
Note: Based the series on top of latest kselftests/master 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0 (tag: v6.10-rc1).
Also applied the patches from the series https://lore.kernel.org/lkml/20240531131142.1716-1-ilpo.jarvinen@linux.inte…
- Separated the fix patch.
- Renamed the imc to just mc to make it generic.
- Changed the search string "uncore_imc_" and "amd_umc_"
- Changes related to rebasing onto the latest kselftest tree.

v2: Changes.
a. Rebased on top of tip/master (Apr 25, 2024)
b. Addressed Ilpo's comments except the one about the close call. It seems clearer to keep READ and WRITE separate. https://lore.kernel.org/lkml/8e4badb7-6cc5-61f1-e041-d902209a90d5@linux.int…
c. Used ksft_perror call when applicable.
d. Added vendor check for non-contiguous CBM check.

v1: https://lore.kernel.org/lkml/cover.1708637563.git.babu.moger@amd.com/

Babu Moger (4):
  selftests/resctrl: Rename variables and functions to generic names
  selftests/resctrl: Pass sysfs controller name of the vendor
  selftests/resctrl: Add support for MBM and MBA tests on AMD
  selftests/resctrl: Enable MBA/MBA tests on AMD

 tools/testing/selftests/resctrl/mba_test.c | 25 +-
 tools/testing/selftests/resctrl/mbm_test.c | 23 +-
 tools/testing/selftests/resctrl/resctrl.h | 2 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 305 ++++++++++--------
 tools/testing/selftests/resctrl/resctrlfs.c | 2 +-
 5 files changed, 191 insertions(+), 166 deletions(-)

--
2.34.1
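
The changelog mentions per-vendor search strings ("uncore_imc_" and "amd_umc_") for locating memory controller perf events. The sketch below shows one way such a vendor-dependent prefix match could be written. It is an illustration only: `enum cpu_vendor`, `mc_pmu_prefix()` and `is_mc_pmu_dir()` are hypothetical names, not the identifiers used in the selftest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical vendor tag; the real selftest has its own detection. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD };

/* Prefix of the memory controller PMU names, taken from the changelog. */
const char *mc_pmu_prefix(enum cpu_vendor vendor)
{
	return vendor == VENDOR_AMD ? "amd_umc_" : "uncore_imc_";
}

/* True if a PMU directory name belongs to this vendor's memory controller. */
bool is_mc_pmu_dir(const char *name, enum cpu_vendor vendor)
{
	const char *prefix = mc_pmu_prefix(vendor);

	assert(name != NULL);
	return strncmp(name, prefix, strlen(prefix)) == 0;
}
```

A caller would typically walk the directory that lists perf event sources and keep every entry for which `is_mc_pmu_dir()` returns true. Keeping the prefix behind a single function is what lets the rest of the code use the generic "mc" naming from the v3 notes.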

1 month, 1 week

5
15

[PATCH v19 00/27] riscv control-flow integrity for usermode

by Deepak Gupta

Basics and overview
===================

Software with larger attack surfaces (e.g. network-facing apps like databases, browsers, or apps relying on browser runtimes) suffers from memory corruption issues that attackers can use to bend the control flow of the program and eventually gain control (by making their payload executable). Attackers are able to perform such attacks by leveraging call sites that rely on indirect calls, or return sites that rely on obtaining the return address from stack memory.

To mitigate such attacks, the risc-v extension zicfilp enforces that all indirect calls must land on a landing pad instruction `lpad`; otherwise the cpu raises a software check exception (a new cpu exception cause code on riscv). Similarly, for return flow, the risc-v extension zicfiss extends the architecture with:

- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if the comparison above was a mismatch
- Protection mechanism using which shadow stack is not writeable via regular store instructions

More information and details can be found at the extensions' github repo [1].

The equivalent of the landing pad (zicfilp) on x86 is the `ENDBRANCH` instruction in Intel CET [3], and on arm it is branch target identification (BTI) [4]. Similarly, x86's Intel CET has a shadow stack [5] and arm64 has the guarded control stack (GCS) [6], which are very similar to risc-v's zicfiss shadow stack.

x86 and arm64 support for user mode shadow stack is already in mainline.

Kernel awareness for user control flow integrity
================================================

This series picks up Samuel Holland's envcfg changes [2] as well. So if those are being applied independently, they should be removed from this series.
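
As a rough illustration of the zicfiss return-flow check listed under "Basics and overview", the following user-space toy model mimics what `sspush` and `sspopchk` do. It is only a sketch: the real mechanism is implemented in hardware, and the names `shadow_stack`, `ss_push()` and `ss_popchk()` are hypothetical. A `false` return value stands in for the software check exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SS_DEPTH 64	/* arbitrary capacity for the model */

/* Toy shadow stack: holds return addresses only. */
struct shadow_stack {
	uintptr_t slot[SS_DEPTH];
	size_t top;	/* number of entries currently stored */
};

/* Models sspush: record the return address at function entry. */
void ss_push(struct shadow_stack *ss, uintptr_t return_addr)
{
	assert(ss->top < SS_DEPTH);
	ss->slot[ss->top++] = return_addr;
}

/*
 * Models sspopchk: pop the saved address and compare it with the one
 * read from the regular stack. Returns false on a mismatch, which is
 * where the hardware would raise a software check exception.
 */
bool ss_popchk(struct shadow_stack *ss, uintptr_t addr_from_stack)
{
	assert(ss->top > 0);
	return ss->slot[--ss->top] == addr_from_stack;
}
```

If an attacker overwrites the return address on the regular stack, the value passed to `ss_popchk()` no longer matches the shadow copy and the check fails. The model leaves out the fourth bullet above, the protection that keeps regular stores from writing to the shadow stack, which is what makes the shadow copy trustworthy in the first place.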
Enabling:

In order to maintain compatibility and not break anything in user mode, the kernel doesn't enable the control flow integrity cpu extensions on a binary by default. Instead it exposes a prctl interface to enable, disable and lock the shadow stack or landing pad feature for a task. This allows userspace (the loader) to enumerate whether all objects in its address space are compiled with shadow stack and landing pad support and enable the feature accordingly. Additionally, if a subsequent `dlopen` happens on a library, user mode can decide again to disable the feature (if the incoming library is not compiled with support) OR terminate the task (if user mode policy strictly requires all objects in the address space to be compiled with the control flow integrity cpu feature). The prctl to enable shadow stack results in allocating a shadow stack from virtual memory and activating it for the user address space. x86 and arm64 follow the same direction for similar reason(s).

clone/fork:

On clone and fork, the cfi state of the task is inherited by the child. The shadow stack is part of virtual memory and is writeable memory from the kernel's perspective (writeable via a restricted set of instructions, aka shadow stack instructions). Thus kernel changes ensure that this memory is converted to read-only when fork/clone happens and COWed when a fault is taken due to sspush, sspopchk or ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled, the kernel will automatically allocate a shadow stack for that clone call.

map_shadow_stack:

x86 introduced the `map_shadow_stack` system call to allow user space to explicitly map shadow stack memory in its address space. It is useful for allocating shadow stacks for different contexts managed by a single thread (green threads or contexts). risc-v implements this system call as well.
signal management:

If shadow stack is enabled for a task, the kernel performs an asynchronous control flow diversion to deliver the signal and eventually expects userspace to issue sigreturn so that the original execution can be resumed. Even though the resume context is prepared by the kernel, it lives in user space memory and is subject to memory corruption. An attacker can use corruption bugs in this race window to perform an arbitrary sigreturn and eventually bypass the cfi mechanism. Another issue is how to ensure that cfi related state in the sigcontext area is not trampled by legacy apps or apps compiled with old kernel headers.

In order to mitigate control-flow hijacking, the kernel prepares a token and places it on the shadow stack before signal delivery, and places the address of the token in the sigcontext structure. During sigreturn, the kernel obtains the address of the token from the sigcontext structure, reads the token from the shadow stack, validates it, and only then allows sigreturn to succeed. The compatibility issue is solved by adopting the dynamic sigcontext management introduced for the vector extension. This series refactors the code a little to make future sigcontext management easier (as proposed by Andy Chiu from SiFive).

config and compilation:

Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this config option picks the kernel support for user control flow integrity. This option is presented only if the toolchain has shadow stack and landing pad support, and is guarded by toolchain support on purpose. The reason is that eventually the vDSO also needs to be compiled with shadow stack and landing pad support. vDSO compile patches are not included as of now because the landing pad labeling scheme is yet to settle for usermode runtime.
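
The token handshake described under "signal management" can be pictured with the toy model below. It is a sketch under loud assumptions and not the kernel's implementation: the token encoding in `make_token()` is invented for illustration, an array index stands in for the token's address, and all names (`shadow_stack_model`, `sigctx_model`, `deliver_signal()`, `sigreturn_allowed()`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEPTH 16	/* arbitrary capacity for the model */

struct shadow_stack_model {
	uintptr_t slot[MODEL_DEPTH];
	size_t top;		/* number of entries currently stored */
};

/* Stand-in for the cfi part of sigcontext: where the token was placed. */
struct sigctx_model {
	size_t token_index;
};

/* Invented encoding: a value derived from the slot the token occupies. */
static uintptr_t make_token(size_t index)
{
	return ((uintptr_t)index << 1) | 1u;
}

/* Before delivery: place a token on the shadow stack, remember where. */
void deliver_signal(struct shadow_stack_model *ss, struct sigctx_model *ctx)
{
	assert(ss->top < MODEL_DEPTH);
	ctx->token_index = ss->top;
	ss->slot[ss->top++] = make_token(ctx->token_index);
}

/* On sigreturn: only proceed if a valid token sits at the recorded spot. */
bool sigreturn_allowed(struct shadow_stack_model *ss,
		       const struct sigctx_model *ctx)
{
	if (ctx->token_index >= ss->top)
		return false;
	if (ss->slot[ctx->token_index] != make_token(ctx->token_index))
		return false;
	ss->top = ctx->token_index;	/* drop the token, restore old top */
	return true;
}
```

The point of the design is that the sigcontext lives in ordinary user memory and can be forged, while the shadow stack cannot be written with regular stores. A forged context that points at a slot holding a return address, or at nothing, fails validation in this model.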
To get more information on kernel interactions with respect to zicfilp and zicfiss, the patch series adds documentation for `zicfilp` and `zicfiss` in the following:

Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst

How to test this series
=======================

Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)

Qemu
----
Get the latest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)

Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic

Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain supports it.

$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)

In case you're building your own rootfs using the toolchain, please make sure you pick the following patch to ensure that the vDSO is compiled with lpad and shadow stack.

"arch/riscv: compile vdso with landing pad"

Branch where the above patch can be picked:
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1

Running
-------
Modify your qemu command to have:

-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true

vDSO related Opens (in the flux)
================================

I am listing these opens to lay out the plan and what to expect in future patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
---------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for a task to have shadow stack and landing pad support. This patch series doesn't enable that (yet). Enabling shadow stack support in vDSO should be straightforward (intend to do that in next versions of patch set). Enabling landing pad support in vDSO requires some collaboration with toolchain folks to follow a single label scheme for all object binaries. This is necessary to ensure that all indirect call-sites are setting the correct label and target landing pads are decorated with the same label scheme.

How many vDSOs
--------------
Shadow stack instructions are carved out of zimop (may be operations) and if the CPU doesn't implement zimop, they're illegal instructions. The kernel could be running on a CPU which may or may not implement zimop. Thus the kernel will have to carry 2 different vDSOs and expose the appropriate one depending on whether the CPU implements zimop or not.

References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/

To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org

changelog
---------
v19:
- riscv_nousercfi was `int`. changed it to unsigned long. Thanks to Alex Ghiti for reporting it. It was a bug.
- ELP is cleared on trap entry only when CONFIG_64BIT.
- restore ssp back on return to usermode was being done before `riscv_v_context_nesting_end` on trap exit path. If kernel shadow stack were enabled this would result in kernel operating on user shadow stack and panic (as I found in my testing of kcfi patch series). So fixed that.

v18:
- rebased on 6.16-rc1
- uprobe handling clears ELP in sstatus image in pt_regs
- vdso was missing shadow stack elf note for object files. added that. Additional asm file for vdso needed the elf marker flag. toolchain should complain if `-fcf-protection=full` and marker is missing for object generated from asm file. Asked toolchain folks to fix this. Although no reason to gate the merge on that.
- Split up compile options for march and fcf-protection in vdso Makefile
- CONFIG_RISCV_USER_CFI option is moved under "Kernel features" menu. Added `arch/riscv/configs/hardening.config` fragment which selects CONFIG_RISCV_USER_CFI

v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with "riscv/traps: Introduce software check exception and uprobe handling"
  https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/

v16:
- If FWFT is not implemented or returns error for shadow stack activation, then no_usercfi is set to disable shadow stack. Although this should be picked up by extension validation and activation. Fixed this bug for zicfilp and zicfiss both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to keep it off till we have more hardware availability with RVA23 profile and zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in asm-offsets.c error.

v15:
- Toolchain has been updated to include `-fcf-protection` flag. This exists for x86 as well. Updated kernel patches to compile vDSO and selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.

v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches. Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation is not disturbed with respect to member fields of thread_info in single cacheline.

v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently. updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace kselftest.
- cosmetic and grammatical changes to documentation.

v12:
- It seems like I had accidentally squashed arch agnostic indirect branch tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.

v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to `lpad 0`.

v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma".
  This patch is not that interesting to this patch series for risc-v. There are instances in arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure we add single vdso object with cfi enabled. But a vdso object with scheme of zero labeled landing pad is least common denominator and should work with all objects of zero labeled as well as function-signature labeled objects.

v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)

v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches. they are in palmer/for-next

v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv". Instead using `deactivate_mm` flow to clean up. see here for more context
  https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch "riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0. Any future evolution of this interface should accordingly define how arg should be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper `is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…

v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in `thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected (this was introduced due to re-factoring signal context management code)

v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst (style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base of shadow stack.
- Fixed warnings on definitions added in usercfi.h when CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using FWFT (https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
  (Note: I had an issue in my workflow due to which version number wasn't picked up correctly while sending out patches)

v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/

v3:
- envcfg
  logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been picked on per task basis, even though CPU didn't implement it. Fixed in this series.
- dt-bindings
  As suggested, split into separate commit. fixed the messaging that spec is in public review
- arch_is_shadow_stack change
  arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
  zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
  As suggested, added object and binary filenames to .gitignore. Selftest binary anyways need to be compiled with cfi enabled compiler which will make sure that landing pad and shadow stack are enabled. Thus removed separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/

v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack and indirect branch tracking. And implements them on riscv.

---
Changes in v19:
- Link to v18: https://lore.kernel.org/r/20250711-v5_user_cfi_series-v18-0-a8ee62f9f38e@ri…

Changes in v18:
- Link to v17: https://lore.kernel.org/r/20250604-v5_user_cfi_series-v17-0-4565c2cf869f@ri…

Changes in v17:
- Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri…

Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…

Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…

Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…

Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…

Changes in v12:
- changelog posted just below cover letter
- Link to v11:
  https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…

Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…

---
Andy Chiu (1):
      riscv: signal: abstract header saving for setup_sigcontext

Deepak Gupta (25):
      mm: VM_SHADOW_STACK definition for riscv
      dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
      riscv: zicfiss / zicfilp enumeration
      riscv: zicfiss / zicfilp extension csr and bit definitions
      riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
      riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
      riscv/mm: manufacture shadow stack pte
      riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs
      riscv/mm: write protect and shadow stack
      riscv/mm: Implement map_shadow_stack() syscall
      riscv/shstk: If needed allocate a new shadow stack on clone
      riscv: Implements arch agnostic shadow stack prctls
      prctl: arch-agnostic prctl for indirect branch tracking
      riscv: Implements arch agnostic indirect branch tracking prctls
      riscv/traps: Introduce software check exception and uprobe handling
      riscv/signal: save and restore of shadow stack for signal
      riscv/kernel: update __show_regs to print shadow stack register
      riscv/ptrace: riscv cfi status and state via ptrace and in core files
      riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
      riscv: kernel command line option to opt out of user cfi
      riscv: enable kernel access to shadow stack memory via FWFT sbi call
      riscv: create a config for shadow stack and landing pad instr support
      riscv: Documentation for landing pad / indirect branch tracking
      riscv: Documentation for shadow stack on riscv
      kselftest/riscv: kselftest for user mode cfi

Jim Shu (1):
      arch/riscv: compile vdso with landing pad and shadow stack note

 Documentation/admin-guide/kernel-parameters.txt | 8 +
 Documentation/arch/riscv/index.rst | 2 +
 Documentation/arch/riscv/zicfilp.rst | 115 +++++
 Documentation/arch/riscv/zicfiss.rst | 179 +++++++
 .../devicetree/bindings/riscv/extensions.yaml | 14 +
 arch/riscv/Kconfig | 21 +
 arch/riscv/Makefile | 5 +-
 arch/riscv/configs/hardening.config | 4 +
 arch/riscv/include/asm/asm-prototypes.h | 1 +
 arch/riscv/include/asm/assembler.h | 44 ++
 arch/riscv/include/asm/cpufeature.h | 12 +
 arch/riscv/include/asm/csr.h | 16 +
 arch/riscv/include/asm/entry-common.h | 2 +
 arch/riscv/include/asm/hwcap.h | 2 +
 arch/riscv/include/asm/mman.h | 26 +
 arch/riscv/include/asm/mmu_context.h | 7 +
 arch/riscv/include/asm/pgtable.h | 30 +-
 arch/riscv/include/asm/processor.h | 1 +
 arch/riscv/include/asm/thread_info.h | 3 +
 arch/riscv/include/asm/usercfi.h | 95 ++++
 arch/riscv/include/asm/vector.h | 3 +
 arch/riscv/include/uapi/asm/hwprobe.h | 2 +
 arch/riscv/include/uapi/asm/ptrace.h | 34 ++
 arch/riscv/include/uapi/asm/sigcontext.h | 1 +
 arch/riscv/kernel/Makefile | 1 +
 arch/riscv/kernel/asm-offsets.c | 10 +
 arch/riscv/kernel/cpufeature.c | 27 +
 arch/riscv/kernel/entry.S | 38 ++
 arch/riscv/kernel/head.S | 27 +
 arch/riscv/kernel/process.c | 27 +-
 arch/riscv/kernel/ptrace.c | 95 ++++
 arch/riscv/kernel/signal.c | 148 +++++-
 arch/riscv/kernel/sys_hwprobe.c | 2 +
 arch/riscv/kernel/sys_riscv.c | 10 +
 arch/riscv/kernel/traps.c | 54 ++
 arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
 arch/riscv/kernel/vdso/Makefile | 11 +-
 arch/riscv/kernel/vdso/flush_icache.S | 4 +
 arch/riscv/kernel/vdso/getcpu.S | 4 +
 arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
 arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
 arch/riscv/kernel/vdso/vgetrandom-chacha.S | 5 +-
 arch/riscv/mm/init.c | 2 +-
 arch/riscv/mm/pgtable.c | 16 +
 include/linux/cpu.h | 4 +
 include/linux/mm.h | 7 +
 include/uapi/linux/elf.h | 2 +
 include/uapi/linux/prctl.h | 27 +
 kernel/sys.c | 30 ++
 tools/testing/selftests/riscv/Makefile | 2 +-
 tools/testing/selftests/riscv/cfi/.gitignore | 3 +
 tools/testing/selftests/riscv/cfi/Makefile | 16 +
 tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
 tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
 tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
 tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
 56 files changed, 2389 insertions(+), 30 deletions(-)
---
base-commit: a2a05801de77ca5122fc34e3eb84d6359ef70389
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
