March 2024 - Linux-kselftest-mirror

[PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach

by Bernd Edlinger

This introduces signal->exec_bprm, which is used to fix the case when at least one of the sibling threads is traced, and therefore the trace process may dead-lock in ptrace_attach, but de_thread will need to wait for the tracer to continue execution. The solution is to detect this situation and allow ptrace_attach to continue by temporarily releasing the cred_guard_mutex, while de_thread() is still waiting for traced zombies to be eventually released by the tracer. In the case of the thread group leader we only have to wait for the thread to become a zombie, which may also need co-operation from the tracer due to PTRACE_O_TRACEEXIT. When a tracer wants to ptrace_attach a task that already is in execve, we simply retry the ptrace_may_access check while temporarily installing the new credentials and dumpability which are about to be used after execve completes. If the ptrace_attach happens on a thread that is a sibling-thread of the thread doing execve, it is sufficient to check against the old credentials, as this thread will be waited for, before the new credentials are installed. Other threads die quickly since the cred_guard_mutex is released, but a deadly signal is already pending. In case the mutex_lock_killable misses the signal, the non-zero current->signal->exec_bprm makes sure they release the mutex immediately and return with -ERESTARTNOINTR. This means there is no API change, unlike the previous version of this patch which was discussed here: https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d… See tools/testing/selftests/ptrace/vmaccess.c for a test case that gets fixed by this change. Note that since the test case was originally designed to test the ptrace_attach returning an error in this situation, the test expectation needed to be adjusted, to allow the API to succeed at the first attempt. Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de> --- fs/exec.c | 69 ++++++++++++++++------- fs/proc/base.c | 6 ++ include/linux/cred.h | 1 + include/linux/sched/signal.h | 18 ++++++ kernel/cred.c | 28 +++++++-- kernel/ptrace.c | 32 +++++++++++ kernel/seccomp.c | 12 +++- tools/testing/selftests/ptrace/vmaccess.c | 23 +++++--- 8 files changed, 155 insertions(+), 34 deletions(-) v10: Changes to previous version, make the PTRACE_ATTACH retun -EAGAIN, instead of execve return -ERESTARTSYS. Added some lessions learned to the description. v11: Check old and new credentials in PTRACE_ATTACH again without changing the API. Note: I got actually one response from an automatic checker to the v11 patch, https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/ which is complaining about: >> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@ 417 struct linux_binprm *bprm = task->signal->exec_bprm; 418 const struct cred *old_cred; 419 struct mm_struct *old_mm; 420 421 retval = down_write_killable(&task->signal->exec_update_lock); 422 if (retval) 423 goto unlock_creds; 424 task_lock(task); > 425 old_cred = task->real_cred; v12: Essentially identical to v11. - Fixed a minor merge conflict in linux v5.17, and fixed the above mentioned nit by adding __rcu to the declaration. - re-tested the patch with all linux versions from v5.11 to v6.6 v10 was an alternative approach which did imply an API change. But I would prefer to avoid such an API change. The difficult part is getting the right dumpability flags assigned before de_thread starts, hope you like this version. If not, the v10 is of course also acceptable. Thanks Bernd. diff --git a/fs/exec.c b/fs/exec.c index 2f2b0acec4f0..902d3b230485 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm) return 0; } -static int de_thread(struct task_struct *tsk) +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; spinlock_t *lock = &oldsighand->siglock; + struct task_struct *t = tsk; + bool unsafe_execve_in_progress = false; if (thread_group_empty(tsk)) goto no_thread_group; @@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) sig->notify_count--; + while_each_thread(tsk, t) { + if (unlikely(t->ptrace) + && (t != tsk->group_leader || !t->exit_state)) + unsafe_execve_in_progress = true; + } + + if (unlikely(unsafe_execve_in_progress)) { + spin_unlock_irq(lock); + sig->exec_bprm = bprm; + mutex_unlock(&sig->cred_guard_mutex); + spin_lock_irq(lock); + } + while (sig->notify_count) { __set_current_state(TASK_KILLABLE); spin_unlock_irq(lock); @@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } + if (unlikely(unsafe_execve_in_progress)) { + mutex_lock(&sig->cred_guard_mutex); + sig->exec_bprm = NULL; + } + sig->group_exec_task = NULL; sig->notify_count = 0; @@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk) return 0; killed: + if (unlikely(unsafe_execve_in_progress)) { + mutex_lock(&sig->cred_guard_mutex); + sig->exec_bprm = NULL; + } + /* protects against exit_notify() and __exit_signal() */ read_lock(&tasklist_lock); sig->group_exec_task = NULL; @@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) return retval; + /* If the binary is not readable then enforce mm->dumpable=0 */ + would_dump(bprm, bprm->file); + if (bprm->have_execfd) + would_dump(bprm, bprm->executable); + + /* + * Figure out dumpability. Note that this checking only of current + * is wrong, but userspace depends on it. This should be testing + * bprm->secureexec instead. + */ + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || + is_dumpability_changed(current_cred(), bprm->cred) || + !(uid_eq(current_euid(), current_uid()) && + gid_eq(current_egid(), current_gid()))) + set_dumpable(bprm->mm, suid_dumpable); + else + set_dumpable(bprm->mm, SUID_DUMP_USER); + /* * Ensure all future errors are fatal. */ @@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm) /* * Make this the only thread in the thread group. */ - retval = de_thread(me); + retval = de_thread(me, bprm); if (retval) goto out; @@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; - /* If the binary is not readable then enforce mm->dumpable=0 */ - would_dump(bprm, bprm->file); - if (bprm->have_execfd) - would_dump(bprm, bprm->executable); - /* * Release all of the old mmap stuff */ @@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm) me->sas_ss_sp = me->sas_ss_size = 0; - /* - * Figure out dumpability. Note that this checking only of current - * is wrong, but userspace depends on it. This should be testing - * bprm->secureexec instead. - */ - if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || - !(uid_eq(current_euid(), current_uid()) && - gid_eq(current_egid(), current_gid()))) - set_dumpable(current->mm, suid_dumpable); - else - set_dumpable(current->mm, SUID_DUMP_USER); - perf_event_exec(); __set_task_comm(me, kbasename(bprm->filename), true); @@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm) if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) return -ERESTARTNOINTR; + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + return -ERESTARTNOINTR; + } + bprm->cred = prepare_exec_creds(); if (likely(bprm->cred)) return 0; diff --git a/fs/proc/base.c b/fs/proc/base.c index ffd54617c354..0da9adfadb48 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf, if (rv < 0) goto out_free; + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + rv = -ERESTARTNOINTR; + goto out_free; + } + rv = security_setprocattr(PROC_I(inode)->op.lsm, file->f_path.dentry->d_name.name, page, count); diff --git a/include/linux/cred.h b/include/linux/cred.h index f923528d5cc4..b01e309f5686 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *); extern struct cred *cred_alloc_blank(void); extern struct cred *prepare_creds(void); extern struct cred *prepare_exec_creds(void); +extern bool is_dumpability_changed(const struct cred *, const struct cred *); extern int commit_creds(struct cred *); extern void abort_creds(struct cred *); extern const struct cred *override_creds(const struct cred *); diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 0014d3adaf84..14df7073a0a8 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -234,9 +234,27 @@ struct signal_struct { struct mm_struct *oom_mm; /* recorded mm when the thread group got * killed by the oom killer */ + struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access + * against new credentials while + * de_thread is waiting for other + * traced threads to terminate. + * Set while de_thread is executing. + * The cred_guard_mutex is released + * after de_thread() has called + * zap_other_threads(), therefore + * a fatal signal is guaranteed to be + * already pending in the unlikely + * event, that + * current->signal->exec_bprm happens + * to be non-zero after the + * cred_guard_mutex was acquired. + */ + struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations * (notably. ptrace) + * Held while execve runs, except when + * a sibling thread is being traced. * Deprecated do not use in new code. * Use exec_update_lock instead. */ diff --git a/kernel/cred.c b/kernel/cred.c index 98cb4eca23fb..586cb6c7cf6b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset) return false; } +/** + * is_dumpability_changed - Will changing creds from old to new + * affect the dumpability in commit_creds? + * + * Return: false - dumpability will not be changed in commit_creds. + * Return: true - dumpability will be changed to non-dumpable. + * + * @old: The old credentials + * @new: The new credentials + */ +bool is_dumpability_changed(const struct cred *old, const struct cred *new) +{ + if (!uid_eq(old->euid, new->euid) || + !gid_eq(old->egid, new->egid) || + !uid_eq(old->fsuid, new->fsuid) || + !gid_eq(old->fsgid, new->fsgid) || + !cred_cap_issubset(old, new)) + return true; + + return false; +} + /** * commit_creds - Install new credentials upon the current task * @new: The credentials to be assigned @@ -467,11 +489,7 @@ int commit_creds(struct cred *new) get_cred(new); /* we will require a ref for the subj creds too */ /* dumpability changes */ - if (!uid_eq(old->euid, new->euid) || - !gid_eq(old->egid, new->egid) || - !uid_eq(old->fsuid, new->fsuid) || - !gid_eq(old->fsgid, new->fsgid) || - !cred_cap_issubset(old, new)) { + if (is_dumpability_changed(old, new)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); task->pdeath_signal = 0; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 443057bee87c..eb1c450bb7d7 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -20,6 +20,7 @@ #include <linux/pagemap.h> #include <linux/ptrace.h> #include <linux/security.h> +#include <linux/binfmts.h> #include <linux/signal.h> #include <linux/uio.h> #include <linux/audit.h> @@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request, if (retval) goto unlock_creds; + if (unlikely(task->in_execve)) { + struct linux_binprm *bprm = task->signal->exec_bprm; + const struct cred __rcu *old_cred; + struct mm_struct *old_mm; + + retval = down_write_killable(&task->signal->exec_update_lock); + if (retval) + goto unlock_creds; + task_lock(task); + old_cred = task->real_cred; + old_mm = task->mm; + rcu_assign_pointer(task->real_cred, bprm->cred); + task->mm = bprm->mm; + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS); + rcu_assign_pointer(task->real_cred, old_cred); + task->mm = old_mm; + task_unlock(task); + up_write(&task->signal->exec_update_lock); + if (retval) + goto unlock_creds; + } + write_lock_irq(&tasklist_lock); retval = -EPERM; if (unlikely(task->exit_state)) @@ -508,6 +531,14 @@ static int ptrace_traceme(void) { int ret = -EPERM; + if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) + return -ERESTARTNOINTR; + + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + return -ERESTARTNOINTR; + } + write_lock_irq(&tasklist_lock); /* Are we already being traced? */ if (!current->ptrace) { @@ -523,6 +554,7 @@ static int ptrace_traceme(void) } } write_unlock_irq(&tasklist_lock); + mutex_unlock(&current->signal->cred_guard_mutex); return ret; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 255999ba9190..b29bbfa0b044 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags, * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. */ - if (flags & SECCOMP_FILTER_FLAG_TSYNC && - mutex_lock_killable(&current->signal->cred_guard_mutex)) - goto out_put_fd; + if (flags & SECCOMP_FILTER_FLAG_TSYNC) { + if (mutex_lock_killable(&current->signal->cred_guard_mutex)) + goto out_put_fd; + + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + goto out_put_fd; + } + } spin_lock_irq(&current->sighand->siglock); diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c index 4db327b44586..3b7d81fb99bb 100644 --- a/tools/testing/selftests/ptrace/vmaccess.c +++ b/tools/testing/selftests/ptrace/vmaccess.c @@ -39,8 +39,15 @@ TEST(vmaccess) f = open(mm, O_RDONLY); ASSERT_GE(f, 0); close(f); - f = kill(pid, SIGCONT); - ASSERT_EQ(f, 0); + f = waitpid(-1, NULL, 0); + ASSERT_NE(f, -1); + ASSERT_NE(f, 0); + ASSERT_NE(f, pid); + f = waitpid(-1, NULL, 0); + ASSERT_EQ(f, pid); + f = waitpid(-1, NULL, 0); + ASSERT_EQ(f, -1); + ASSERT_EQ(errno, ECHILD); } TEST(attach) @@ -57,22 +64,24 @@ TEST(attach) sleep(1); k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); - ASSERT_EQ(errno, EAGAIN); - ASSERT_EQ(k, -1); + ASSERT_EQ(k, 0); k = waitpid(-1, &s, WNOHANG); ASSERT_NE(k, -1); ASSERT_NE(k, 0); ASSERT_NE(k, pid); ASSERT_EQ(WIFEXITED(s), 1); ASSERT_EQ(WEXITSTATUS(s), 0); - sleep(1); - k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); + k = waitpid(-1, &s, 0); + ASSERT_EQ(k, pid); + ASSERT_EQ(WIFSTOPPED(s), 1); + ASSERT_EQ(WSTOPSIG(s), SIGTRAP); + k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGSTOP); - k = ptrace(PTRACE_DETACH, pid, 0L, 0L); + k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); -- 2.39.2

1 day, 14 hours

16
69
0 0

[PATCH v7 0/6] mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC

by jeffxu＠chromium.org

From: Jeff Xu <jeffxu(a)google.com> Since Linux introduced the memfd feature, memfd have always had their execute bit set, and the memfd_create() syscall doesn't allow setting it differently. However, in a secure by default system, such as ChromeOS, (where all executables should come from the rootfs, which is protected by Verified boot), this executable nature of memfd opens a door for NoExec bypass and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm process created a memfd to share the content with an external process, however the memfd is overwritten and used for executing arbitrary code and root escalation. [2] lists more VRP in this kind. On the other hand, executable memfd has its legit use, runc uses memfd’s seal and executable feature to copy the contents of the binary then execute them, for such system, we need a solution to differentiate runc's use of executable memfds and an attacker's [3]. To address those above, this set of patches add following: 1> Let memfd_create() set X bit at creation time. 2> Let memfd to be sealed for modifying X bit. 3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of X bit.For example, if a container has vm.memfd_noexec=2, then memfd_create() without MFD_NOEXEC_SEAL will be rejected. 4> A new security hook in memfd_create(). This make it possible to a new LSM, which rejects or allows executable memfd based on its security policy. Change history: v7: - patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c). - patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of global ns (pid_sysctl.h). Add a tab (pid_namespace.h). - patch 5/6: remove #ifdef (memfd_test.c) - patch 6/6: remove unneeded security_move_mount(security.c). v6:https://lore.kernel.org/lkml/20221206150233.1963717-1-jeffxu@google.com/ - Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h v5:https://lore.kernel.org/lkml/20221206152358.1966099-1-jeffxu@google.com/ - Pass vm.memfd_noexec from current ns to child ns. - Fix build issue detected by kernel test robot. - Add missing security.c v3:https://lore.kernel.org/lkml/20221202013404.163143-1-jeffxu@google.com/ - Address API design comments in v2. - Let memfd_create() to set X bit at creation time. - A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit. - A new security hook in memfd_create(). v2:https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/ - address comments in V1. - add sysctl (vm.mfd_noexec) to set the default file permissions of memfd_create to be non-executable. v1:https://lwn.net/Articles/890096/ [1] https://crbug.com/1305411 [2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20me… [3] https://lwn.net/Articles/781013/ Daniel Verkamp (2): mm/memfd: add F_SEAL_EXEC selftests/memfd: add tests for F_SEAL_EXEC Jeff Xu (4): mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC mm/memfd: security hook for memfd_create include/linux/lsm_hook_defs.h | 1 + include/linux/lsm_hooks.h | 4 + include/linux/pid_namespace.h | 19 ++ include/linux/security.h | 6 + include/uapi/linux/fcntl.h | 1 + include/uapi/linux/memfd.h | 4 + kernel/pid_namespace.c | 5 + kernel/pid_sysctl.h | 59 ++++ mm/memfd.c | 61 +++- mm/shmem.c | 6 + security/security.c | 5 + tools/testing/selftests/memfd/fuse_test.c | 1 + tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++- 13 files changed, 510 insertions(+), 3 deletions(-) create mode 100644 kernel/pid_sysctl.h base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8 -- 2.39.0.rc1.256.g54fd8350bd-goog

2 months, 2 weeks

9
25
0 0

[RFC PATCH 00/11] New KVM ioctl to link a gmem inode to a new gmem file

by Ackerley Tng

Hello, This patchset builds upon the code at https://lore.kernel.org/lkml/20230718234512.1690985-1-seanjc@google.com/T/. This code is available at https://github.com/googleprodkernel/linux-cc/tree/kvm-gmem-link-migrate-rfc…. In guest_mem v11, a split file/inode model was proposed, where memslot bindings belong to the file and pages belong to the inode. This model lends itself well to having different VMs use separate files pointing to the same inode. This RFC proposes an ioctl, KVM_LINK_GUEST_MEMFD, that takes a VM and a gmem fd, and returns another gmem fd referencing a different file and associated with VM. This RFC also includes an update to KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to migrate memory context (slot->arch.lpage_info and kvm->mem_attr_array) from source to destination vm, intra-host. Intended usage of the two ioctls: 1. Source VM’s fd is passed to destination VM via unix sockets 2. Destination VM uses new ioctl KVM_LINK_GUEST_MEMFD to link source VM’s fd to a new fd. 3. Destination VM will pass new fds to KVM_SET_USER_MEMORY_REGION, which will bind the new file, pointing to the same inode that the source VM’s file points to, to memslots 4. Use KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to move kvm->mem_attr_array and slot->arch.lpage_info to the destination VM. 5. Run the destination VM as per normal Some other approaches considered were: + Using the linkat() syscall, but that requires a mount/directory for a source fd to be linked to + Using the dup() syscall, but that only duplicates the fd, and both fds point to the same file --- Ackerley Tng (11): KVM: guest_mem: Refactor out kvm_gmem_alloc_file() KVM: guest_mem: Add ioctl KVM_LINK_GUEST_MEMFD KVM: selftests: Add tests for KVM_LINK_GUEST_MEMFD ioctl KVM: selftests: Test transferring private memory to another VM KVM: x86: Refactor sev's flag migration_in_progress to kvm struct KVM: x86: Refactor common code out of sev.c KVM: x86: Refactor common migration preparation code out of sev_vm_move_enc_context_from KVM: x86: Let moving encryption context be configurable KVM: x86: Handle moving of memory context for intra-host migration KVM: selftests: Generalize migration functions from sev_migrate_tests.c KVM: selftests: Add tests for migration of private mem arch/x86/include/asm/kvm_host.h | 4 +- arch/x86/kvm/svm/sev.c | 85 ++----- arch/x86/kvm/svm/svm.h | 3 +- arch/x86/kvm/x86.c | 221 +++++++++++++++++- arch/x86/kvm/x86.h | 6 + include/linux/kvm_host.h | 18 ++ include/uapi/linux/kvm.h | 8 + tools/testing/selftests/kvm/Makefile | 1 + .../testing/selftests/kvm/guest_memfd_test.c | 42 ++++ .../selftests/kvm/include/kvm_util_base.h | 31 +++ .../kvm/x86_64/private_mem_migrate_tests.c | 93 ++++++++ .../selftests/kvm/x86_64/sev_migrate_tests.c | 48 ++-- virt/kvm/guest_mem.c | 151 ++++++++++-- virt/kvm/kvm_main.c | 10 + virt/kvm/kvm_mm.h | 7 + 15 files changed, 596 insertions(+), 132 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_migrate_tests.c -- 2.41.0.640.ga95def55d0-goog

7 months

3
15
0 0

[PATCH v6 0/5] userfaultfd move option

by Suren Baghdasaryan

This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has long been implemented and maintained by Andrea in his local tree [1], but was not upstreamed due to lack of use cases where this approach would be better than allocating a new page and copying the contents. Previous upstraming attempts could be found at [6] and [7]. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [3]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. TODOs for follow-up improvements: - cross-mm support. Known differences from single-mm and missing pieces: - memcg recharging (might need to isolate pages in the process) - mm counters - cross-mm deposit table moves - cross-mm test - document the address space where src and dest reside in struct uffdio_move - TLB flush batching. Will require extensive changes to PTL locking in move_pages_pte(). OTOH that might let us reuse parts of mremap code. Changes since v5 [10]: - added logic to split large folios in move_pages_pte(), per David Hildenbrand - added check for PAE before split_huge_pmd() to avoid the split if the move operation can't be done - replaced calls to default_huge_page_size() with read_pmd_pagesize() in uffd_move_pmd test, per David Hildenbrand - fixed the condition in uffd_move_test_common() checking if area alignment is needed Changes since v4 [9]: - added Acked-by in patch 1, per Peter Xu - added description for ctx, mm and mode parameters of move_pages(), per kernel test robot - added Reviewed-by's, per Peter Xu and Axel Rasmussen - removed unused operations in uffd_test_case_ops - refactored uffd-unit-test changes to avoid using global variables and handle pmd moves without page size overrides, per Peter Xu Changes since v3 [8]: - changed retry path in folio_lock_anon_vma_read() to unlock and then relock RCU, per Peter Xu - removed cross-mm support from initial patchset, per David Hildenbrand - replaced BUG_ONs with VM_WARN_ON or WARN_ON_ONCE, per David Hildenbrand - added missing cache flushing, per Lokesh Gidra and Peter Xu - updated manpage text in the patch description, per Peter Xu - renamed internal functions from "remap" to "move", per Peter Xu - added mmap_changing check after taking mmap_lock, per Peter Xu - changed uffd context check to ensure dst_mm is registered onto uffd we are operating on, Peter Xu and David Hildenbrand - changed to non-maybe variants of maybe*_mkwrite(), per David Hildenbrand - fixed warning for CONFIG_TRANSPARENT_HUGEPAGE=n, per kernel test robot - comments cleanup, per David Hildenbrand and Peter Xu - checks for VM_IO,VM_PFNMAP,VM_HUGETLB,..., per David Hildenbrand - prevent moving pinned pages, per Peter Xu - changed uffd tests to call move uffd_test_ctx_clear() at the end of the test run instead of in the beginning of the next run - added support for testcase-specific ops - added test for moving PMD-aligned blocks Changes since v2 [5]: - renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand - rebase over mm-unstable to use folio_move_anon_rmap(), per David Hildenbrand - added text for manpage explaining DONTFORK and KSM requirements for this feature, per David Hildenbrand - check for anon_vma changes in the fast path of folio_lock_anon_vma_read, per Peter Xu - updated the title and description of the first patch, per David Hildenbrand - updating comments in folio_lock_anon_vma_read() explaining the need for anon_vma checks, per David Hildenbrand - changed all mapcount checks to PageAnonExclusive, per Jann Horn and David Hildenbrand - changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS, per Jann Horn - added a check for PTE change after folio is locked in remap_pages_pte(), per Jann Horn - added handling of PMD migration entries and bailout when pmd_devmap(), per Jann Horn - added checks to ensure both src and dst VMAs are writable, per Peter Xu - added UFFD_FEATURE_MOVE, per Peter Xu - removed obsolete comments, per Peter Xu - renamed remap_anon_pte to remap_present_pte, per Peter Xu - added a comment for folio_get_anon_vma() explaining the need for anon_vma checks, per Peter Xu - changed error handling in remap_pages() to make it more clear, per Peter Xu - changed EFAULT to EAGAIN to retry when a hugepage appears or disappears from under us, per Peter Xu - added links to previous upstreaming attempts, per David Hildenbrand Changes since v1 [4]: - add mmget_not_zero in userfaultfd_remap, per Jann Horn - removed extern from function definitions, per Matthew Wilcox - converted to folios in remap_pages_huge_pmd, per Matthew Wilcox - use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand - handle pgtable transfers between MMs, per Jann Horn - ignore concurrent A/D pte bit changes, per Jann Horn - split functions into smaller units, per David Hildenbrand - test for folio_test_large in remap_anon_pte, per Matthew Wilcox - use pte_swp_exclusive for swapcount check, per David Hildenbrand - eliminated use of mmu_notifier_invalidate_range_start_nonblock, per Jann Horn - simplified THP alignment checks, per Jann Horn - refactored the loop inside remap_pages, per Jann Horn - additional clarifying comments, per Jann Horn Main changes since Andrea's last version [1]: - Trivial translations from page to folio, mmap_sem to mmap_lock - Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its possible failure - Move pte mapping into remap_pages_pte to allow for retries when source page or anon_vma is contended. Since pte_offset_map_nolock() start RCU read section, we can't block anymore after mapping a pte, so have to unmap the ptesm do the locking and retry. - Add and use anon_vma_trylock_write() to avoid blocking while in RCU read section. - Accommodate changes in mmu_notifier_range_init() API, switch to mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in RCU read section. - Open-code now removed __swp_swapcount() - Replace pmd_read_atomic() with pmdp_get_lockless() - Add new selftest for UFFDIO_MOVE [1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc… [2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha… [3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj… [4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/ [5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/ [6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh… [7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed… [8] https://lore.kernel.org/all/20231009064230.2952396-1-surenb@google.com/ [9] https://lore.kernel.org/all/20231028003819.652322-1-surenb@google.com/ [10] https://lore.kernel.org/all/20231121171643.3719880-1-surenb@google.com/ Andrea Arcangeli (2): mm/rmap: support move to different root anon_vma in folio_move_anon_rmap() userfaultfd: UFFDIO_MOVE uABI Suren Baghdasaryan (3): selftests/mm: call uffd_test_ctx_clear at the end of the test selftests/mm: add uffd_test_case_ops to allow test case-specific operations selftests/mm: add UFFDIO_MOVE ioctl test Documentation/admin-guide/mm/userfaultfd.rst | 3 + fs/userfaultfd.c | 72 +++ include/linux/rmap.h | 5 + include/linux/userfaultfd_k.h | 11 + include/uapi/linux/userfaultfd.h | 29 +- mm/huge_memory.c | 122 ++++ mm/khugepaged.c | 3 + mm/rmap.c | 30 + mm/userfaultfd.c | 614 +++++++++++++++++++ tools/testing/selftests/mm/uffd-common.c | 39 +- tools/testing/selftests/mm/uffd-common.h | 9 + tools/testing/selftests/mm/uffd-stress.c | 5 +- tools/testing/selftests/mm/uffd-unit-tests.c | 192 ++++++ 13 files changed, 1130 insertions(+), 4 deletions(-) -- 2.43.0.rc2.451.g8631bc7472-goog

7 months, 2 weeks

7
43
0 0

[PATCH v2 00/25] Enable FRED with KVM VMX

by Xin Li

This patch set enables the Intel flexible return and event delivery (FRED) architecture with KVM VMX to allow guests to utilize FRED. The FRED architecture defines simple new transitions that change privilege level (ring transitions). The FRED architecture was designed with the following goals: 1) Improve overall performance and response time by replacing event delivery through the interrupt descriptor table (IDT event delivery) and event return by the IRET instruction with lower latency transitions. 2) Improve software robustness by ensuring that event delivery establishes the full supervisor context and that event return establishes the full user context. The new transitions defined by the FRED architecture are FRED event delivery and, for returning from events, two FRED return instructions. FRED event delivery can effect a transition from ring 3 to ring 0, but it is used also to deliver events incident to ring 0. One FRED instruction (ERETU) effects a return from ring 0 to ring 3, while the other (ERETS) returns while remaining in ring 0. Collectively, FRED event delivery and the FRED return instructions are FRED transitions. Intel VMX architecture is extended to run FRED guests, and the major changes are: 1) New VMCS fields for FRED context management, which includes two new event data VMCS fields, eight new guest FRED context VMCS fields and eight new host FRED context VMCS fields. 2) VMX nested-exception support for proper virtualization of stack levels introduced with FRED architecture. Search for the latest FRED spec in most search engines with this search pattern: site:intel.com FRED (flexible return and event delivery) specification As the native FRED patches are committed in the tip tree "x86/fred" branch: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/fred, and we have received a good amount of review comments for v1, it's time to send out v2 based on this branch for further help from the community. Patch 1-2 are cleanups to VMX basic and misc MSRs, which were sent out earlier as a preparation for FRED changes: https://lore.kernel.org/kvm/20240206182032.1596-1-xin3.li@intel.com/T/#u Patch 3-15 add FRED support to VMX. Patch 16-21 add FRED support to nested VMX. Patch 22 exposes FRED and its baseline features to KVM guests. Patch 23-25 add FRED selftests. There is also a counterpart qemu patch set for FRED at: https://lore.kernel.org/qemu-devel/20231109072012.8078-1-xin3.li@intel.com/…, which works with this patch set to allow KVM to run FRED guests. Changes since v1: * Always load the secondary VM exit controls (Sean Christopherson). * Remove FRED VM entry/exit controls consistency checks in setup_vmcs_config() (Sean Christopherson). * Clear FRED VM entry/exit controls if FRED is not enumerated (Chao Gao). * Use guest_can_use() to trace FRED enumeration in a vcpu (Chao Gao). * Enable FRED MSRs intercept if FRED is no longer enumerated in CPUID (Chao Gao). * Move guest FRED states init into __vmx_vcpu_reset() (Chao Gao). * Don't use guest_cpuid_has() in vmx_prepare_switch_to_{host,guest}(), which are called from IRQ-disabled context (Chao Gao). * Reset msr_guest_fred_rsp0 in __vmx_vcpu_reset() (Chao Gao). * Fail host requested FRED MSRs access if KVM cannot virtualize FRED (Chao Gao). * Handle the case FRED MSRs are valid but KVM cannot virtualize FRED (Chao Gao). * Add sanity checks when writing to FRED MSRs. * Explain why it is ok to only check CR4.FRED in kvm_is_fred_enabled() (Chao Gao). * Document event data should be equal to CR2/DR6/IA32_XFD_ERR instead of using WARN_ON() (Chao Gao). * Zero event data if a #NM was not caused by extended feature disable (Chao Gao). * Set the nested flag when there is an original interrupt (Chao Gao). * Dump guest FRED states only if guest has FRED enabled (Nikolay Borisov). * Add a prerequisite to SHADOW_FIELD_R[OW] macros * Remove hyperv TLFS related changes (Jeremi Piotrowski). * Use kvm_cpu_cap_has() instead of cpu_feature_enabled() to decouple KVM's capability to virtualize a feature and host's enabling of a feature (Chao Gao). Xin Li (25): KVM: VMX: Cleanup VMX basic information defines and usages KVM: VMX: Cleanup VMX misc information defines and usages KVM: VMX: Add support for the secondary VM exit controls KVM: x86: Mark CR4.FRED as not reserved KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID KVM: VMX: Set intercept for FRED MSRs KVM: VMX: Initialize VMCS FRED fields KVM: VMX: Switch FRED RSP0 between host and guest KVM: VMX: Add support for FRED context save/restore KVM: x86: Add kvm_is_fred_enabled() KVM: VMX: Handle FRED event data KVM: VMX: Handle VMX nested exception for FRED KVM: VMX: Disable FRED if FRED consistency checks fail KVM: VMX: Dump FRED context in dump_vmcs() KVM: VMX: Invoke vmx_set_cpu_caps() before nested setup KVM: nVMX: Add support for the secondary VM exit controls KVM: nVMX: Add a prerequisite to SHADOW_FIELD_R[OW] macros KVM: nVMX: Add FRED VMCS fields KVM: nVMX: Add support for VMX FRED controls KVM: nVMX: Add VMCS FRED states checking KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests KVM: selftests: Run debug_regs test with FRED enabled KVM: selftests: Add a new VM guest mode to run user level code KVM: selftests: Add fred exception tests Documentation/virt/kvm/x86/nested-vmx.rst | 19 + arch/x86/include/asm/kvm_host.h | 8 +- arch/x86/include/asm/msr-index.h | 15 +- arch/x86/include/asm/vmx.h | 59 ++- arch/x86/kvm/cpuid.c | 4 +- arch/x86/kvm/governed_features.h | 1 + arch/x86/kvm/kvm_cache_regs.h | 17 + arch/x86/kvm/svm/svm.c | 4 +- arch/x86/kvm/vmx/capabilities.h | 30 +- arch/x86/kvm/vmx/nested.c | 329 ++++++++++++--- arch/x86/kvm/vmx/nested.h | 2 +- arch/x86/kvm/vmx/vmcs.h | 1 + arch/x86/kvm/vmx/vmcs12.c | 19 + arch/x86/kvm/vmx/vmcs12.h | 38 ++ arch/x86/kvm/vmx/vmcs_shadow_fields.h | 80 ++-- arch/x86/kvm/vmx/vmx.c | 385 +++++++++++++++--- arch/x86/kvm/vmx/vmx.h | 15 +- arch/x86/kvm/x86.c | 103 ++++- arch/x86/kvm/x86.h | 5 +- tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/include/kvm_util_base.h | 1 + .../selftests/kvm/include/x86_64/processor.h | 36 ++ tools/testing/selftests/kvm/lib/kvm_util.c | 5 +- .../selftests/kvm/lib/x86_64/processor.c | 15 +- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 4 +- .../testing/selftests/kvm/x86_64/debug_regs.c | 50 ++- .../testing/selftests/kvm/x86_64/fred_test.c | 297 ++++++++++++++ 27 files changed, 1320 insertions(+), 223 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/fred_test.c base-commit: e13841907b8fda0ae0ce1ec03684665f578416a8 -- 2.43.0

1 year, 2 months

8
93
0 0

[PATCH v3 0/3] riscv: mm: Extend mappable memory up to hint address

by Charlie Jenkins

On riscv, mmap currently returns an address from the largest address space that can fit entirely inside of the hint address. This makes it such that the hint address is almost never returned. This patch raises the mappable area up to and including the hint address. This allows mmap to often return the hint address, which allows a performance improvement over searching for a valid address as well as making the behavior more similar to other architectures. Note that a previous patch introduced stronger semantics compared to other architectures for riscv mmap. On riscv, mmap will not use bits in the upper bits of the virtual address depending on the hint address. On other architectures, a random address is returned in the address space requested. On all architectures the hint address will be returned if it is available. This allows riscv applications to configure how many bits in the virtual address should be left empty. This has the two benefits of being able to request address spaces that are smaller than the default and doesn't require the application to know the page table layout of riscv. Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com> --- Changes in v3: - Add back forgotten semi-colon - Fix test cases - Add support for rv32 - Change cover letter name so it's not the same as patch 1 - Link to v2: https://lore.kernel.org/r/20240130-use_mmap_hint_address-v2-0-f34ebfd33053@… Changes in v2: - Add back forgotten "mmap_end = STACK_TOP_MAX" - Link to v1: https://lore.kernel.org/r/20240129-use_mmap_hint_address-v1-0-4c74da813ba1@… --- Charlie Jenkins (3): riscv: mm: Use hint address in mmap if available selftests: riscv: Generalize mm selftests docs: riscv: Define behavior of mmap Documentation/arch/riscv/vm-layout.rst | 16 ++-- arch/riscv/include/asm/processor.h | 27 +++--- tools/testing/selftests/riscv/mm/mmap_bottomup.c | 23 +---- tools/testing/selftests/riscv/mm/mmap_default.c | 23 +---- tools/testing/selftests/riscv/mm/mmap_test.h | 107 ++++++++++++++--------- 5 files changed, 83 insertions(+), 113 deletions(-) --- base-commit: 556e2d17cae620d549c5474b1ece053430cd50bc change-id: 20240119-use_mmap_hint_address-f9f4b1b6f5f1 -- - Charlie

1 year, 3 months

5
19
0 0

[PATCH] tools: Copy linux/align.h into tools/

by Ackerley Tng

This provides alignment macros for use in selftests. Also clean up tools/include/linux/bitmap.h's inline definition of IS_ALIGNED(). Signed-off-by: Ackerley Tng <ackerleytng(a)google.com> --- tools/include/linux/align.h | 15 +++++++++++++++ tools/include/linux/bitmap.h | 2 +- 2 files changed, 16 insertions(+), 1 deletion(-) create mode 100644 tools/include/linux/align.h diff --git a/tools/include/linux/align.h b/tools/include/linux/align.h new file mode 100644 index 000000000000..2b4acec7b95a --- /dev/null +++ b/tools/include/linux/align.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_ALIGN_H +#define _LINUX_ALIGN_H + +#include <linux/const.h> + +/* @a is a power of 2 value */ +#define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) +#define ALIGN_DOWN(x, a) __ALIGN_KERNEL((x) - ((a) - 1), (a)) +#define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask)) +#define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a))) +#define PTR_ALIGN_DOWN(p, a) ((typeof(p))ALIGN_DOWN((unsigned long)(p), (a))) +#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) + +#endif /* _LINUX_ALIGN_H */ diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h index f3566ea0f932..8c6852dba04f 100644 --- a/tools/include/linux/bitmap.h +++ b/tools/include/linux/bitmap.h @@ -3,6 +3,7 @@ #define _TOOLS_LINUX_BITMAP_H #include <string.h> +#include <linux/align.h> #include <linux/bitops.h> #include <linux/find.h> #include <stdlib.h> @@ -126,7 +127,6 @@ static inline bool bitmap_and(unsigned long *dst, const unsigned long *src1, #define BITMAP_MEM_ALIGNMENT (8 * sizeof(unsigned long)) #endif #define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1) -#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) static inline bool bitmap_equal(const unsigned long *src1, const unsigned long *src2, unsigned int nbits) -- 2.39.2.722.g9855ee24e9-goog

1 year, 3 months

2
3
0 0

[PATCH 1/1] KVM: selftests: add kvmclock drift test

by Dongli Zhang

There is kvmclock drift issue during the vCPU hotplug. It has been fixed by the commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug"). This is to add the test to verify if the master clock is updated when we write 0 to MSR_IA32_TSC from the host side. Here is the usage example on the KVM with the bugfix reverted. $ ./kvm_clock_drift -v -p 5 kvmclock based on old pvclock_vcpu_time_info: 5012221999 version: 2 tsc_timestamp: 3277968 system_time: 11849519 tsc_to_system_mul: 2152530255 tsc_shift: 0 flags: 1 kvmclock based on new pvclock_vcpu_time_info: 5012222411 version: 4 tsc_timestamp: 9980576184 system_time: 5012222411 tsc_to_system_mul: 2152530255 tsc_shift: 0 flags: 1 ==== Test Assertion Failure ==== x86_64/kvm_clock_drift.c:216: clock_old == clock_new pid=14257 tid=14257 errno=4 - Interrupted system call 1 0x000000000040277b: main at kvm_clock_drift.c:216 2 0x00007f7766fa7e44: ?? ??:0 3 0x000000000040286d: _start at ??:? kvmclock drift detected, old=5012221999, new=5012222411 Signed-off-by: Dongli Zhang <dongli.zhang(a)oracle.com> --- tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/kvm_clock_drift.c | 223 ++++++++++++++++++ 2 files changed, 224 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_clock_drift.c diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 4412b42d95de..c665d0d8d348 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -84,6 +84,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_features TEST_GEN_PROGS_x86_64 += x86_64/hyperv_ipi TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush +TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_drift TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test diff --git a/tools/testing/selftests/kvm/x86_64/kvm_clock_drift.c b/tools/testing/selftests/kvm/x86_64/kvm_clock_drift.c new file mode 100644 index 000000000000..324f0dbc5762 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/kvm_clock_drift.c @@ -0,0 +1,223 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * The kvmclock drift test. Emulate vCPU hotplug and online to verify if + * there is kvmclock drift. + * + * Adapted from steal_time.c + * + * Copyright (C) 2020, Red Hat, Inc. + * Copyright (C) 2024 Oracle and/or its affiliates. + */ + +#include <asm/kvm_para.h> +#include <asm/pvclock.h> +#include <asm/pvclock-abi.h> +#include <sys/stat.h> + +#include "kvm_util.h" +#include "processor.h" + +#define NR_VCPUS 2 +#define NR_SLOTS 2 +#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info) +/* + * KVMCLOCK_GPA is identity mapped + */ +#define KVMCLOCK_GPA (1 << 30) + +static uint64_t kvmclock_gpa = KVMCLOCK_GPA; + +static void guest_code(int cpu) +{ + struct pvclock_vcpu_time_info *kvmclock; + + /* + * vCPU#0 is to detect the change of pvclock_vcpu_time_info + */ + if (cpu == 0) { + GUEST_SYNC(0); + + kvmclock = (struct pvclock_vcpu_time_info *) kvmclock_gpa; + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, kvmclock_gpa | KVM_MSR_ENABLED); + + /* + * Backup the pvclock_vcpu_time_info before vCPU#1 hotplug + */ + kvmclock[1] = kvmclock[0]; + + GUEST_SYNC(2); + /* + * Enter the guest to update pvclock_vcpu_time_info + */ + GUEST_SYNC(4); + } + + /* + * vCPU#1 is to emulate the vCPU hotplug + */ + if (cpu == 1) { + GUEST_SYNC(1); + /* + * This is after the host side MSR_IA32_TSC + */ + GUEST_SYNC(3); + } +} + +static void run_vcpu(struct kvm_vcpu *vcpu) +{ + struct ucall uc; + + vcpu_run(vcpu); + + switch (get_ucall(vcpu, &uc)) { + case UCALL_SYNC: + case UCALL_DONE: + break; + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + default: + TEST_ASSERT(false, "Unexpected exit: %s", + exit_reason_str(vcpu->run->exit_reason)); + } +} + +static void kvmclock_dump(struct pvclock_vcpu_time_info *kvmclock) +{ + pr_info(" version: %u\n", kvmclock->version); + pr_info(" tsc_timestamp: %lu\n", kvmclock->tsc_timestamp); + pr_info(" system_time: %lu\n", kvmclock->system_time); + pr_info(" tsc_to_system_mul: %u\n", kvmclock->tsc_to_system_mul); + pr_info(" tsc_shift: %d\n", kvmclock->tsc_shift); + pr_info(" flags: %u\n", kvmclock->flags); + pr_info("\n"); +} + +#define CLOCKSOURCE_PATH "/sys/devices/system/clocksource/clocksource0/current_clocksource" + +static void check_clocksource(void) +{ + char *clk_name; + struct stat st; + FILE *fp; + + fp = fopen(CLOCKSOURCE_PATH, "r"); + if (!fp) { + pr_info("failed to open clocksource file: %d; assuming TSC.\n", + errno); + return; + } + + if (fstat(fileno(fp), &st)) { + pr_info("failed to stat clocksource file: %d; assuming TSC.\n", + errno); + goto out; + } + + clk_name = malloc(st.st_size); + TEST_ASSERT(clk_name, "failed to allocate buffer to read file\n"); + + if (!fgets(clk_name, st.st_size, fp)) { + pr_info("failed to read clocksource file: %d; assuming TSC.\n", + ferror(fp)); + goto out; + } + + TEST_ASSERT(!strncmp(clk_name, "tsc\n", st.st_size), + "clocksource not supported: %s", clk_name); +out: + fclose(fp); +} + +int main(int argc, char *argv[]) +{ + struct pvclock_vcpu_time_info *kvmclock; + struct kvm_vcpu *vcpus[NR_VCPUS]; + uint64_t clock_old, clock_new; + bool verbose = false; + unsigned int gpages; + struct kvm_vm *vm; + int period = 2; + uint64_t tsc; + int opt; + + check_clocksource(); + + while ((opt = getopt(argc, argv, "p:vh")) != -1) { + switch (opt) { + case 'p': + period = atoi_positive("The period (seconds) between vCPU hotplug", + optarg); + break; + case 'v': + verbose = true; + break; + case 'h': + default: + pr_info("usage: %s [-p period (seconds)] [-v]\n", argv[0]); + exit(1); + } + } + + vm = vm_create_with_vcpus(NR_VCPUS, guest_code, vcpus); + gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, + KVMCLOCK_SIZE * NR_SLOTS); + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, + KVMCLOCK_GPA, 1, gpages, 0); + virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages); + + vcpu_args_set(vcpus[0], 1, 0); + vcpu_args_set(vcpus[1], 1, 1); + + /* + * Run vCPU#0 and vCPU#1 to update both pvclock_vcpu_time_info and + * master clock + */ + run_vcpu(vcpus[0]); + run_vcpu(vcpus[1]); + + /* + * Run vCPU#0 to backup the current pvclock_vcpu_time_info + */ + run_vcpu(vcpus[0]); + + sleep(period); + + /* + * Emulate the hotplug of vCPU#1 + */ + vcpu_set_msr(vcpus[1], MSR_IA32_TSC, 0); + + /* + * Emulate the online of vCPU#1 + */ + run_vcpu(vcpus[1]); + + /* + * Run vCPU#0 to backup the new pvclock_vcpu_time_info to detect + * if there is any change or kvmclock drift + */ + run_vcpu(vcpus[0]); + + kvmclock = addr_gva2hva(vm, kvmclock_gpa); + tsc = kvmclock[0].tsc_timestamp; + clock_old = __pvclock_read_cycles(&kvmclock[1], tsc); + clock_new = __pvclock_read_cycles(&kvmclock[0], tsc); + + if (verbose) { + pr_info("kvmclock based on old pvclock_vcpu_time_info: %lu\n", + clock_old); + kvmclock_dump(&kvmclock[1]); + pr_info("kvmclock based on new pvclock_vcpu_time_info: %lu\n", + clock_new); + kvmclock_dump(&kvmclock[0]); + } + + TEST_ASSERT(clock_old == clock_new, + "kvmclock drift detected, old=%lu, new=%lu", + clock_old, clock_new); + + kvm_vm_free(vm); + + return 0; +} base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6 -- 2.34.1

1 year, 3 months

3
3
0 0

[PATCH v3] lib: Convert test_user_copy to KUnit test

by Vitor Massaru Iha

This adds the conversion of the runtime tests of test_user_copy fuctions, from `lib/test_user_copy.c`to KUnit tests. Signed-off-by: Vitor Massaru Iha <vitor(a)massaru.org> --- v2: * splitted patch in 3: - Allows to install and load modules in root filesystem; - Provides an userspace memory context when tests are compiled as module; - Convert test_user_copy to KUnit test; * removed entry for CONFIG_TEST_USER_COPY; * replaced pr_warn to KUNIT_EXPECT_FALSE_MSG in test macro to decrease the diff; v3: * rebased with last kunit branch * Please apply this commit from kunit-fixes: 3f37d14b8a3152441f36b6bc74000996679f0998 And these from patchwork: https://patchwork.kernel.org/patch/11676331/ https://patchwork.kernel.org/patch/11676335/ --- lib/Kconfig.debug | 28 ++++++++------ lib/Makefile | 2 +- lib/{test_user_copy.c => user_copy_kunit.c} | 42 +++++++++------------ 3 files changed, 35 insertions(+), 37 deletions(-) rename lib/{test_user_copy.c => user_copy_kunit.c} (91%) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 9ad9210d70a1..f699a3624ae7 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2078,18 +2078,6 @@ config TEST_VMALLOC If unsure, say N. -config TEST_USER_COPY - tristate "Test user/kernel boundary protections" - depends on m - help - This builds the "test_user_copy" module that runs sanity checks - on the copy_to/from_user infrastructure, making sure basic - user/kernel boundary testing is working. If it fails to load, - a regression has been detected in the user/kernel memory boundary - protections. - - If unsure, say N. - config TEST_BPF tristate "Test BPF filter functionality" depends on m && NET @@ -2154,6 +2142,22 @@ config SYSCTL_KUNIT_TEST If unsure, say N. +config USER_COPY_KUNIT + tristate "KUnit Test for user/kernel boundary protections" + depends on KUNIT + depends on m + help + This builds the "user_copy_kunit" module that runs sanity checks + on the copy_to/from_user infrastructure, making sure basic + user/kernel boundary testing is working. If it fails to load, + a regression has been detected in the user/kernel memory boundary + protections. + + For more information on KUnit and unit tests in general please refer + to the KUnit documentation in Documentation/dev-tools/kunit/. + + If unsure, say N. + config LIST_KUNIT_TEST tristate "KUnit Test for Kernel Linked-list structures" if !KUNIT_ALL_TESTS depends on KUNIT diff --git a/lib/Makefile b/lib/Makefile index b1c42c10073b..8c145f85accc 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -78,7 +78,6 @@ obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o obj-$(CONFIG_TEST_SORT) += test_sort.o -obj-$(CONFIG_TEST_USER_COPY) += test_user_copy.o obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o obj-$(CONFIG_TEST_PRINTF) += test_printf.o @@ -318,3 +317,4 @@ obj-$(CONFIG_OBJAGG) += objagg.o # KUnit tests obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o +obj-$(CONFIG_USER_COPY_KUNIT) += user_copy_kunit.o diff --git a/lib/test_user_copy.c b/lib/user_copy_kunit.c similarity index 91% rename from lib/test_user_copy.c rename to lib/user_copy_kunit.c index 5ff04d8fe971..a10ddd15b4cd 100644 --- a/lib/test_user_copy.c +++ b/lib/user_copy_kunit.c @@ -16,6 +16,7 @@ #include <linux/slab.h> #include <linux/uaccess.h> #include <linux/vmalloc.h> +#include <kunit/test.h> /* * Several 32-bit architectures support 64-bit {get,put}_user() calls. @@ -35,7 +36,7 @@ ({ \ int cond = (condition); \ if (cond) \ - pr_warn("[%d] " msg "\n", __LINE__, ##__VA_ARGS__); \ + KUNIT_EXPECT_FALSE_MSG(test, cond, msg, ##__VA_ARGS__); \ cond; \ }) @@ -44,7 +45,7 @@ static bool is_zeroed(void *from, size_t size) return memchr_inv(from, 0x0, size) == NULL; } -static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size) +static int test_check_nonzero_user(struct kunit *test, char *kmem, char __user *umem, size_t size) { int ret = 0; size_t start, end, i, zero_start, zero_end; @@ -102,7 +103,7 @@ static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size) return ret; } -static int test_copy_struct_from_user(char *kmem, char __user *umem, +static int test_copy_struct_from_user(struct kunit *test, char *kmem, char __user *umem, size_t size) { int ret = 0; @@ -177,7 +178,7 @@ static int test_copy_struct_from_user(char *kmem, char __user *umem, return ret; } -static int __init test_user_copy_init(void) +static void user_copy_test(struct kunit *test) { int ret = 0; char *kmem; @@ -192,16 +193,14 @@ static int __init test_user_copy_init(void) #endif kmem = kmalloc(PAGE_SIZE * 2, GFP_KERNEL); - if (!kmem) - return -ENOMEM; + KUNIT_EXPECT_FALSE_MSG(test, kmem == NULL, "kmalloc failed"); user_addr = vm_mmap(NULL, 0, PAGE_SIZE * 2, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, 0); if (user_addr >= (unsigned long)(TASK_SIZE)) { - pr_warn("Failed to allocate user memory\n"); kfree(kmem); - return -ENOMEM; + KUNIT_FAIL(test, "Failed to allocate user memory"); } usermem = (char __user *)user_addr; @@ -245,9 +244,9 @@ static int __init test_user_copy_init(void) #undef test_legit /* Test usage of check_nonzero_user(). */ - ret |= test_check_nonzero_user(kmem, usermem, 2 * PAGE_SIZE); + ret |= test_check_nonzero_user(test, kmem, usermem, 2 * PAGE_SIZE); /* Test usage of copy_struct_from_user(). */ - ret |= test_copy_struct_from_user(kmem, usermem, 2 * PAGE_SIZE); + ret |= test_copy_struct_from_user(test, kmem, usermem, 2 * PAGE_SIZE); /* * Invalid usage: none of these copies should succeed. @@ -309,23 +308,18 @@ static int __init test_user_copy_init(void) vm_munmap(user_addr, PAGE_SIZE * 2); kfree(kmem); - - if (ret == 0) { - pr_info("tests passed.\n"); - return 0; - } - - return -EINVAL; } -module_init(test_user_copy_init); - -static void __exit test_user_copy_exit(void) -{ - pr_info("unloaded.\n"); -} +static struct kunit_case user_copy_test_cases[] = { + KUNIT_CASE(user_copy_test), + {} +}; -module_exit(test_user_copy_exit); +static struct kunit_suite user_copy_test_suite = { + .name = "user_copy", + .test_cases = user_copy_test_cases, +}; +kunit_test_suites(&user_copy_test_suite); MODULE_AUTHOR("Kees Cook <keescook(a)chromium.org>"); MODULE_LICENSE("GPL"); base-commit: d43c7fb05765152d4d4a39a8ef957c4ea14d8847 -- 2.26.2

1 year, 4 months

4
11
0 0

[PATCH 0/2] selftests/nolibc: run-user improvements

by Thomas Weißschuh

A small bugfix for "run-user XARCH=ppc64le" and run-user support for run-tests.sh. Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net> --- Thomas Weißschuh (2): selftests/nolibc: introduce QEMU_ARCH_USER selftests/nolibc: run-tests.sh: enable testing via qemu-user tools/testing/selftests/nolibc/Makefile | 5 ++++- tools/testing/selftests/nolibc/run-tests.sh | 22 +++++++++++++++++++--- 2 files changed, 23 insertions(+), 4 deletions(-) --- base-commit: ba335752620565c25c3028fff9496bb8ef373602 change-id: 20770915-nolibc-run-user-845375a3ec4f Best regards, -- Thomas Weißschuh <linux(a)weissschuh.net>

1 year, 4 months

5
8
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror March 2024