October 2024 - Linux-kselftest-mirror

[PATCH v4 0/3] Add test to verify probe of devices from discoverable buses

by Nícolas F. R. A. Prado

This is part of an effort to improve detection of regressions impacting
device probe on all platforms. The recently merged DT kselftest [3]
detects probe issues for all devices described statically in the DT.
That leaves out devices discovered at run-time from discoverable buses.

This is where this test comes in. All of the devices that are connected
through discoverable buses (ie USB and PCI), and which are internal and
therefore always present, can be described based on their position in
the system topology in a per-platform YAML file so they can be checked
for. The test will check that the device has been instantiated and bound
to a driver.

Patch 1 introduces the test. Patch 2 and 3 add the device definitions
for the google,spherion machine (Acer Chromebook 514) and XPS 13 as
examples.

This is the output from the test running on Spherion:

TAP version 13
Using board file: boards/google,spherion.yaml
1..8
ok 1 /usb2-controller@11200000/1.4.1/camera.device
ok 2 /usb2-controller@11200000/1.4.1/camera.0.driver
ok 3 /usb2-controller@11200000/1.4.1/camera.1.driver
ok 4 /usb2-controller@11200000/1.4.2/bluetooth.device
ok 5 /usb2-controller@11200000/1.4.2/bluetooth.0.driver
ok 6 /usb2-controller@11200000/1.4.2/bluetooth.1.driver
ok 7 /pci-controller@11230000/0.0/0.0/wifi.device
ok 8 /pci-controller@11230000/0.0/0.0/wifi.driver
Totals: pass:8 fail:0 xfail:0 xpass:0 skip:0 error:0

[3] https://lore.kernel.org/all/20230828211424.2964562-1-nfraprado@collabora.co…

Changes in v4:
- Dropped RFC tag
- Fixed 'busses' misspelling
- Link to v3: https://lore.kernel.org/all/20231227123643.52348-1-nfraprado@collabora.com

Changes in v3:
- Reverted approach of encoding stable device reference in test file
  from device match fields (from modalias) back to HW topology (from v1)
- Changed board file description to YAML
- Rewrote test script in python to handle YAML and support x86 platforms
- Link to v2: https://lore.kernel.org/all/20231127233558.868365-1-nfraprado@collabora.com

Changes in v2:
- Changed approach of encoding stable device reference in test file from
  HW topology to device match fields (the ones from modalias)
- Better documented test format
- Link to v1: https://lore.kernel.org/all/20231024211818.365844-1-nfraprado@collabora.com

---
Nícolas F. R. A. Prado (3):
      kselftest: Add test to verify probe of devices from discoverable buses
      kselftest: devices: Add sample board file for google,spherion
      kselftest: devices: Add sample board file for XPS 13 9300

 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/devices/Makefile           |   4 +
 .../devices/boards/Dell Inc.,XPS 13 9300.yaml      |  40 +++
 .../selftests/devices/boards/google,spherion.yaml  |  50 ++++
 tools/testing/selftests/devices/ksft.py            |  90 ++++++
 .../selftests/devices/test_discoverable_devices.py | 318 +++++++++++++++++++++
 6 files changed, 503 insertions(+)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240122-discoverable-devs-ksft-9d501e312688

Best regards,
-- 
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
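The two checks the cover letter describes (the device has been instantiated, and it is bound to a driver) can be sketched roughly as below. This is not the code from test_discoverable_devices.py: the function names are hypothetical, and the sketch assumes that a bound device is recognizable by a `driver` symlink inside its sysfs directory. It also reports a single driver result per device, whereas the Spherion output above reports one per USB interface.

```python
import os


def check_device(sysfs_dev_dir):
    """Return (instantiated, bound) for one sysfs device directory.

    Assumption: "bound" means the directory holds a `driver` symlink,
    which is how sysfs exposes a device's driver binding.
    """
    instantiated = os.path.isdir(sysfs_dev_dir)
    bound = instantiated and os.path.islink(os.path.join(sysfs_dev_dir, "driver"))
    return instantiated, bound


def tap_results(name, sysfs_dev_dir, first_index=1):
    """Format both checks as TAP result lines (hypothetical helper)."""
    instantiated, bound = check_device(sysfs_dev_dir)
    checks = [(f"{name}.device", instantiated), (f"{name}.driver", bound)]
    return [
        f"{'ok' if passed else 'not ok'} {first_index + i} {label}"
        for i, (label, passed) in enumerate(checks)
    ]
```

A missing device fails both checks, while a device that exists without a `driver` link fails only the second, which is the distinction the `.device` and `.driver` result lines make.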


[PATCH RESEND] x86: checksum: Fix unaligned checksums on < i686

by David Gow

The checksum_32 code was originally written to only handle 2-byte
aligned buffers, but was later extended to support arbitrary alignment.
However, the non-PPro variant doesn't apply the carry before jumping to
the 2- or 4-byte aligned versions, which clear CF.

This causes the new checksum_kunit test to fail, as it runs with a large
number of different possible alignments and both with and without
carries.

For example:
./tools/testing/kunit/kunit.py run --arch i386 --kconfig_add CONFIG_M486=y checksum
Gives:
    KTAP version 1
    # Subtest: checksum
    1..3
    ok 1 test_csum_fixed_random_inputs
    # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
    Expected result == expec, but
        result == 65281 (0xff01)
        expec == 65280 (0xff00)
    not ok 2 test_csum_all_carry_inputs
    # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:314
    Expected result == expec, but
        result == 65535 (0xffff)
        expec == 65534 (0xfffe)
    not ok 3 test_csum_no_carry_inputs

With this patch, it passes.
    KTAP version 1
    # Subtest: checksum
    1..3
    ok 1 test_csum_fixed_random_inputs
    ok 2 test_csum_all_carry_inputs
    ok 3 test_csum_no_carry_inputs

I also tested it on a real 486DX2, with the same results.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: David Gow <davidgow(a)google.com>
---

Re-sending this from [1]. While there's an argument that the whole
32-bit checksum code could do with rewriting, it's:
(a) worth fixing before someone takes the time to rewrite it, and
(b) worth any future rewrite starting from a point where the tests pass

I don't think there should be any downside to this fix: it only affects
ancient computers, and adds a single instruction which isn't in a loop.

Cheers,
-- David

[1]: https://lore.kernel.org/lkml/20230704083206.693155-2-davidgow@google.com/

---
 arch/x86/lib/checksum_32.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S
index 68f7fa3e1322..a5123b29b403 100644
--- a/arch/x86/lib/checksum_32.S
+++ b/arch/x86/lib/checksum_32.S
@@ -62,6 +62,7 @@ SYM_FUNC_START(csum_partial)
 	jl 8f
 	movzbl (%esi), %ebx
 	adcl %ebx, %eax
+	adcl $0, %eax
 	roll $8, %eax
 	inc %esi
 	testl $2, %esi
-- 
2.45.2.1089.g2a221341d9-goog
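Why the one added instruction matters can be illustrated with a small model. The sketch below is a simplification, not the kernel code: `adc32` and `add_leading_byte` are hypothetical names, the carry flag is assumed to be clear on entry, and "dropping" the carry stands in for a later flag-clearing instruction such as `testl`. It shows that a carry out of the 32-bit accumulator has to be folded back in (end-around carry) for the ones' complement sum to stay correct.

```python
MASK32 = 0xFFFFFFFF


def adc32(acc, value, carry_in):
    """Model x86 `adcl`: 32-bit add with carry-in, returning (result, carry_out)."""
    total = acc + value + carry_in
    return total & MASK32, total >> 32


def add_leading_byte(acc, byte, fold_carry):
    """Add one leading unaligned byte to the running sum.

    With fold_carry=True the carry is re-added, like the new `adcl $0, %eax`.
    With fold_carry=False the carry is simply lost (assumed model of CF
    being cleared before anything consumes it).
    """
    acc, carry = adc32(acc, byte, 0)
    if fold_carry:
        acc, carry = adc32(acc, 0, carry)
    return acc


# Accumulator about to overflow: the carry is worth 1 in ones' complement.
fixed = add_leading_byte(0xFFFFFFFF, 0x01, fold_carry=True)
broken = add_leading_byte(0xFFFFFFFF, 0x01, fold_carry=False)
```

Here `fixed` is 1 and `broken` is 0, the same kind of off-by-one seen in the failing KUnit output; when the addition does not overflow, both variants agree.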


[PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach

by Bernd Edlinger

This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.

The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.

When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes. If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.

Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending. In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.

This means there is no API change, unlike the previous
version of this patch which was discussed here:

https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d…

See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.

Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.

Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
---
 fs/exec.c                                 | 69 ++++++++++++++++-------
 fs/proc/base.c                            |  6 ++
 include/linux/cred.h                      |  1 +
 include/linux/sched/signal.h              | 18 ++++++
 kernel/cred.c                             | 28 +++++++--
 kernel/ptrace.c                           | 32 +++++++++++
 kernel/seccomp.c                          | 12 +++-
 tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
 8 files changed, 155 insertions(+), 34 deletions(-)

v10: Changes to previous version, make the PTRACE_ATTACH
return -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessons learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.

Note: I got actually one response from an automatic checker to the v11 patch,

https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/

which is complaining about:

>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@

   417			struct linux_binprm *bprm = task->signal->exec_bprm;
   418			const struct cred *old_cred;
   419			struct mm_struct *old_mm;
   420
   421			retval = down_write_killable(&task->signal->exec_update_lock);
   422			if (retval)
   423				goto unlock_creds;
   424			task_lock(task);
 > 425			old_cred = task->real_cred;

v12: Essentially identical to v11.

- Fixed a minor merge conflict in linux v5.17, and fixed the
  above mentioned nit by adding __rcu to the declaration.

- re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.

Thanks
Bernd.
diff --git a/fs/exec.c b/fs/exec.c
index 2f2b0acec4f0..902d3b230485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t = tsk;
+	bool unsafe_execve_in_progress = false;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk)
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
+	while_each_thread(tsk, t) {
+		if (unlikely(t->ptrace)
+		    && (t != tsk->group_leader || !t->exit_state))
+			unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		sig->exec_bprm = bprm;
+		mutex_unlock(&sig->cred_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	while (sig->notify_count) {
 		__set_current_state(TASK_KILLABLE);
 		spin_unlock_irq(lock);
@@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}
 
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
@@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk)
 	return 0;
 
 killed:
+	if (unlikely(unsafe_execve_in_progress)) {
+		mutex_lock(&sig->cred_guard_mutex);
+		sig->exec_bprm = NULL;
+	}
+
 	/* protects against exit_notify() and __exit_signal() */
 	read_lock(&tasklist_lock);
 	sig->group_exec_task = NULL;
@@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		return retval;
 
+	/* If the binary is not readable then enforce mm->dumpable=0 */
+	would_dump(bprm, bprm->file);
+	if (bprm->have_execfd)
+		would_dump(bprm, bprm->executable);
+
+	/*
+	 * Figure out dumpability. Note that this checking only of current
+	 * is wrong, but userspace depends on it. This should be testing
+	 * bprm->secureexec instead.
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+	    is_dumpability_changed(current_cred(), bprm->cred) ||
+	    !(uid_eq(current_euid(), current_uid()) &&
+	      gid_eq(current_egid(), current_gid())))
+		set_dumpable(bprm->mm, suid_dumpable);
+	else
+		set_dumpable(bprm->mm, SUID_DUMP_USER);
+
 	/*
 	 * Ensure all future errors are fatal.
 	 */
@@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	/*
 	 * Make this the only thread in the thread group.
 	 */
-	retval = de_thread(me);
+	retval = de_thread(me, bprm);
 	if (retval)
 		goto out;
 
@@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
-
 	/*
 	 * Release all of the old mmap stuff
 	 */
@@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	me->sas_ss_sp = me->sas_ss_size = 0;
 
-	/*
-	 * Figure out dumpability. Note that this checking only of current
-	 * is wrong, but userspace depends on it. This should be testing
-	 * bprm->secureexec instead.
-	 */
-	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
-	    !(uid_eq(current_euid(), current_uid()) &&
-	      gid_eq(current_egid(), current_gid())))
-		set_dumpable(current->mm, suid_dumpable);
-	else
-		set_dumpable(current->mm, SUID_DUMP_USER);
-
 	perf_event_exec();
 	__set_task_comm(me, kbasename(bprm->filename), true);
 
@@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..0da9adfadb48 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	if (rv < 0)
 		goto out_free;
 
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		rv = -ERESTARTNOINTR;
+		goto out_free;
+	}
+
 	rv = security_setprocattr(PROC_I(inode)->op.lsm,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f923528d5cc4..b01e309f5686 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
 extern struct cred *cred_alloc_blank(void);
 extern struct cred *prepare_creds(void);
 extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
 extern int commit_creds(struct cred *);
 extern void abort_creds(struct cred *);
 extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..14df7073a0a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
+	struct linux_binprm *exec_bprm;	/* Used to check ptrace_may_access
+					 * against new credentials while
+					 * de_thread is waiting for other
+					 * traced threads to terminate.
+					 * Set while de_thread is executing.
+					 * The cred_guard_mutex is released
+					 * after de_thread() has called
+					 * zap_other_threads(), therefore
+					 * a fatal signal is guaranteed to be
+					 * already pending in the unlikely
+					 * event, that
+					 * current->signal->exec_bprm happens
+					 * to be non-zero after the
+					 * cred_guard_mutex was acquired.
+					 */
+
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace)
+					 * Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 * Deprecated do not use in new code.
 					 * Use exec_update_lock instead.
 					 */
diff --git a/kernel/cred.c b/kernel/cred.c
index 98cb4eca23fb..586cb6c7cf6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
 	return false;
 }
 
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ * Return: true  - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+	if (!uid_eq(old->euid, new->euid) ||
+	    !gid_eq(old->egid, new->egid) ||
+	    !uid_eq(old->fsuid, new->fsuid) ||
+	    !gid_eq(old->fsgid, new->fsgid) ||
+	    !cred_cap_issubset(old, new))
+		return true;
+
+	return false;
+}
+
 /**
  * commit_creds - Install new credentials upon the current task
  * @new: The credentials to be assigned
@@ -467,11 +489,7 @@ int commit_creds(struct cred *new)
 	get_cred(new); /* we will require a ref for the subj creds too */
 
 	/* dumpability changes */
-	if (!uid_eq(old->euid, new->euid) ||
-	    !gid_eq(old->egid, new->egid) ||
-	    !uid_eq(old->fsuid, new->fsuid) ||
-	    !gid_eq(old->fsgid, new->fsgid) ||
-	    !cred_cap_issubset(old, new)) {
+	if (is_dumpability_changed(old, new)) {
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..eb1c450bb7d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
 #include <linux/pagemap.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
+#include <linux/binfmts.h>
 #include <linux/signal.h>
 #include <linux/uio.h>
 #include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (retval)
 		goto unlock_creds;
 
+	if (unlikely(task->in_execve)) {
+		struct linux_binprm *bprm = task->signal->exec_bprm;
+		const struct cred __rcu *old_cred;
+		struct mm_struct *old_mm;
+
+		retval = down_write_killable(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+		task_lock(task);
+		old_cred = task->real_cred;
+		old_mm = task->mm;
+		rcu_assign_pointer(task->real_cred, bprm->cred);
+		task->mm = bprm->mm;
+		retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+		rcu_assign_pointer(task->real_cred, old_cred);
+		task->mm = old_mm;
+		task_unlock(task);
+		up_write(&task->signal->exec_update_lock);
+		if (retval)
+			goto unlock_creds;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	retval = -EPERM;
 	if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
 {
 	int ret = -EPERM;
 
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	if (unlikely(current->signal->exec_bprm)) {
+		mutex_unlock(&current->signal->cred_guard_mutex);
+		return -ERESTARTNOINTR;
+	}
+
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
 		}
 	}
 	write_unlock_irq(&tasklist_lock);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 
 	return ret;
 }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
-	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_put_fd;
+	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+		if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+			goto out_put_fd;
+
+		if (unlikely(current->signal->exec_bprm)) {
+			mutex_unlock(&current->signal->cred_guard_mutex);
+			goto out_put_fd;
+		}
+	}
 
 	spin_lock_irq(&current->sighand->siglock);
 
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
 	f = open(mm, O_RDONLY);
 	ASSERT_GE(f, 0);
 	close(f);
-	f = kill(pid, SIGCONT);
-	ASSERT_EQ(f, 0);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_NE(f, -1);
+	ASSERT_NE(f, 0);
+	ASSERT_NE(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, pid);
+	f = waitpid(-1, NULL, 0);
+	ASSERT_EQ(f, -1);
+	ASSERT_EQ(errno, ECHILD);
 }
 
 TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
 
 	sleep(1);
 	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
-	ASSERT_EQ(errno, EAGAIN);
-	ASSERT_EQ(k, -1);
+	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, WNOHANG);
 	ASSERT_NE(k, -1);
 	ASSERT_NE(k, 0);
 	ASSERT_NE(k, pid);
 	ASSERT_EQ(WIFEXITED(s), 1);
 	ASSERT_EQ(WEXITSTATUS(s), 0);
-	sleep(1);
-	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
 	ASSERT_EQ(WIFSTOPPED(s), 1);
 	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
-	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	k = ptrace(PTRACE_CONT, pid, 0L, 0L);
 	ASSERT_EQ(k, 0);
 	k = waitpid(-1, &s, 0);
 	ASSERT_EQ(k, pid);
-- 
2.39.2


[PATCH] selftests/ftrace: Add test dependency

by Thibault Ferrante

test_duplicates is missing a runtime dependency, which leads to test
failures on kernels with specific configurations.

Signed-off-by: Thibault Ferrante <thibault.ferrante(a)canonical.com>
---
 .../testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc b/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc
index d3a79da215c8..0b5e4543e70b 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc
@@ -1,7 +1,7 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0
 # description: Generic dynamic event - check if duplicate events are caught
-# requires: dynamic_events "e[:[<group>/][<event>]] <attached-group>.<attached-event> [<args>]":README
+# requires: dynamic_events events/syscalls/sys_enter_openat "e[:[<group>/][<event>]] <attached-group>.<attached-event> [<args>]":README
 
 echo 0 > events/enable
 
-- 
2.39.2


[PATCH] selftests/ftrace: Test toplevel-enable for instance

by Zheng Yejian

'available_events' is actually not required by
'test.d/event/toplevel-enable.tc', and its existence is already tested in
'test.d/00basic/basic4.tc'.

So the requirement on 'available_events' can be dropped, and then we can
add the 'instance' flag so that 'test.d/event/toplevel-enable.tc' is also
run for an instance.

The test result is shown below:
  # ./ftracetest test.d/event/toplevel-enable.tc
  === Ftrace unit tests ===
  [1] event tracing - enable/disable with top level files [PASS]
  [2] (instance) event tracing - enable/disable with top level files [PASS]

  # of passed: 2
  # of failed: 0
  # of unresolved: 0
  # of untested: 0
  # of unsupported: 0
  # of xfailed: 0
  # of undefined(test bug): 0

Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
---
 tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc b/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc
index 93c10ea42a68..8b8e1aea985b 100644
--- a/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc
+++ b/tools/testing/selftests/ftrace/test.d/event/toplevel-enable.tc
@@ -1,7 +1,8 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0
 # description: event tracing - enable/disable with top level files
-# requires: available_events set_event events/enable
+# requires: set_event events/enable
+# flags: instance
 
 do_reset() {
     echo > set_event
-- 
2.25.1


[PATCH v3] selftests/ftrace: traceonoff_triggers: strip off names

by Yipeng Zou

The func_traceonoff_triggers.tc test sometimes fails on my board,
Kunpeng-920.

[root@localhost]# ./ftracetest ./test.d/ftrace/func_traceonoff_triggers.tc -l fail.log
=== Ftrace unit tests ===
[1] ftrace - test for function traceon/off triggers [FAIL]
[2] (instance) ftrace - test for function traceon/off triggers [UNSUPPORTED]

I looked at the log, and it shows that the md5sum differs between csum1
and csum2.

++ cnt=611
++ sleep .1
+++ cnt_trace
+++ grep -v '^#' trace
+++ wc -l
++ cnt2=611
++ '[' 611 -ne 611 ']'
+++ cat tracing_on
++ on=0
++ '[' 0 '!=' 0 ']'
+++ md5sum trace
++ csum1='76896aa74362fff66a6a5f3cf8a8a500 trace'
++ sleep .1
+++ md5sum trace
++ csum2='ee8625a21c058818fc26e45c1ed3f6de trace'
++ '[' '76896aa74362fff66a6a5f3cf8a8a500 trace' '!=' 'ee8625a21c058818fc26e45c1ed3f6de trace' ']'
++ fail 'Tracing file is still changing'
++ echo Tracing file is still changing
Tracing file is still changing
++ exit_fail
++ exit 1

So I dumped the trace file directly before each md5sum; the diff shows:

[root@localhost]# diff trace_1.log trace_2.log -y --suppress-common-lines
dockerd-12285 [036] d.... 18385.510290: sched_stat | <...>-12285 [036] d.... 18385.510290: sched_stat
dockerd-12285 [036] d.... 18385.510291: sched_swit | <...>-12285 [036] d.... 18385.510291: sched_swit
<...>-740 [044] d.... 18385.602859: sched_stat | kworker/44:1-740 [044] d.... 18385.602859: sched_stat
<...>-740 [044] d.... 18385.602860: sched_swit | kworker/44:1-740 [044] d.... 18385.602860: sched_swit

We can see that the <...> field gets filled in with names between the two
reads (and the other way around), so the file contents differ even though
no new events were recorded. We can strip off the names to fix that.

After stripping off the names:

kworker/u257:0-12 [019] d..2. 2528.758910: sched_stat | -12 [019] d..2. 2528.758910: sched_stat_runtime: comm=k
kworker/u257:0-12 [019] d..2. 2528.758912: sched_swit | -12 [019] d..2. 2528.758912: sched_switch: prev_comm=kw
<idle>-0 [000] d.s5. 2528.762318: sched_waki | -0 [000] d.s5. 2528.762318: sched_waking: comm=sshd pi
<idle>-0 [037] dNh2. 2528.762326: sched_wake | -0 [037] dNh2. 2528.762326: sched_wakeup: comm=sshd pi
<idle>-0 [037] d..2. 2528.762334: sched_swit | -0 [037] d..2. 2528.762334: sched_switch: prev_comm=sw

Fixes: d87b29179aa0 ("selftests: ftrace: Use md5sum to take less time of checking logs")
Suggested-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
 .../ftrace/test.d/ftrace/func_traceonoff_triggers.tc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc b/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
index aee22289536b..1b57771dbfdf 100644
--- a/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
@@ -90,9 +90,10 @@ if [ $on != "0" ]; then
     fail "Tracing is not off"
 fi
 
-csum1=`md5sum trace`
+# Cannot rely on names being around as they are only cached, strip them
+csum1=`cat trace | sed -e 's/^ *[^ ]*\(-[0-9][0-9]*\)/\1/' | md5sum`
 sleep $SLEEP_TIME
-csum2=`md5sum trace`
+csum2=`cat trace | sed -e 's/^ *[^ ]*\(-[0-9][0-9]*\)/\1/' | md5sum`
 
 if [ "$csum1" != "$csum2" ]; then
     fail "Tracing file is still changing"
-- 
2.34.1
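The name-stripping step can be tried outside the test with a small sketch. The code below is an illustrative Python equivalent of the patch's sed substitution followed by md5sum, not part of the selftest; `strip_comm` and `trace_digest` are hypothetical names. It drops the leading padding and task name from each line, keeps the `-<pid>` suffix, and hashes the result, so two reads of the same events hash identically whether or not the name cache resolved `<...>`.

```python
import hashlib
import re

# Leading spaces, then the task name up to the last "-<pid>"; only the
# "-<pid>" part is kept, mirroring the sed expression in the patch.
_COMM = re.compile(r"^ *[^ ]*(-[0-9][0-9]*)")


def strip_comm(line):
    """Remove the task name from one trace line, keeping the pid."""
    return _COMM.sub(r"\1", line, count=1)


def trace_digest(text):
    """MD5 of the trace text with task names stripped from every line."""
    stripped = "\n".join(strip_comm(line) for line in text.splitlines())
    return hashlib.md5(stripped.encode()).hexdigest()


first_read = "   dockerd-12285 [036] d.... 18385.510290: sched_stat"
second_read = "     <...>-12285 [036] d.... 18385.510290: sched_stat"
same = trace_digest(first_read) == trace_digest(second_read)
```

With these two sample lines `same` is true, while hashing the raw text would give two different digests. Lines that do not start with a `name-pid` field, such as `#` header lines, pass through unchanged.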


[PATCH v3 0/4] selftests/resctrl: Enable MBM and MBA tests on AMD

by Babu Moger

The MBM (Memory Bandwidth Monitoring) and MBA (Memory Bandwidth Allocation)
features are not enabled for AMD systems. The reason was lack of perf
counters to compare the resctrl test results.

Starting with the commit
25e56847821f ("perf/x86/amd/uncore: Add memory controller support"), AMD
now supports the UMC (Unified Memory Controller) perf events. These events
can be used to compare the test results.

This series adds the support to detect the UMC events and enable MBM/MBA
tests for AMD systems.

v3:
  Note: Based the series on top of latest kselftests/master
  1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0 (tag: v6.10-rc1).

  Also applied the patches from the series
  https://lore.kernel.org/lkml/20240531131142.1716-1-ilpo.jarvinen@linux.inte…

  Separated the fix patch.
  Renamed the imc to just mc to make it generic.
  Changed the search string "uncore_imc_" and "amd_umc_"
  Changes related rebase to latest kselftest tree.

v2: Changes.
  a. Rebased on top of tip/master (Apr 25, 2024)
  b. Addressed Ilpo comments except the one about close call.
     It seems more clear to keep READ and WRITE separate.
     https://lore.kernel.org/lkml/8e4badb7-6cc5-61f1-e041-d902209a90d5@linux.int…
  c. Used ksft_perror call when applicable.
  d. Added vendor check for non contiguous CBM check.

v1: https://lore.kernel.org/lkml/cover.1708637563.git.babu.moger@amd.com/

Babu Moger (4):
  selftests/resctrl: Rename variables and functions to generic names
  selftests/resctrl: Pass sysfs controller name of the vendor
  selftests/resctrl: Add support for MBM and MBA tests on AMD
  selftests/resctrl: Enable MBA/MBA tests on AMD

 tools/testing/selftests/resctrl/mba_test.c    |  25 +-
 tools/testing/selftests/resctrl/mbm_test.c    |  23 +-
 tools/testing/selftests/resctrl/resctrl.h     |   2 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 305 ++++++++++--------
 tools/testing/selftests/resctrl/resctrlfs.c   |   2 +-
 5 files changed, 191 insertions(+), 166 deletions(-)

-- 
2.34.1


[PATCH v7 0/6] mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC

by jeffxu＠chromium.org

From: Jeff Xu <jeffxu(a)google.com>

Since Linux introduced the memfd feature, memfds have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.

However, in a secure-by-default system such as ChromeOS (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables a “confused deputy attack”. E.g., in VRP bug [1]: the cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP bugs of this kind.

On the other hand, executable memfd has its legitimate uses: runc uses
memfd’s seal and executable feature to copy the contents of the binary
then execute them. For such a system, we need a solution to differentiate
runc's use of executable memfds and an attacker's [3].

To address those above, this set of patches adds the following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd be sealed for modifying X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
   X bit. For example, if a container has vm.memfd_noexec=2, then
   memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This makes it possible for a new
   LSM to reject or allow executable memfd based on its security policy.

Change history:
v7:
- patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c).
- patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of
  global ns (pid_sysctl.h). Add a tab (pid_namespace.h).
- patch 5/6: remove #ifdef (memfd_test.c)
- patch 6/6: remove unneeded security_move_mount(security.c).

v6:https://lore.kernel.org/lkml/20221206150233.1963717-1-jeffxu@google.com/
- Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h

v5:https://lore.kernel.org/lkml/20221206152358.1966099-1-jeffxu@google.com/
- Pass vm.memfd_noexec from current ns to child ns.
- Fix build issue detected by kernel test robot.
- Add missing security.c

v3:https://lore.kernel.org/lkml/20221202013404.163143-1-jeffxu@google.com/
- Address API design comments in v2.
- Let memfd_create() to set X bit at creation time.
- A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit.
- A new security hook in memfd_create().

v2:https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/
- address comments in V1.
- add sysctl (vm.mfd_noexec) to set the default file permissions of
  memfd_create to be non-executable.

v1:https://lwn.net/Articles/890096/

[1] https://crbug.com/1305411
[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20me…
[3] https://lwn.net/Articles/781013/

Daniel Verkamp (2):
  mm/memfd: add F_SEAL_EXEC
  selftests/memfd: add tests for F_SEAL_EXEC

Jeff Xu (4):
  mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
  mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
  selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
  mm/memfd: security hook for memfd_create

 include/linux/lsm_hook_defs.h              |   1 +
 include/linux/lsm_hooks.h                  |   4 +
 include/linux/pid_namespace.h              |  19 ++
 include/linux/security.h                   |   6 +
 include/uapi/linux/fcntl.h                 |   1 +
 include/uapi/linux/memfd.h                 |   4 +
 kernel/pid_namespace.c                     |   5 +
 kernel/pid_sysctl.h                        |  59 ++++
 mm/memfd.c                                 |  61 +++-
 mm/shmem.c                                 |   6 +
 security/security.c                        |   5 +
 tools/testing/selftests/memfd/fuse_test.c  |   1 +
 tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++-
 13 files changed, 510 insertions(+), 3 deletions(-)
 create mode 100644 kernel/pid_sysctl.h

base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
-- 
2.39.0.rc1.256.g54fd8350bd-goog
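The one sysctl rule spelled out in the cover letter (with vm.memfd_noexec=2, a memfd_create() call lacking MFD_NOEXEC_SEAL is rejected) can be written down as a tiny model. This is only a sketch of that single rule, not the kernel logic: `memfd_create_allowed` is a hypothetical name, the flag values are assumed to match mainline's include/uapi/linux/memfd.h, and the behavior of other sysctl levels is not modeled beyond "not rejected by this rule".

```python
# Assumed flag values, taken from mainline include/uapi/linux/memfd.h;
# the values used by this particular revision of the series may differ.
MFD_NOEXEC_SEAL = 0x0008
MFD_EXEC = 0x0010


def memfd_create_allowed(flags, memfd_noexec):
    """Model only the rule stated above for vm.memfd_noexec=2.

    Returns False when the namespace enforces non-executable memfds
    (level 2) and the caller did not pass MFD_NOEXEC_SEAL; every other
    combination is treated as permitted by this rule.
    """
    if memfd_noexec == 2 and not (flags & MFD_NOEXEC_SEAL):
        return False
    return True


runc_style = memfd_create_allowed(MFD_EXEC, memfd_noexec=2)
sealed = memfd_create_allowed(MFD_NOEXEC_SEAL, memfd_noexec=2)
```

Under these assumptions `runc_style` is false and `sealed` is true: in a container with the strict setting, a caller asking for an executable memfd is the one that gets turned away.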

3 months, 1 week

9
25

[RFC PATCH 00/39] 1G page support for guest_memfd

by Ackerley Tng

Hello,

This patchset is our exploration of how to support 1G pages in guest_memfd, and how the pages will be used in Confidential VMs.

The patchset covers:

+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
    + Selftests also demonstrate the conversion flow between VM, userspace and guest_memfd.

Why 1G pages in guest memfd?

Bring guest_memfd to performance and memory savings parity with VMs that are backed by HugeTLBfs.

+ Performance is improved with 1G pages by more TLB hits and faster page walks on TLB misses.
+ Memory savings from 1G pages comes from HugeTLB Vmemmap Optimization (HVO).

Options for 1G page support:

1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!

Comparison between options:

1. HugeTLB
    + Refactor HugeTLB to separate allocator from the rest of HugeTLB
    + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
        + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
    + Pro: Can provide iterative steps toward new future allocator
        + Unexplored: Managing userspace-visible changes
            + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used, but not when future allocator is used
2. CMA
    + Port some HugeTLB features to be applied on CMA
    + Pro: Clean slate

What would refactoring HugeTLB involve?

(Some refactoring was done in this RFC, more can be done.)

1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
    + Brings more modularity to HugeTLB
    + No functionality change intended
    + Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including the complex parts of HugeTLB like
    + Userspace reservations (resv_map)
    + Shared PMD mappings
    + Special page walkers

What features would need to be ported to CMA?

+ Improved allocation guarantees
    + Per NUMA node pool of huge pages
    + Subpools per guest_memfd
+ Memory savings
    + Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
    + Configuration of number of pages available (and per NUMA node) at and after host boot
    + Reporting of memory usage/availability statistics at runtime

HugeTLB was picked as the source of 1G pages for this RFC because it allows a graceful transition, and retains memory savings from HVO.

To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a confidential VM were to be scheduled on that host, some HugeTLBfs pages would have to be given up and returned to CMA for guest_memfd pages to be rebuilt from that memory. This requires memory to be reserved for HVO to be removed and reapplied on the new guest_memfd memory. This not only slows down memory allocation but also trims the benefits of HVO. Memory would have to be reserved on the host to facilitate these transitions.

Improving how guest_memfd uses the allocator in a future revision of this RFC:

To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB should be limited to these allocator functions:

+ reserve(node, page_size, num_pages) => opaque handle
    + Used when a guest_memfd inode is created to reserve memory from backend allocator
+ allocate(handle, mempolicy, page_size) => folio
    + To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
    + To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
    + To take a folio, and reconstruct a huge folio out of nr_pages from the first_folio
+ free(handle, folio) => void
    + To return folio to guest_memfd's reservation
+ error(handle, folio) => void
    + To handle memory errors
+ unreserve(handle) => void
    + To return guest_memfd's reservation to allocator backend

Userspace should only provide a page size when creating a guest_memfd and should not have to specify HugeTLB.

Overview of patches:

+ Patches 01-12
    + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
    + Letting guest_memfd use HugeTLB
    + Creation of each guest_memfd reserves pages from HugeTLB's global hstate and puts it into the guest_memfd inode's subpool
    + Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
    + Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
    + More small changes on the HugeTLB side to expose functions needed by guest_memfd
+ Patch 25
    + Uses the newly available functions from patches 22-24 to split HugeTLB pages. In this patch, HugeTLB folios are always split to 4K before any usage, private or shared.
+ Patches 26-28
    + Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
    + Enables conversion between private/shared pages
+ Patch 30
    + Required to zero folios after conversions to avoid leaking initialized kernel memory
+ Patches 31-38
    + Add selftests to test mapping pages to userspace, guest/host memory sharing and update conversions tests
    + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
    + Dynamically split and reconstruct HugeTLB pages instead of always splitting before use. All earlier selftests are expected to still pass.

TODOs:

+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
    + Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only is the mem_attr_array and faultability updated, the pages in the requested range are also split/reconstructed as necessary. We want to look into delaying splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
    + When running private_mem_conversions_test with more than 1 vCPU, a folio getting truncated may get faulted in by another process, causing elevated mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to 4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for faultability flag
+ Refactor HugeTLB better, present generic allocator interface

Please let us know your thoughts on:

+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
    + Requiring user to request kernel to unmap pages from userspace using madvise(MADV_DONTNEED)
    + Failing conversion on elevated mapcounts/pincounts/refcounts
    + Process of splitting/reconstructing page
+ Anything else!
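
The proposed allocator functions (reserve, allocate, split, reconstruct, free, unreserve) can be pictured with a toy accounting model. The sketch below is purely illustrative and makes several assumptions: the class and function names are hypothetical, NUMA nodes, mempolicy and error() are omitted, and split()/reconstruct() return new objects instead of operating in place on the filemap as the interface in the cover letter would.

```python
from dataclasses import dataclass

PAGE_4K = 4 << 10
PAGE_2M = 2 << 20
PAGE_1G = 1 << 30


@dataclass
class Folio:
    size: int  # bytes covered by this folio


@dataclass
class Reservation:
    """Stands in for the opaque handle returned by reserve()."""
    page_size: int
    available: int  # folios that allocate() can still hand out
    total: int      # folios taken from the backend pool


class ToyBackend:
    """Hypothetical backend holding a pool of equally sized huge pages."""

    def __init__(self, page_size: int, pool_pages: int):
        self.page_size = page_size
        self.pool_pages = pool_pages

    def reserve(self, num_pages: int) -> Reservation:
        if num_pages > self.pool_pages:
            raise MemoryError("backend pool exhausted")
        self.pool_pages -= num_pages
        return Reservation(self.page_size, num_pages, num_pages)

    def allocate(self, handle: Reservation) -> Folio:
        if handle.available == 0:
            raise MemoryError("reservation exhausted")
        handle.available -= 1
        return Folio(handle.page_size)

    def free(self, handle: Reservation, folio: Folio) -> None:
        if folio.size != handle.page_size:
            raise ValueError("reconstruct the huge folio before freeing it")
        handle.available += 1

    def unreserve(self, handle: Reservation) -> None:
        if handle.available != handle.total:
            raise RuntimeError("folios are still allocated")
        self.pool_pages += handle.total
        handle.available = handle.total = 0


def split(folio: Folio, target_page_size: int) -> list:
    """Split one folio into equally sized smaller folios."""
    if folio.size % target_page_size:
        raise ValueError("target size must evenly divide the folio")
    count = folio.size // target_page_size
    return [Folio(target_page_size) for _ in range(count)]


def reconstruct(folios: list, page_size: int) -> Folio:
    """Merge small folios back into one huge folio of page_size bytes."""
    if sum(f.size for f in folios) != page_size:
        raise ValueError("folios do not add up to one huge page")
    return Folio(page_size)
```

Splitting a 1G folio directly to 4K yields 262144 small folios, whereas the intermediate split mentioned in the TODOs would go through 512 folios of 2M, each of which splits into 512 folios of 4K.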

[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quic…

Ackerley Tng (37):
  mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
  mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
  mm: hugetlb: Remove unnecessary check for avoid_reserve
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to interpret mempolicy instead of vma
  mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
  mm: hugetlb: Refactor out hugetlb_alloc_folio
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: hugetlb: Expose hugetlb_acct_memory()
  mm: hugetlb: Move and expose hugetlb_zero_partial_page()
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  KVM: guest_memfd: hugetlb: initialization and cleanup
  KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Support various types of backing sources for private memory
  KVM: selftests: Update test for various private memory backing source types
  KVM: selftests: Add private_mem_conversions_test.sh
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  mm: hugetlb: Expose vmemmap optimization functions
  mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
  mm: hugetlb: Add functions to add/move/remove from hugetlb lists
  KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  KVM: guest_memfd: Allow mmapping guest_memfd files
  KVM: guest_memfd: Use vm_type to determine default faultability
  KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
  KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  KVM: selftests: Allow vm_set_memory_attributes to be used without asserting return value of 0
  KVM: selftests: Test using guest_memfd memory from userspace
  KVM: selftests: Test guest_memfd memory sharing between guest and host
  KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able guest_memfd
  KVM: selftests: Test that pinned pages block KVM from setting memory attributes to PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Add helper to perform madvise by memslots
  KVM: selftests: Update private_mem_conversions_test for mmap()able guest_memfd

Vishal Annapurve (2):
  KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
  KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

 fs/hugetlbfs/inode.c                          |   35 +-
 include/linux/hugetlb.h                       |   54 +-
 include/linux/kvm_host.h                      |    1 +
 include/linux/mempolicy.h                     |    2 +
 include/linux/mm.h                            |    1 +
 include/uapi/linux/kvm.h                      |   26 +
 include/uapi/linux/magic.h                    |    1 +
 mm/hugetlb.c                                  |  346 ++--
 mm/hugetlb_vmemmap.h                          |   11 -
 mm/mempolicy.c                                |   36 +-
 mm/truncate.c                                 |   26 +-
 tools/include/linux/kernel.h                  |    4 +-
 tools/testing/selftests/kvm/Makefile          |    3 +
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
 .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
 .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
 .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
 .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
 .../testing/selftests/kvm/include/test_util.h |   18 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
 .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
 .../x86_64/private_mem_conversions_test.sh    |   91 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
 virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   17 +
 virt/kvm/kvm_mm.h                             |   16 +
 27 files changed, 3288 insertions(+), 443 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
 create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh

--
2.46.0.598.g6f2099f65c-goog

7 months, 2 weeks

17
129

[PATCH v2 00/19] iommufd: Add VIOMMU infrastructure (Part-1)

by Nicolin Chen

This series introduces a new VIOMMU infrastructure and related ioctls.

IOMMUFD has been using the HWPT infrastructure for all cases, including nested IO page table support. Yet, there are limitations for an HWPT-based structure to support some advanced HW-accelerated features, such as CMDQV on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU environment, it is not straightforward for nested HWPTs to share the same parent HWPT (stage-2 IO pagetable) with the HWPT infrastructure alone.

The new VIOMMU object is an additional layer, between the nested HWPT and its parent HWPT, that gives both the IOMMUFD core and an IOMMU driver an additional structure to support HW-accelerated features:

                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         | HW-accel feats |
                     ----------------------------

On a multi-IOMMU system, the VIOMMU object can be instantiated as many times as there are vIOMMUs in a guest VM, while holding the same parent HWPT to share the stage-2 IO pagetable. Each VIOMMU then only needs to allocate its own VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:

                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         |     VMID0      |
                     ----------------------------
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested1 |--->| viommu1 ------------------
 ----------------    |         |     VMID1      |
                     ----------------------------

As an initial part-1, add ioctls to support a VIOMMU-based invalidation:
    IOMMUFD_CMD_VIOMMU_ALLOC to allocate a VIOMMU object
    IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID to set/clear device's virtual ID
    (Reuse IOMMUFD_CMD_HWPT_INVALIDATE for a VIOMMU object to flush cache by a given driver data)

Worth noting that the VDEV_ID is for a per-VIOMMU device list for drivers to look up the device's physical instance from its virtual ID in a VM. It is essential for a VIOMMU-based invalidation where the request contains a device's virtual ID for its device cache flush, e.g. ATC invalidation.

As for the implementation of the series, add an IOMMU_VIOMMU_TYPE_DEFAULT type for a core-allocated-core-managed VIOMMU object, allowing drivers to simply hook a default viommu ops for viommu-based invalidation alone. And provide some viommu helpers to drivers for VDEV_ID translation and parent domain lookup. Add VIOMMU invalidation support to the ARM SMMUv3 driver for a real-world use case. This adds support for arm-smmu-v3's CMDQ_OP_ATC_INV and CMDQ_OP_CFGI_CD/ALL commands, supplementing HWPT-based invalidations.

In the future, drivers will also be able to choose a driver-managed type to hold their own structure by adding a new type to enum iommu_viommu_type. More VIOMMU-based structures and ioctls will be introduced in part-2/3 to support a driver-managed VIOMMU, e.g. a VQUEUE object for a HW-accelerated queue and a VIRQ (or VEVENT) object for IRQ injections.
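
The role of the per-VIOMMU VDEV_ID list described above can be sketched with a small model: each VIOMMU keeps its own mapping from a guest-visible virtual device ID to the physical device, and an invalidation request carrying virtual IDs is translated through that mapping. Everything here is hypothetical and only mirrors the description in this cover letter; the class, method and device names are not the kernel's, and the one-ID-per-device restriction follows the v2 changelog entry "Limited vdev_id to one per idev".

```python
class ToyViommu:
    """Hypothetical model of a VIOMMU object and its vdev_id list."""

    def __init__(self, parent_hwpt, vmid):
        self.parent_hwpt = parent_hwpt  # shared stage-2 IO pagetable
        self.vmid = vmid                # per-VIOMMU VMID
        self._vdev_to_dev = {}
        self._dev_to_vdev = {}

    def set_vdev_id(self, dev, vdev_id):
        # One virtual ID per device, and one device per virtual ID.
        if dev in self._dev_to_vdev:
            raise ValueError("device already has a virtual ID")
        if vdev_id in self._vdev_to_dev:
            raise ValueError("virtual ID already in use")
        self._vdev_to_dev[vdev_id] = dev
        self._dev_to_vdev[dev] = vdev_id

    def unset_vdev_id(self, dev):
        vdev_id = self._dev_to_vdev.pop(dev)
        del self._vdev_to_dev[vdev_id]

    def find_dev(self, vdev_id):
        """Return the physical device for a virtual ID, or None."""
        return self._vdev_to_dev.get(vdev_id)


def translate_invalidations(viommu, vdev_ids):
    """Map the virtual IDs in an invalidation request to physical devices."""
    devices = []
    for vdev_id in vdev_ids:
        dev = viommu.find_dev(vdev_id)
        if dev is None:
            raise LookupError(f"unknown virtual device ID {vdev_id:#x}")
        devices.append(dev)
    return devices
```

Two ToyViommu instances created with the same parent_hwpt but different vmid values correspond to the viommu0/viommu1 picture above: the stage-2 pagetable is shared, while each instance resolves virtual device IDs only against its own list.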

Note that we repurposed the VIOMMU object from an earlier RFC discussion, for reference:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/

This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p1-v2

Changelog
v2
 * Limited vdev_id to one per idev
 * Added a rw_sem to protect the vdev_id list
 * Reworked driver-level APIs with proper lockings
 * Added a new viommu_api file for IOMMUFD_DRIVER config
 * Dropped useless iommu_dev point from the viommu structure
 * Added missing index numbers to new types in the uAPI header
 * Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
 * Reworked mock_viommu_cache_invalidate() using the new iommu helper
 * Reordered details of set/unset_vdev_id handlers for proper lockings
 * Added arm_smmu_cache_invalidate_user patch from Jason's nesting series
v1
 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Jason Gunthorpe (3):
  iommu: Add iommu_copy_struct_from_full_user_array helper
  iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Update comments about ATS and bypass

Nicolin Chen (16):
  iommufd: Reorder struct forward declarations
  iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
  iommu: Pass in a viommu pointer to domain_alloc_user op
  iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
  iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
  iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
  iommufd/selftest: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID test coverage
  iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
  iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
  iommufd/viommu: Add vdev_id helpers for IOMMU drivers
  iommufd/selftest: Add mock_viommu_invalidate_user op
  iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
  iommufd/selftest: Add VIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
  iommufd/viommu: Add iommufd_viommu_to_parent_domain helper
  iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
  iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate

 drivers/iommu/amd/iommu.c                     |   1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 218 ++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
 drivers/iommu/intel/iommu.c                   |   1 +
 drivers/iommu/iommufd/Makefile                |   5 +-
 drivers/iommu/iommufd/device.c                |  12 +
 drivers/iommu/iommufd/hw_pagetable.c          |  59 +++-
 drivers/iommu/iommufd/iommufd_private.h       |  37 +++
 drivers/iommu/iommufd/iommufd_test.h          |  30 ++
 drivers/iommu/iommufd/main.c                  |  12 +
 drivers/iommu/iommufd/selftest.c              | 101 ++++++-
 drivers/iommu/iommufd/viommu.c                | 196 +++++++++++++
 drivers/iommu/iommufd/viommu_api.c            |  53 ++++
 include/linux/iommu.h                         |  56 +++-
 include/linux/iommufd.h                       |  51 +++-
 include/uapi/linux/iommufd.h                  | 117 +++++++-
 tools/testing/selftests/iommu/iommufd.c       | 259 +++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 126 +++++++++
 18 files changed, 1299 insertions(+), 38 deletions(-)
 create mode 100644 drivers/iommu/iommufd/viommu.c
 create mode 100644 drivers/iommu/iommufd/viommu_api.c

--
2.43.0

7 months, 3 weeks

7
148
