Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME, e.g. for task migration, clears the flag without informing rseq and leads to stale data in userspace's rseq struct.
Patch 2 is a cleanup to try and make future bugs less likely. It's also a baby step towards moving and renaming tracehook_notify_resume() since it has nothing to do with tracing.
Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when the include path (intentionally) omits tools' uapi headers. KVM's selftests do exactly that so that they can pick up the uapi headers from the installed kernel headers, and still use various tools/ headers that mirror kernel code, e.g. linux/types.h. This allows the new test in patch 4 to reference __NR_rseq without having to manually define it.
Patch 4 is a regression test for the KVM+rseq bug.
Patch 5 is a cleanup made possible by patch 3.
v2: - Don't touch rseq_cs when handling KVM case so that rseq_syscall() will still detect a naughty userspace. [Mathieu] - Use a sequence counter + retry in the test to ensure the process isn't migrated between sched_getcpu() and reading rseq.cpu_id, i.e. to avoid a flaky test. [Mathieu] - Add Mathieu's ack for patch 2. - Add more comments in the test.
v1: https://lkml.kernel.org/r/20210818001210.4073390-1-seanjc@google.com
Sean Christopherson (5): KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume() tools: Move x86 syscall number fallbacks to .../uapi/ KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs KVM: selftests: Remove __NR_userfaultfd syscall fallback
arch/arm/kernel/signal.c | 1 - arch/arm64/kernel/signal.c | 1 - arch/csky/kernel/signal.c | 4 +- arch/mips/kernel/signal.c | 4 +- arch/powerpc/kernel/signal.c | 4 +- arch/s390/kernel/signal.c | 1 - include/linux/tracehook.h | 2 + kernel/entry/common.c | 4 +- kernel/rseq.c | 14 +- .../x86/include/{ => uapi}/asm/unistd_32.h | 0 .../x86/include/{ => uapi}/asm/unistd_64.h | 3 - tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 3 + tools/testing/selftests/kvm/rseq_test.c | 154 ++++++++++++++++++ 14 files changed, 175 insertions(+), 21 deletions(-) rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%) rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (83%) create mode 100644 tools/testing/selftests/kvm/rseq_test.c
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to transferring to a KVM guest, which is roughly equivalent to an exit to userspace and processes many of the same pending actions. While the task cannot be in an rseq critical section as the KVM path is reachable only by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a critical section still apply, e.g. the current CPU needs to be updated if the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults and other badness in userspace VMMs that use rseq in combination with KVM, e.g. due to the CPU ID being stale after task migration.
Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") Reported-by: Peter Foley pefoley@google.com Bisected-by: Doug Evans dje@google.com Cc: Shakeel Butt shakeelb@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com --- kernel/entry/kvm.c | 4 +++- kernel/rseq.c | 14 +++++++++++--- 2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 49972ee99aff..049fd06b4c3d 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work) if (ti_work & _TIF_NEED_RESCHED) schedule();
- if (ti_work & _TIF_NOTIFY_RESUME) + if (ti_work & _TIF_NOTIFY_RESUME) { tracehook_notify_resume(NULL); + rseq_handle_notify_resume(NULL, NULL); + }
ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); if (ret) diff --git a/kernel/rseq.c b/kernel/rseq.c index 35f7bd0fced0..6d45ac3dae7f 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -282,9 +282,17 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING)) return; - ret = rseq_ip_fixup(regs); - if (unlikely(ret < 0)) - goto error; + + /* + * regs is NULL if and only if the caller is in a syscall path. Skip + * fixup and leave rseq_cs as is so that rseq_sycall() will detect and + * kill a misbehaving userspace on debug kernels. + */ + if (regs) { + ret = rseq_ip_fixup(regs); + if (unlikely(ret < 0)) + goto error; + } if (unlikely(rseq_update_cpu_id(t))) goto error; return;
----- On Aug 20, 2021, at 6:49 PM, Sean Christopherson seanjc@google.com wrote:
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to transferring to a KVM guest, which is roughly equivalent to an exit to userspace and processes many of the same pending actions. While the task cannot be in an rseq critical section as the KVM path is reachable only by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a critical section still apply, e.g. the current CPU needs to be updated if the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults and other badness in userspace VMMs that use rseq in combination with KVM, e.g. due to the CPU ID being stale after task migration.
Acked-by: Mathieu Desnoyers mathieu.desnoyers@efficios.com
Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") Reported-by: Peter Foley pefoley@google.com Bisected-by: Doug Evans dje@google.com Cc: Shakeel Butt shakeelb@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com
kernel/entry/kvm.c | 4 +++- kernel/rseq.c | 14 +++++++++++--- 2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 49972ee99aff..049fd06b4c3d 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work) if (ti_work & _TIF_NEED_RESCHED) schedule();
if (ti_work & _TIF_NOTIFY_RESUME)
if (ti_work & _TIF_NOTIFY_RESUME) { tracehook_notify_resume(NULL);
rseq_handle_notify_resume(NULL, NULL);
}
ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); if (ret)
diff --git a/kernel/rseq.c b/kernel/rseq.c index 35f7bd0fced0..6d45ac3dae7f 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -282,9 +282,17 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING)) return;
- ret = rseq_ip_fixup(regs);
- if (unlikely(ret < 0))
goto error;
- /*
* regs is NULL if and only if the caller is in a syscall path. Skip
* fixup and leave rseq_cs as is so that rseq_sycall() will detect and
* kill a misbehaving userspace on debug kernels.
*/
- if (regs) {
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
- } if (unlikely(rseq_update_cpu_id(t))) goto error; return;
-- 2.33.0.rc2.250.ged5fa647cd-goog
Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now that the two function are always called back-to-back by architectures that have rseq. The rseq helper is stubbed out for architectures that don't support rseq, i.e. this is a nop across the board.
Note, tracehook_notify_resume() is horribly named and arguably does not belong in tracehook.h as literally every line of code in it has nothing to do with tracing. But, that's been true since commit a42c6ded827d ("move key_repace_session_keyring() into tracehook_notify_resume()") first usurped tracehook_notify_resume() back in 2012. Punt cleaning that mess up to future patches.
No functional change intended.
Acked-by: Mathieu Desnoyers mathieu.desnoyers@efficios.com Signed-off-by: Sean Christopherson seanjc@google.com --- arch/arm/kernel/signal.c | 1 - arch/arm64/kernel/signal.c | 1 - arch/csky/kernel/signal.c | 4 +--- arch/mips/kernel/signal.c | 4 +--- arch/powerpc/kernel/signal.c | 4 +--- arch/s390/kernel/signal.c | 1 - include/linux/tracehook.h | 2 ++ kernel/entry/common.c | 4 +--- kernel/entry/kvm.c | 4 +--- 9 files changed, 7 insertions(+), 18 deletions(-)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index a3a38d0a4c85..9df68d139965 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall) uprobe_notify_resume(regs); } else { tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs); } } local_irq_disable(); diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 23036334f4dc..22b55db13da6 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
if (thread_flags & _TIF_NOTIFY_RESUME) { tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs);
/* * If we reschedule after checking the affinity diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c index 312f046d452d..bc4238b9f709 100644 --- a/arch/csky/kernel/signal.c +++ b/arch/csky/kernel/signal.c @@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) do_signal(regs);
- if (thread_info_flags & _TIF_NOTIFY_RESUME) { + if (thread_info_flags & _TIF_NOTIFY_RESUME) tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs); - } } diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c index f1e985109da0..c9b2a75563e1 100644 --- a/arch/mips/kernel/signal.c +++ b/arch/mips/kernel/signal.c @@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, void *unused, if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) do_signal(regs);
- if (thread_info_flags & _TIF_NOTIFY_RESUME) { + if (thread_info_flags & _TIF_NOTIFY_RESUME) tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs); - }
user_enter(); } diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index e600764a926c..b93b87df499d 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -293,10 +293,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) do_signal(current); }
- if (thread_info_flags & _TIF_NOTIFY_RESUME) { + if (thread_info_flags & _TIF_NOTIFY_RESUME) tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs); - } }
static unsigned long get_tm_stackpointer(struct task_struct *tsk) diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c index 78ef53b29958..b307db26bf2d 100644 --- a/arch/s390/kernel/signal.c +++ b/arch/s390/kernel/signal.c @@ -537,5 +537,4 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) void do_notify_resume(struct pt_regs *regs) { tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs); } diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h index 3e80c4bc66f7..2564b7434b4d 100644 --- a/include/linux/tracehook.h +++ b/include/linux/tracehook.h @@ -197,6 +197,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
mem_cgroup_handle_over_high(); blkcg_maybe_throttle_current(); + + rseq_handle_notify_resume(NULL, regs); }
/* diff --git a/kernel/entry/common.c b/kernel/entry/common.c index bf16395b9e13..d5a61d565ad5 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -171,10 +171,8 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) handle_signal_work(regs, ti_work);
- if (ti_work & _TIF_NOTIFY_RESUME) { + if (ti_work & _TIF_NOTIFY_RESUME) tracehook_notify_resume(regs); - rseq_handle_notify_resume(NULL, regs); - }
/* Architecture specific TIF work */ arch_exit_to_user_mode_work(regs, ti_work); diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 049fd06b4c3d..49972ee99aff 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -19,10 +19,8 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work) if (ti_work & _TIF_NEED_RESCHED) schedule();
- if (ti_work & _TIF_NOTIFY_RESUME) { + if (ti_work & _TIF_NOTIFY_RESUME) tracehook_notify_resume(NULL); - rseq_handle_notify_resume(NULL, NULL); - }
ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); if (ret)
Move unistd_{32,64}.h from x86/include/asm to x86/include/uapi/asm so that tools/selftests that install kernel headers, e.g. KVM selftests, can include non-uapi tools headers, e.g. to get 'struct list_head', without effectively overriding the installed non-tool uapi headers.
Swapping KVM's search order, e.g. to search the kernel headers before tool headers, is not a viable option as doing results in linux/type.h and other core headers getting pulled from the kernel headers, which do not have the kernel-internal typedefs that are used through tools, including many files outside of selftests/kvm's control.
Prior to commit cec07f53c398 ("perf tools: Move syscall number fallbacks from perf-sys.h to tools/arch/x86/include/asm/"), the handcoded numbers were actual fallbacks, i.e. overriding unistd_{32,64}.h from the kernel headers was unintentional.
Signed-off-by: Sean Christopherson seanjc@google.com --- tools/arch/x86/include/{ => uapi}/asm/unistd_32.h | 0 tools/arch/x86/include/{ => uapi}/asm/unistd_64.h | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%) rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (100%)
diff --git a/tools/arch/x86/include/asm/unistd_32.h b/tools/arch/x86/include/uapi/asm/unistd_32.h similarity index 100% rename from tools/arch/x86/include/asm/unistd_32.h rename to tools/arch/x86/include/uapi/asm/unistd_32.h diff --git a/tools/arch/x86/include/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h similarity index 100% rename from tools/arch/x86/include/asm/unistd_64.h rename to tools/arch/x86/include/uapi/asm/unistd_64.h
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
Signed-off-by: Sean Christopherson seanjc@google.com --- tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 3 + tools/testing/selftests/kvm/rseq_test.c | 154 ++++++++++++++++++++++++ 3 files changed, 158 insertions(+) create mode 100644 tools/testing/selftests/kvm/rseq_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index 0709af0144c8..6d031ff6b68e 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -47,6 +47,7 @@ /kvm_page_table_test /memslot_modification_stress_test /memslot_perf_test +/rseq_test /set_memory_region_test /steal_time /kvm_binary_stats_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 5832f510a16c..0756e79cb513 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus TEST_GEN_PROGS_x86_64 += kvm_page_table_test TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test TEST_GEN_PROGS_x86_64 += memslot_perf_test +TEST_GEN_PROGS_x86_64 += rseq_test TEST_GEN_PROGS_x86_64 += set_memory_region_test TEST_GEN_PROGS_x86_64 += steal_time TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test @@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test TEST_GEN_PROGS_aarch64 += dirty_log_perf_test TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus TEST_GEN_PROGS_aarch64 += kvm_page_table_test +TEST_GEN_PROGS_aarch64 += rseq_test TEST_GEN_PROGS_aarch64 += set_memory_region_test TEST_GEN_PROGS_aarch64 += steal_time TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test @@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test TEST_GEN_PROGS_s390x += dirty_log_test TEST_GEN_PROGS_s390x += kvm_create_max_vcpus TEST_GEN_PROGS_s390x += kvm_page_table_test +TEST_GEN_PROGS_s390x += rseq_test TEST_GEN_PROGS_s390x += set_memory_region_test TEST_GEN_PROGS_s390x += kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c new file mode 100644 index 000000000000..d28d7ba1a64a --- /dev/null +++ b/tools/testing/selftests/kvm/rseq_test.c @@ -0,0 +1,154 @@ +// SPDX-License-Identifier: GPL-2.0-only +#define _GNU_SOURCE /* for program_invocation_short_name */ +#include <errno.h> +#include <fcntl.h> +#include <pthread.h> +#include <sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <signal.h> +#include <syscall.h> +#include <sys/ioctl.h> +#include <asm/atomic.h> +#include <asm/barrier.h> +#include <linux/rseq.h> +#include <linux/unistd.h> + +#include "kvm_util.h" +#include "processor.h" +#include "test_util.h" + +#define VCPU_ID 0 + +static __thread volatile struct rseq __rseq = { + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, +}; + +#define RSEQ_SIG 0xdeadbeef + +static pthread_t migration_thread; +static cpu_set_t possible_mask; +static bool done; + +static atomic_t seq_cnt; + +static void guest_code(void) +{ + for (;;) + GUEST_SYNC(0); +} + +static void sys_rseq(int flags) +{ + int r; + + r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG); + TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno)); +} + +static void *migration_worker(void *ign) +{ + cpu_set_t allowed_mask; + int r, i, nr_cpus, cpu; + + CPU_ZERO(&allowed_mask); + + nr_cpus = CPU_COUNT(&possible_mask); + + for (i = 0; i < 20000; i++) { + cpu = i % nr_cpus; + if (!CPU_ISSET(cpu, &possible_mask)) + continue; + + CPU_SET(cpu, &allowed_mask); + + /* + * Bump the sequence count twice to allow the reader to detect + * that a migration may have occurred in between rseq and sched + * CPU ID reads. An odd sequence count indicates a migration + * is in-progress, while a completely different count indicates + * a migration occurred since the count was last read. + */ + atomic_inc(&seq_cnt); + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", + errno, strerror(errno)); + atomic_inc(&seq_cnt); + + CPU_CLR(cpu, &allowed_mask); + + /* + * Let the read-side get back into KVM_RUN to improve the odds + * of task migration coinciding with KVM's run loop. + */ + usleep(1); + } + done = true; + return NULL; +} + +int main(int argc, char *argv[]) +{ + struct kvm_vm *vm; + u32 cpu, rseq_cpu; + int r, snapshot; + + /* Tell stdout not to buffer its content */ + setbuf(stdout, NULL); + + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask); + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno, + strerror(errno)); + + if (CPU_COUNT(&possible_mask) < 2) { + print_skip("Only one CPU, task migration not possible\n"); + exit(KSFT_SKIP); + } + + sys_rseq(0); + + /* + * Create and run a dummy VM that immediately exits to userspace via + * GUEST_SYNC, while concurrently migrating the process by setting its + * CPU affinity. + */ + vm = vm_create_default(VCPU_ID, 0, guest_code); + + pthread_create(&migration_thread, NULL, migration_worker, 0); + + while (!done) { + vcpu_run(vm, VCPU_ID); + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, + "Guest failed?"); + + /* + * Verify rseq's CPU matches sched's CPU. Ensure migration + * doesn't occur between sched_getcpu() and reading the rseq + * cpu_id by rereading both if the sequence count changes, or + * if the count is odd (migration in-progress). + */ + do { + /* + * Drop bit 0 to force a mismatch if the count is odd, + * i.e. if a migration is in-progress. + */ + snapshot = atomic_read(&seq_cnt) & ~1; + smp_rmb(); + cpu = sched_getcpu(); + rseq_cpu = READ_ONCE(__rseq.cpu_id); + smp_rmb(); + } while (snapshot != atomic_read(&seq_cnt)); + + TEST_ASSERT(rseq_cpu == cpu, + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); + } + + pthread_join(migration_thread, NULL); + + kvm_vm_free(vm); + + sys_rseq(RSEQ_FLAG_UNREGISTER); + + return 0; +}
----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
[...]
+#define RSEQ_SIG 0xdeadbeef
Is there any reason for defining a custom signature rather than including tools/testing/selftests/rseq/rseq.h ? This should take care of including the proper architecture header which will define the appropriate signature.
Arguably you don't define rseq critical sections in this test per se, but I'm wondering why the custom signature here.
[...]
+static void *migration_worker(void *ign) +{
- cpu_set_t allowed_mask;
- int r, i, nr_cpus, cpu;
- CPU_ZERO(&allowed_mask);
- nr_cpus = CPU_COUNT(&possible_mask);
- for (i = 0; i < 20000; i++) {
cpu = i % nr_cpus;
if (!CPU_ISSET(cpu, &possible_mask))
continue;
CPU_SET(cpu, &allowed_mask);
/*
* Bump the sequence count twice to allow the reader to detect
* that a migration may have occurred in between rseq and sched
* CPU ID reads. An odd sequence count indicates a migration
* is in-progress, while a completely different count indicates
* a migration occurred since the count was last read.
*/
atomic_inc(&seq_cnt);
So technically this atomic_inc contains the required barriers because the selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd that the semantic differs from the kernel implementation in terms of memory barriers: the kernel implementation of atomic_inc guarantees no memory barriers, but this one happens to provide full barriers pretty much by accident (selftests futex/include/atomic.h documents no such guarantee).
If this full barrier guarantee is indeed provided by the selftests atomic.h header, I would really like a comment stating that in the atomic.h header so the carpet is not pulled from under our feet by a future optimization.
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
I'm wondering if 1 microsecond is sufficient on other architectures as well. One alternative way to make this depend less on the architecture's implementation of sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the migration thread rather than use usleep, and throw away the value read. This would ensure the delay is appropriate on all architectures.
Thanks!
Mathieu
*/
usleep(1);
- }
- done = true;
- return NULL;
+}
+int main(int argc, char *argv[]) +{
- struct kvm_vm *vm;
- u32 cpu, rseq_cpu;
- int r, snapshot;
- /* Tell stdout not to buffer its content */
- setbuf(stdout, NULL);
- r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
- TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
strerror(errno));
- if (CPU_COUNT(&possible_mask) < 2) {
print_skip("Only one CPU, task migration not possible\n");
exit(KSFT_SKIP);
- }
- sys_rseq(0);
- /*
* Create and run a dummy VM that immediately exits to userspace via
* GUEST_SYNC, while concurrently migrating the process by setting its
* CPU affinity.
*/
- vm = vm_create_default(VCPU_ID, 0, guest_code);
- pthread_create(&migration_thread, NULL, migration_worker, 0);
- while (!done) {
vcpu_run(vm, VCPU_ID);
TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
"Guest failed?");
/*
* Verify rseq's CPU matches sched's CPU. Ensure migration
* doesn't occur between sched_getcpu() and reading the rseq
* cpu_id by rereading both if the sequence count changes, or
* if the count is odd (migration in-progress).
*/
do {
/*
* Drop bit 0 to force a mismatch if the count is odd,
* i.e. if a migration is in-progress.
*/
snapshot = atomic_read(&seq_cnt) & ~1;
smp_rmb();
cpu = sched_getcpu();
rseq_cpu = READ_ONCE(__rseq.cpu_id);
smp_rmb();
} while (snapshot != atomic_read(&seq_cnt));
TEST_ASSERT(rseq_cpu == cpu,
"rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
- }
- pthread_join(migration_thread, NULL);
- kvm_vm_free(vm);
- sys_rseq(RSEQ_FLAG_UNREGISTER);
- return 0;
+}
2.33.0.rc2.250.ged5fa647cd-goog
[ re-send to Darren Hart ]
----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
[...]
+#define RSEQ_SIG 0xdeadbeef
Is there any reason for defining a custom signature rather than including tools/testing/selftests/rseq/rseq.h ? This should take care of including the proper architecture header which will define the appropriate signature.
Arguably you don't define rseq critical sections in this test per se, but I'm wondering why the custom signature here.
[...]
+static void *migration_worker(void *ign) +{
- cpu_set_t allowed_mask;
- int r, i, nr_cpus, cpu;
- CPU_ZERO(&allowed_mask);
- nr_cpus = CPU_COUNT(&possible_mask);
- for (i = 0; i < 20000; i++) {
cpu = i % nr_cpus;
if (!CPU_ISSET(cpu, &possible_mask))
continue;
CPU_SET(cpu, &allowed_mask);
/*
* Bump the sequence count twice to allow the reader to detect
* that a migration may have occurred in between rseq and sched
* CPU ID reads. An odd sequence count indicates a migration
* is in-progress, while a completely different count indicates
* a migration occurred since the count was last read.
*/
atomic_inc(&seq_cnt);
So technically this atomic_inc contains the required barriers because the selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd that the semantic differs from the kernel implementation in terms of memory barriers: the kernel implementation of atomic_inc guarantees no memory barriers, but this one happens to provide full barriers pretty much by accident (selftests futex/include/atomic.h documents no such guarantee).
If this full barrier guarantee is indeed provided by the selftests atomic.h header, I would really like a comment stating that in the atomic.h header so the carpet is not pulled from under our feet by a future optimization.
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
I'm wondering if 1 microsecond is sufficient on other architectures as well. One alternative way to make this depend less on the architecture's implementation of sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the migration thread rather than use usleep, and throw away the value read. This would ensure the delay is appropriate on all architectures.
Thanks!
Mathieu
*/
usleep(1);
- }
- done = true;
- return NULL;
+}
+int main(int argc, char *argv[]) +{
- struct kvm_vm *vm;
- u32 cpu, rseq_cpu;
- int r, snapshot;
- /* Tell stdout not to buffer its content */
- setbuf(stdout, NULL);
- r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
- TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
strerror(errno));
- if (CPU_COUNT(&possible_mask) < 2) {
print_skip("Only one CPU, task migration not possible\n");
exit(KSFT_SKIP);
- }
- sys_rseq(0);
- /*
* Create and run a dummy VM that immediately exits to userspace via
* GUEST_SYNC, while concurrently migrating the process by setting its
* CPU affinity.
*/
- vm = vm_create_default(VCPU_ID, 0, guest_code);
- pthread_create(&migration_thread, NULL, migration_worker, 0);
- while (!done) {
vcpu_run(vm, VCPU_ID);
TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
"Guest failed?");
/*
* Verify rseq's CPU matches sched's CPU. Ensure migration
* doesn't occur between sched_getcpu() and reading the rseq
* cpu_id by rereading both if the sequence count changes, or
* if the count is odd (migration in-progress).
*/
do {
/*
* Drop bit 0 to force a mismatch if the count is odd,
* i.e. if a migration is in-progress.
*/
snapshot = atomic_read(&seq_cnt) & ~1;
smp_rmb();
cpu = sched_getcpu();
rseq_cpu = READ_ONCE(__rseq.cpu_id);
smp_rmb();
} while (snapshot != atomic_read(&seq_cnt));
TEST_ASSERT(rseq_cpu == cpu,
"rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
- }
- pthread_join(migration_thread, NULL);
- kvm_vm_free(vm);
- sys_rseq(RSEQ_FLAG_UNREGISTER);
- return 0;
+}
2.33.0.rc2.250.ged5fa647cd-goog
-- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
[ re-send to Darren Hart ]
----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
[...]
+#define RSEQ_SIG 0xdeadbeef
Is there any reason for defining a custom signature rather than including tools/testing/selftests/rseq/rseq.h ? This should take care of including the proper architecture header which will define the appropriate signature.
Arguably you don't define rseq critical sections in this test per se, but I'm wondering why the custom signature here.
Partly to avoid taking a dependency on rseq.h, and partly to try to call out that the test doesn't actually do any rseq critical sections.
[...]
+static void *migration_worker(void *ign) +{
- cpu_set_t allowed_mask;
- int r, i, nr_cpus, cpu;
- CPU_ZERO(&allowed_mask);
- nr_cpus = CPU_COUNT(&possible_mask);
- for (i = 0; i < 20000; i++) {
cpu = i % nr_cpus;
if (!CPU_ISSET(cpu, &possible_mask))
continue;
CPU_SET(cpu, &allowed_mask);
/*
* Bump the sequence count twice to allow the reader to detect
* that a migration may have occurred in between rseq and sched
* CPU ID reads. An odd sequence count indicates a migration
* is in-progress, while a completely different count indicates
* a migration occurred since the count was last read.
*/
atomic_inc(&seq_cnt);
So technically this atomic_inc contains the required barriers because the selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd that the semantic differs from the kernel implementation in terms of memory barriers: the kernel implementation of atomic_inc guarantees no memory barriers, but this one happens to provide full barriers pretty much by accident (selftests futex/include/atomic.h documents no such guarantee).
Yeah, I got quite lost trying to figure out what atomics the test would actually end up with.
If this full barrier guarantee is indeed provided by the selftests atomic.h header, I would really like a comment stating that in the atomic.h header so the carpet is not pulled from under our feet by a future optimization.
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't possible (though that syscall would have to be screaming fast), but the primary motivation is very much to allow the read-side enough time to get back into KVM proper.
To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run loop, i.e. sched_setaffinity() must induce task migration after the read-side has invoked ioctl(KVM_RUN).
I'm wondering if 1 microsecond is sufficient on other architectures as well.
I'm definitely wondering that as well :-)
One alternative way to make this depend less on the architecture's implementation of sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the migration thread rather than use usleep, and throw away the value read. This would ensure the delay is appropriate on all architectures.
As above, I think an arbitrary delay is required regardless of how fast sched_getcpu() can execute. One thought would be to do sched_getcpu() _and_ usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part, but I don't know that that adds meaningful value.
The real test is if someone could see if the bug repros on non-x86 hardware...
----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
[ re-send to Darren Hart ]
----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
[...]
+#define RSEQ_SIG 0xdeadbeef
Is there any reason for defining a custom signature rather than including tools/testing/selftests/rseq/rseq.h ? This should take care of including the proper architecture header which will define the appropriate signature.
Arguably you don't define rseq critical sections in this test per se, but I'm wondering why the custom signature here.
Partly to avoid taking a dependency on rseq.h, and partly to try to call out that the test doesn't actually do any rseq critical sections.
It might be good to add a comment near this define stating this then, so nobody attempts to copy this as an example.
[...]
+static void *migration_worker(void *ign) +{
- cpu_set_t allowed_mask;
- int r, i, nr_cpus, cpu;
- CPU_ZERO(&allowed_mask);
- nr_cpus = CPU_COUNT(&possible_mask);
- for (i = 0; i < 20000; i++) {
cpu = i % nr_cpus;
if (!CPU_ISSET(cpu, &possible_mask))
continue;
CPU_SET(cpu, &allowed_mask);
/*
* Bump the sequence count twice to allow the reader to detect
* that a migration may have occurred in between rseq and sched
* CPU ID reads. An odd sequence count indicates a migration
* is in-progress, while a completely different count indicates
* a migration occurred since the count was last read.
*/
atomic_inc(&seq_cnt);
So technically this atomic_inc contains the required barriers because the selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd that the semantic differs from the kernel implementation in terms of memory barriers: the kernel implementation of atomic_inc guarantees no memory barriers, but this one happens to provide full barriers pretty much by accident (selftests futex/include/atomic.h documents no such guarantee).
Yeah, I got quite lost trying to figure out what atomics the test would actually end up with.
At the very least, until things are clarified in the selftests atomic header, I would recommend adding a comment stating which memory barriers are required alongside each use of atomic_inc here. I would even prefer that we add extra (currently unneeded) write barriers to make extra sure that this stays documented. Performance of the write-side does not matter much here.
If this full barrier guarantee is indeed provided by the selftests atomic.h header, I would really like a comment stating that in the atomic.h header so the carpet is not pulled from under our feet by a future optimization.
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't possible (though that syscall would have to be screaming fast),
I don't think we have the same understanding of the livelock scenario. AFAIU the livelock would be caused by a too-small delay between the two consecutive atomic_inc() from one loop iteration to the next compared to the time it takes to perform a read-side critical section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I doubt that the sched_getcpu implementation have good odds to be fast enough to complete in that narrow window, leading to lots of read seqlock retry.
but the primary motivation is very much to allow the read-side enough time to get back into KVM proper.
I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we have:
migration thread KVM_RUN/read-side thread ----------------------------------------------------------------------------------- - ioctl(KVM_RUN) - atomic_inc_seq_cst(&seqcnt) - sched_setaffinity - atomic_inc_seq_cst(&seqcnt) - a = atomic_load(&seqcnt) & ~1 - smp_rmb() - b = LOAD_ONCE(__rseq_abi->cpu_id); - sched_getcpu() - smp_rmb() - re-load seqcnt/compare (succeeds) - Can only succeed if entire read-side happens while the seqcnt is in an even numbered state. - if (a != b) abort() /* no delay. Even counter state is very short. */ - atomic_inc_seq_cst(&seqcnt) /* Let's suppose the lack of delay causes the setaffinity to complete too early compared with KVM_RUN ioctl */ - sched_setaffinity - atomic_inc_seq_cst(&seqcnt)
/* no delay. Even counter state is very short. */ - atomic_inc_seq_cst(&seqcnt) /* Then a setaffinity from a following migration thread loop will run concurrently with KVM_RUN */ - ioctl(KVM_RUN) - sched_setaffinity - atomic_inc_seq_cst(&seqcnt)
As pointed out here, if the first setaffinity runs too early compared with KVM_RUN, a following setaffinity will run concurrently with it. However, the fact that the even counter state is very short may very well hurt progress of the read seqlock.
To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run loop, i.e. sched_setaffinity() must induce task migration after the read-side has invoked ioctl(KVM_RUN).
No argument here.
I'm wondering if 1 microsecond is sufficient on other architectures as well.
I'm definitely wondering that as well :-)
One alternative way to make this depend less on the architecture's implementation of sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the migration thread rather than use usleep, and throw away the value read. This would ensure the delay is appropriate on all architectures.
As above, I think an arbitrary delay is required regardless of how fast sched_getcpu() can execute. One thought would be to do sched_getcpu() _and_ usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part, but I don't know that that adds meaningful value.
The real test is if someone could see if the bug repros on non-x86 hardware...
As I explain in the scenario above, I don't think we agree on the reason why the delay is required. One way to confirm this would be to run the code without delay, and count how many seqcnt read-side succeed vs fail. We can then compare that with a run with a 1us delay between the migration thread loops. This data should help us come to a common understanding of the delay's role.
Thanks,
Mathieu
On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't possible (though that syscall would have to be screaming fast),
I don't think we have the same understanding of the livelock scenario. AFAIU the livelock would be caused by a too-small delay between the two consecutive atomic_inc() from one loop iteration to the next compared to the time it takes to perform a read-side critical section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I doubt that the sched_getcpu implementation have good odds to be fast enough to complete in that narrow window, leading to lots of read seqlock retry.
Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in the same loop iteration and completely ignoring the next iteration. Yes, I 100% agree that a delay to ensure forward progress is needed. An assertion in main() that the reader complete at least some reasonable number of KVM_RUNs is also probably a good idea, e.g. to rule out a false pass due to the reader never making forward progress.
FWIW, the do-while loop does make forward progress without a delay, but at ~50% throughput, give or take.
but the primary motivation is very much to allow the read-side enough time to get back into KVM proper.
I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we have:
migration thread KVM_RUN/read-side thread
- ioctl(KVM_RUN)
atomic_inc_seq_cst(&seqcnt)
sched_setaffinity
atomic_inc_seq_cst(&seqcnt) - a = atomic_load(&seqcnt) & ~1 - smp_rmb() - b = LOAD_ONCE(__rseq_abi->cpu_id); - sched_getcpu() - smp_rmb() - re-load seqcnt/compare (succeeds) - Can only succeed if entire read-side happens while the seqcnt is in an even numbered state. - if (a != b) abort() /* no delay. Even counter state is very short. */
atomic_inc_seq_cst(&seqcnt) /* Let's suppose the lack of delay causes the setaffinity to complete too early compared with KVM_RUN ioctl */
sched_setaffinity
atomic_inc_seq_cst(&seqcnt)
/* no delay. Even counter state is very short. */
atomic_inc_seq_cst(&seqcnt) /* Then a setaffinity from a following migration thread loop will run concurrently with KVM_RUN */ - ioctl(KVM_RUN)
sched_setaffinity
atomic_inc_seq_cst(&seqcnt)
As pointed out here, if the first setaffinity runs too early compared with KVM_RUN, a following setaffinity will run concurrently with it. However, the fact that the even counter state is very short may very well hurt progress of the read seqlock.
*sigh*
Several hours later, I think I finally have my head wrapped around everything.
Due to the way the test is written and because of how KVM's run loop works, TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM actually enters the guest, otherwise KVM will exit to userspace without touching the flag, i.e. it will be handled by the normal exit_to_user_mode_loop().
Where I got lost was trying to figure out why I couldn't make the bug reproduce by causing the guest to exit to KVM, but not userspace, in which case KVM should easily trigger the problematic flow as the window for sched_getcpu() to collide with KVM would be enormous. The reason I didn't go down this route for the "official" test is that, unless there's something clever I'm overlooking, it requires arch specific guest code, and ialso I don't know that forcing an exit would even be necessary/sufficient on other architectures.
Anyways, I was trying to confirm that the bug was being hit without a delay, while still retaining the sequence retry in the test. The test doesn't fail because the back-to-back atomic_inc() changes seqcnt too fast. The read-side makes forward progress, but it never observes failure because the do-while loop only ever completes after another sched_setaffinity(), never after the one that collides with KVM because it takes too long to get out of ioctl(KVM_RUN) and back to the test. I.e. the atomic_inc() in the next loop iteration (makes seq_cnt odd) always completes before the check, and so the check ends up spinning until another migration, which correctly updates rseq. This was expected and didn't confuse me.
What confused me is that I was trying to confirm the bug was being hit from within the kernel by confirming KVM observed TIF_NOTIFY_RESUME, but I misunderstood when TIF_NOTIFY_RESUME would get set. KVM can observe TIF_NOTIFY_RESUME directly, but it's rare, and I suspect happens iff sched_setaffinity() hits the small window where it collides with KVM_RUN before KVM enters the guest.
More commonly, the bug occurs when KVM sees TIF_NEED_RESCHED. In that case, KVM calls xfer_to_guest_mode_work(), which does schedule() and _that_ sets TIF_NOTIFY_RESUME. xfer_to_guest_mode_work() then mishandles TIF_NOTIFY_RESUME and the bug is hit, but my confirmation logic in KVM never fired.
So there are effectively three reasons we want a delay:
1. To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM can enter the guest so that the guest doesn't need an arch-specific VM-Exit source.
2. To let ioctl(KVM_RUN) make its way back to the test before the next round of migration.
3. To ensure the read-side can make forward progress, e.g. if sched_getcpu() involves a syscall.
After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test could be tweaked to be overtly x86-specific. But since a delay is needed for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that this test provides coverage for arm64 and/or s390 if they're ever converted to use the common xfer_to_guest_mode_work().
----- On Aug 26, 2021, at 7:54 PM, Sean Christopherson seanjc@google.com wrote:
On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't possible (though that syscall would have to be screaming fast),
I don't think we have the same understanding of the livelock scenario. AFAIU the livelock would be caused by a too-small delay between the two consecutive atomic_inc() from one loop iteration to the next compared to the time it takes to perform a read-side critical section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I doubt that the sched_getcpu implementation have good odds to be fast enough to complete in that narrow window, leading to lots of read seqlock retry.
Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in the same loop iteration and completely ignoring the next iteration. Yes, I 100% agree that a delay to ensure forward progress is needed. An assertion in main() that the reader complete at least some reasonable number of KVM_RUNs is also probably a good idea, e.g. to rule out a false pass due to the reader never making forward progress.
Agreed.
FWIW, the do-while loop does make forward progress without a delay, but at ~50% throughput, give or take.
I did not expect absolutely no progress, but a significant slow down of the read-side.
but the primary motivation is very much to allow the read-side enough time to get back into KVM proper.
I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we have:
migration thread KVM_RUN/read-side thread
- ioctl(KVM_RUN)
atomic_inc_seq_cst(&seqcnt)
sched_setaffinity
atomic_inc_seq_cst(&seqcnt) - a = atomic_load(&seqcnt) & ~1 - smp_rmb() - b = LOAD_ONCE(__rseq_abi->cpu_id); - sched_getcpu() - smp_rmb() - re-load seqcnt/compare (succeeds) - Can only succeed if entire read-side happens while the seqcnt is in an even numbered state. - if (a != b) abort() /* no delay. Even counter state is very short. */
atomic_inc_seq_cst(&seqcnt) /* Let's suppose the lack of delay causes the setaffinity to complete too early compared with KVM_RUN ioctl */
sched_setaffinity
atomic_inc_seq_cst(&seqcnt)
/* no delay. Even counter state is very short. */
atomic_inc_seq_cst(&seqcnt) /* Then a setaffinity from a following migration thread loop will run concurrently with KVM_RUN */ - ioctl(KVM_RUN)
sched_setaffinity
atomic_inc_seq_cst(&seqcnt)
As pointed out here, if the first setaffinity runs too early compared with KVM_RUN, a following setaffinity will run concurrently with it. However, the fact that the even counter state is very short may very well hurt progress of the read seqlock.
*sigh*
Several hours later, I think I finally have my head wrapped around everything.
Due to the way the test is written and because of how KVM's run loop works, TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM actually enters the guest, otherwise KVM will exit to userspace without touching the flag, i.e. it will be handled by the normal exit_to_user_mode_loop().
Where I got lost was trying to figure out why I couldn't make the bug reproduce by causing the guest to exit to KVM, but not userspace, in which case KVM should easily trigger the problematic flow as the window for sched_getcpu() to collide with KVM would be enormous. The reason I didn't go down this route for the "official" test is that, unless there's something clever I'm overlooking, it requires arch specific guest code, and ialso I don't know that forcing an exit would even be necessary/sufficient on other architectures.
Anyways, I was trying to confirm that the bug was being hit without a delay, while still retaining the sequence retry in the test. The test doesn't fail because the back-to-back atomic_inc() changes seqcnt too fast. The read-side makes forward progress, but it never observes failure because the do-while loop only ever completes after another sched_setaffinity(), never after the one that collides with KVM because it takes too long to get out of ioctl(KVM_RUN) and back to the test. I.e. the atomic_inc() in the next loop iteration (makes seq_cnt odd) always completes before the check, and so the check ends up spinning until another migration, which correctly updates rseq. This was expected and didn't confuse me.
What confused me is that I was trying to confirm the bug was being hit from within the kernel by confirming KVM observed TIF_NOTIFY_RESUME, but I misunderstood when TIF_NOTIFY_RESUME would get set. KVM can observe TIF_NOTIFY_RESUME directly, but it's rare, and I suspect happens iff sched_setaffinity() hits the small window where it collides with KVM_RUN before KVM enters the guest.
More commonly, the bug occurs when KVM sees TIF_NEED_RESCHED. In that case, KVM calls xfer_to_guest_mode_work(), which does schedule() and _that_ sets TIF_NOTIFY_RESUME. xfer_to_guest_mode_work() then mishandles TIF_NOTIFY_RESUME and the bug is hit, but my confirmation logic in KVM never fired.
So there are effectively three reasons we want a delay:
To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM can enter the guest so that the guest doesn't need an arch-specific VM-Exit source.
To let ioctl(KVM_RUN) make its way back to the test before the next round of migration.
To ensure the read-side can make forward progress, e.g. if sched_getcpu() involves a syscall.
After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test could be tweaked to be overtly x86-specific. But since a delay is needed for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that this test provides coverage for arm64 and/or s390 if they're ever converted to use the common xfer_to_guest_mode_work().
Now that we have this understanding of why we need the delay, it would be good to write this down in a comment within the test.
Does it reproduce if we randomize the delay to have it picked randomly from 0us to 100us (with 1us step) ? It would remove a lot of the needs for arch-specific magic delay value.
Thanks,
Mathieu
On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
So there are effectively three reasons we want a delay:
To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM can enter the guest so that the guest doesn't need an arch-specific VM-Exit source.
To let ioctl(KVM_RUN) make its way back to the test before the next round of migration.
To ensure the read-side can make forward progress, e.g. if sched_getcpu() involves a syscall.
After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test could be tweaked to be overtly x86-specific. But since a delay is needed for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that this test provides coverage for arm64 and/or s390 if they're ever converted to use the common xfer_to_guest_mode_work().
Now that we have this understanding of why we need the delay, it would be good to write this down in a comment within the test.
Ya, I'll get a new version out next week.
Does it reproduce if we randomize the delay to have it picked randomly from 0us to 100us (with 1us step) ? It would remove a lot of the needs for arch-specific magic delay value.
My less-than-scientific testing shows that it can reproduce at delays up to ~500us, but above ~10us the reproducibility starts to drop. The bug still reproduces reliably, it just takes more iterations, and obviously the test runs a bit slower.
Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?
----- On Aug 27, 2021, at 7:23 PM, Sean Christopherson seanjc@google.com wrote:
On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
[...]
Does it reproduce if we randomize the delay to have it picked randomly from 0us to 100us (with 1us step) ? It would remove a lot of the needs for arch-specific magic delay value.
My less-than-scientific testing shows that it can reproduce at delays up to ~500us, but above ~10us the reproducibility starts to drop. The bug still reproduces reliably, it just takes more iterations, and obviously the test runs a bit slower.
Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?
Works for me, thanks!
Mathieu
Revert the __NR_userfaultfd syscall fallback added for KVM selftests now that x86's unistd_{32,63}.h overrides are under uapi/ and thus not in KVM sefltests' search path, i.e. now that KVM gets x86 syscall numbers from the installed kernel headers.
No functional change intended.
Cc: Ben Gardon bgardon@google.com Signed-off-by: Sean Christopherson seanjc@google.com --- tools/arch/x86/include/uapi/asm/unistd_64.h | 3 --- 1 file changed, 3 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h index 4205ed4158bf..cb52a3a8b8fc 100644 --- a/tools/arch/x86/include/uapi/asm/unistd_64.h +++ b/tools/arch/x86/include/uapi/asm/unistd_64.h @@ -1,7 +1,4 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __NR_userfaultfd -#define __NR_userfaultfd 282 -#endif #ifndef __NR_perf_event_open # define __NR_perf_event_open 298 #endif
On Fri, Aug 20, 2021 at 3:50 PM Sean Christopherson seanjc@google.com wrote:
Revert the __NR_userfaultfd syscall fallback added for KVM selftests now that x86's unistd_{32,63}.h overrides are under uapi/ and thus not in KVM sefltests' search path, i.e. now that KVM gets x86 syscall numbers from the installed kernel headers.
No functional change intended.
Cc: Ben Gardon bgardon@google.com Signed-off-by: Sean Christopherson seanjc@google.com
Reviewed-by: Ben Gardon bgardon@google.com
tools/arch/x86/include/uapi/asm/unistd_64.h | 3 --- 1 file changed, 3 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h index 4205ed4158bf..cb52a3a8b8fc 100644 --- a/tools/arch/x86/include/uapi/asm/unistd_64.h +++ b/tools/arch/x86/include/uapi/asm/unistd_64.h @@ -1,7 +1,4 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __NR_userfaultfd -#define __NR_userfaultfd 282 -#endif #ifndef __NR_perf_event_open # define __NR_perf_event_open 298
#endif
2.33.0.rc2.250.ged5fa647cd-goog
linux-kselftest-mirror@lists.linaro.org