Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME, e.g. for task migration, clears the flag without informing rseq and leads to stale data in userspace's rseq struct.
Patch 2 is a cleanup to try and make future bugs less likely. It's also a baby step towards moving and renaming tracehook_notify_resume() since it has nothing to do with tracing. It kills me to not do the move/rename as part of this series, but having a dedicated series/discussion seems more appropriate given the sheer number of architectures that call tracehook_notify_resume() and the lack of an obvious home for the code.
Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when the include path (intentionally) omits tools' uapi headers. KVM's selftests do exactly that so that they can pick up the uapi headers from the installed kernel headers, and still use various tools/ headers that mirror kernel code, e.g. linux/types.h. This allows the new test in patch 4 to reference __NR_rseq without having to manually define it.
Patch 4 is a regression test for the KVM+rseq bug.
Patch 5 is a cleanup made possible by patch 3.
Sean Christopherson (5):
  KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
  entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()
  tools: Move x86 syscall number fallbacks to .../uapi/
  KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
  KVM: selftests: Remove __NR_userfaultfd syscall fallback
 arch/arm/kernel/signal.c                           |   1 -
 arch/arm64/kernel/signal.c                         |   1 -
 arch/csky/kernel/signal.c                          |   4 +-
 arch/mips/kernel/signal.c                          |   4 +-
 arch/powerpc/kernel/signal.c                       |   4 +-
 arch/s390/kernel/signal.c                          |   1 -
 include/linux/tracehook.h                          |   2 +
 kernel/entry/common.c                              |   4 +-
 kernel/rseq.c                                      |   4 +-
 .../x86/include/{ => uapi}/asm/unistd_32.h         |   0
 .../x86/include/{ => uapi}/asm/unistd_64.h         |   3 -
 tools/testing/selftests/kvm/.gitignore             |   1 +
 tools/testing/selftests/kvm/Makefile               |   3 +
 tools/testing/selftests/kvm/rseq_test.c            | 131 ++++++++++++++++++
 14 files changed, 143 insertions(+), 20 deletions(-)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (83%)
 create mode 100644 tools/testing/selftests/kvm/rseq_test.c
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to transferring to a KVM guest, which is roughly equivalent to an exit to userspace and processes many of the same pending actions. While the task cannot be in an rseq critical section as the KVM path is reachable only via ioctl(KVM_RUN), the side effects that apply to rseq outside of a critical section still apply, e.g. the CPU ID needs to be updated if the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults and other badness in userspace VMMs that use rseq in combination with KVM, e.g. due to the CPU ID being stale after task migration.
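To illustrate the failure mode, consider a hypothetical userspace fast path (names invented for this example, not taken from any real VMM) that indexes per-CPU data by the rseq CPU ID:

	/* Registered with the kernel via the rseq syscall. */
	static __thread volatile struct rseq __rseq_abi;

	/* NR_CPUS_MAX and per_cpu_stat are made up for this example. */
	static uint64_t per_cpu_stat[NR_CPUS_MAX];

	/*
	 * If __rseq_abi.cpu_id is stale after a migration, this increments
	 * (and races on) another CPU's slot instead of the local one.
	 */
	per_cpu_stat[READ_ONCE(__rseq_abi.cpu_id)]++;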
Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
Reported-by: Peter Foley <pefoley@google.com>
Bisected-by: Doug Evans <dje@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 kernel/entry/kvm.c | 4 +++-
 kernel/rseq.c      | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 49972ee99aff..049fd06b4c3d 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		if (ti_work & _TIF_NEED_RESCHED)
 			schedule();
 
-		if (ti_work & _TIF_NOTIFY_RESUME)
+		if (ti_work & _TIF_NOTIFY_RESUME) {
 			tracehook_notify_resume(NULL);
+			rseq_handle_notify_resume(NULL, NULL);
+		}
 
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..58c79a7918cd 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
 
 static int rseq_ip_fixup(struct pt_regs *regs)
 {
-	unsigned long ip = instruction_pointer(regs);
+	unsigned long ip = regs ? instruction_pointer(regs) : 0;
 	struct task_struct *t = current;
 	struct rseq_cs rseq_cs;
 	int ret;
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
+	if (!regs || !in_rseq_cs(ip, &rseq_cs))
 		return clear_rseq_cs(t);
 	ret = rseq_need_restart(t, rseq_cs.flags);
 	if (ret <= 0)
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to transferring to a KVM guest, which is roughly equivalent to an exit to userspace and processes many of the same pending actions. While the task cannot be in an rseq critical section as the KVM path is reachable only via ioctl(KVM_RUN), the side effects that apply to rseq outside of a critical section still apply, e.g. the CPU ID needs to be updated if the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults and other badness in userspace VMMs that use rseq in combination with KVM, e.g. due to the CPU ID being stale after task migration.
I agree with the problem assessment, but I would recommend a small change to this fix.
Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
Reported-by: Peter Foley <pefoley@google.com>
Bisected-by: Doug Evans <dje@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 kernel/entry/kvm.c | 4 +++-
 kernel/rseq.c      | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 49972ee99aff..049fd06b4c3d 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		if (ti_work & _TIF_NEED_RESCHED)
 			schedule();
 
-		if (ti_work & _TIF_NOTIFY_RESUME)
+		if (ti_work & _TIF_NOTIFY_RESUME) {
 			tracehook_notify_resume(NULL);
+			rseq_handle_notify_resume(NULL, NULL);
+		}
 
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..58c79a7918cd 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
 
 static int rseq_ip_fixup(struct pt_regs *regs)
 {
-	unsigned long ip = instruction_pointer(regs);
+	unsigned long ip = regs ? instruction_pointer(regs) : 0;
 	struct task_struct *t = current;
 	struct rseq_cs rseq_cs;
 	int ret;
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
+	if (!regs || !in_rseq_cs(ip, &rseq_cs))
I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work.
If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns.
However, clearing the thread's rseq_cs will prevent this from happening.
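For reference, the CONFIG_DEBUG_RSEQ check on syscall exit is roughly the following (a simplified sketch of rseq_syscall() in kernel/rseq.c; treat it as an approximation, not the exact code):

	void rseq_syscall(struct pt_regs *regs)
	{
		unsigned long ip = instruction_pointer(regs);
		struct task_struct *t = current;
		struct rseq_cs rseq_cs;

		if (!t->rseq)
			return;
		/* Kill the task if the syscall was issued from a rseq c.s. */
		if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
			force_sig(SIGSEGV);
	}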
So I would favor an approach where we simply do:
	if (!regs)
		return 0;
Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in an ioctl system call.
Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section.
Thanks for looking into this !
Mathieu
 		return clear_rseq_cs(t);
 	ret = rseq_need_restart(t, rseq_cs.flags);
 	if (ret <= 0)
--
2.33.0.rc1.237.g0d66db33f3-goog
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
+	if (!regs || !in_rseq_cs(ip, &rseq_cs))
I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work.
If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns.
However, clearing the thread's rseq_cs will prevent this from happening.
So I would favor an approach where we simply do:
	if (!regs)
		return 0;
Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in an ioctl system call.
Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section.
Ha, that's effectively what I implemented first, but I changed it because of the comment in clear_rseq_cs() that says:
The rseq_cs field is set to NULL on preemption or signal delivery ... as well as on top of code outside of the rseq assembly block.
Which makes it sound like something might rely on clearing rseq_cs?
Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an rseq critical section, and because syscalls in critical sections are illegal, by definition clearing rseq_cs is a nop unless userspace is misbehaving.
If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is it not worth the extra code to detect an error that will likely be caught anyways?
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..28b8342290b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 
 	if (unlikely(t->flags & PF_EXITING))
 		return;
+	if (!regs) {
+#ifdef CONFIG_DEBUG_RSEQ
+		if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
+			goto error;
+#endif
+		return;
+	}
 	ret = rseq_ip_fixup(regs);
 	if (unlikely(ret < 0))
 		goto error;
Thanks for looking into this !
Mathieu
 		return clear_rseq_cs(t);
 	ret = rseq_need_restart(t, rseq_cs.flags);
 	if (ret <= 0)
--
2.33.0.rc1.237.g0d66db33f3-goog

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Aug 19, 2021, at 7:48 PM, Sean Christopherson seanjc@google.com wrote:
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
+	if (!regs || !in_rseq_cs(ip, &rseq_cs))
I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work.
If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns.
However, clearing the thread's rseq_cs will prevent this from happening.
So I would favor an approach where we simply do:
	if (!regs)
		return 0;
Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in an ioctl system call.
Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section.
Ha, that's effectively what I implemented first, but I changed it because of the comment in clear_rseq_cs() that says:
The rseq_cs field is set to NULL on preemption or signal delivery ... as well as on top of code outside of the rseq assembly block.
Which makes it sound like something might rely on clearing rseq_cs?
This comment is describing succinctly the lazy clear scheme for rseq_cs.
Without the lazy clear scheme, a rseq c.s. would look like:
 * init(rseq_cs)
 * cpu = TLS->rseq::cpu_id_start
 * [1]   TLS->rseq::rseq_cs = rseq_cs
 * [start_ip]  ----------------------------
 * [2]   if (cpu != TLS->rseq::cpu_id)
 *               goto abort_ip;
 * [3]   <last_instruction_in_cs>
 * [post_commit_ip]  ----------------------------
 * [4]   TLS->rseq::rseq_cs = NULL
But as a fast-path optimization, [4] is not entirely needed because the rseq_cs descriptor contains information about the instruction pointer range of the critical section. Therefore, userspace can omit [4], but if the kernel never clears it, it means that it will have to re-read the rseq_cs descriptor's content each time it needs to check it to confirm that it is not nested over a rseq c.s.
So making the kernel lazily clear the rseq_cs pointer is just an optimization which ensures that the kernel won't do useless work the next time it needs to check rseq_cs, given that it has already validated that the userspace code is currently not within the rseq c.s. currently advertised by the rseq_cs field.
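Concretely, the fast path boils down to something like this (a sketch approximating the flow in kernel/rseq.c, not the exact code):

	u64 ptr;

	/* The only uaccess on the fast path: read TLS->rseq::rseq_cs. */
	if (copy_from_user(&ptr, &t->rseq->rseq_cs, sizeof(ptr)))
		return -EFAULT;
	if (!ptr)
		return 0;	/* lazily cleared earlier, nothing to validate */

	/* Slow path: copy the full rseq_cs descriptor and validate it. */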
Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an rseq critical section, and because syscalls in critical sections are illegal, by definition clearing rseq_cs is a nop unless userspace is misbehaving.
Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ code executed when returning from ioctl to userspace will be able to validate that it is not nested within a rseq critical section.
If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is it not worth the extra code to detect an error that will likely be caught anyways?
The error will indeed already be caught on return from ioctl to userspace, so I don't see any added value in duplicating this check.
Thanks,
Mathieu
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..28b8342290b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 
 	if (unlikely(t->flags & PF_EXITING))
 		return;
+	if (!regs) {
+#ifdef CONFIG_DEBUG_RSEQ
+		if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
+			goto error;
+#endif
+		return;
+	}
 	ret = rseq_ip_fixup(regs);
 	if (unlikely(ret < 0))
 		goto error;
Thanks for looking into this !
Mathieu
 		return clear_rseq_cs(t);
 	ret = rseq_need_restart(t, rseq_cs.flags);
 	if (ret <= 0)
--
2.33.0.rc1.237.g0d66db33f3-goog

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Fri, Aug 20, 2021, Mathieu Desnoyers wrote:
Without the lazy clear scheme, a rseq c.s. would look like:
 * init(rseq_cs)
 * cpu = TLS->rseq::cpu_id_start
 * [1]   TLS->rseq::rseq_cs = rseq_cs
 * [start_ip]  ----------------------------
 * [2]   if (cpu != TLS->rseq::cpu_id)
 *               goto abort_ip;
 * [3]   <last_instruction_in_cs>
 * [post_commit_ip]  ----------------------------
 * [4]   TLS->rseq::rseq_cs = NULL
But as a fast-path optimization, [4] is not entirely needed because the rseq_cs descriptor contains information about the instruction pointer range of the critical section. Therefore, userspace can omit [4], but if the kernel never clears it, it means that it will have to re-read the rseq_cs descriptor's content each time it needs to check it to confirm that it is not nested over a rseq c.s.
So making the kernel lazily clear the rseq_cs pointer is just an optimization which ensures that the kernel won't do useless work the next time it needs to check rseq_cs, given that it has already validated that the userspace code is currently not within the rseq c.s. currently advertised by the rseq_cs field.
Thanks for the explanation, much appreciated!
On 20/08/21 20:51, Mathieu Desnoyers wrote:
Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an rseq critical section, and because syscalls in critical sections are illegal, by definition clearing rseq_cs is a nop unless userspace is misbehaving.
Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ code executed when returning from ioctl to userspace will be able to validate that it is not nested within a rseq critical section.
If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is it not worth the extra code to detect an error that will likely be caught anyways?
The error will indeed already be caught on return from ioctl to userspace, so I don't see any added value in duplicating this check.
Sean, can you send a v2 (even for this patch only would be okay)?
Thanks,
Paolo
On Mon, Sep 06, 2021, Paolo Bonzini wrote:
On 20/08/21 20:51, Mathieu Desnoyers wrote:
Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an rseq critical section, and because syscalls in critical sections are illegal, by definition clearing rseq_cs is a nop unless userspace is misbehaving.
Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ code executed when returning from ioctl to userspace will be able to validate that it is not nested within a rseq critical section.
If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is it not worth the extra code to detect an error that will likely be caught anyways?
The error will indeed already be caught on return from ioctl to userspace, so I don't see any added value in duplicating this check.
Sean, can you send a v2 (even for this patch only would be okay)?
Made it all the way to v3 while you were out :-)
https://lkml.kernel.org/r/20210901203030.1292304-1-seanjc@google.com
Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now that the two functions are always called back-to-back by architectures that have rseq. The rseq helper is stubbed out for architectures that don't support rseq, i.e. this is a nop across the board.
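For reference, the stubbing works roughly as follows (a simplified sketch of the rseq hooks in include/linux/sched.h; the exact code may differ slightly):

	#ifdef CONFIG_RSEQ
	void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs);

	static inline void rseq_handle_notify_resume(struct ksignal *ksig,
						     struct pt_regs *regs)
	{
		if (current->rseq)
			__rseq_handle_notify_resume(ksig, regs);
	}
	#else
	static inline void rseq_handle_notify_resume(struct ksignal *ksig,
						     struct pt_regs *regs)
	{
	}
	#endif

So folding the call into tracehook_notify_resume() costs nothing on !CONFIG_RSEQ builds and is a nop for tasks that never registered an rseq area.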
Note, tracehook_notify_resume() is horribly named and arguably does not belong in tracehook.h as literally every line of code in it has nothing to do with tracing. But, that's been true since commit a42c6ded827d ("move key_replace_session_keyring() into tracehook_notify_resume()") first usurped tracehook_notify_resume() back in 2012. Punt cleaning that mess up to future patches.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm/kernel/signal.c     | 1 -
 arch/arm64/kernel/signal.c   | 1 -
 arch/csky/kernel/signal.c    | 4 +---
 arch/mips/kernel/signal.c    | 4 +---
 arch/powerpc/kernel/signal.c | 4 +---
 arch/s390/kernel/signal.c    | 1 -
 include/linux/tracehook.h    | 2 ++
 kernel/entry/common.c        | 4 +---
 kernel/entry/kvm.c           | 4 +---
 9 files changed, 7 insertions(+), 18 deletions(-)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index a3a38d0a4c85..9df68d139965 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
 				uprobe_notify_resume(regs);
 			} else {
 				tracehook_notify_resume(regs);
-				rseq_handle_notify_resume(NULL, regs);
 			}
 		}
 		local_irq_disable();
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 23036334f4dc..22b55db13da6 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		if (thread_flags & _TIF_NOTIFY_RESUME) {
 			tracehook_notify_resume(regs);
-			rseq_handle_notify_resume(NULL, regs);
 
 			/*
 			 * If we reschedule after checking the affinity
diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c
index 312f046d452d..bc4238b9f709 100644
--- a/arch/csky/kernel/signal.c
+++ b/arch/csky/kernel/signal.c
@@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 		do_signal(regs);
 
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-		rseq_handle_notify_resume(NULL, regs);
-	}
 }
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index f1e985109da0..c9b2a75563e1 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, void *unused,
 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 		do_signal(regs);
 
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-		rseq_handle_notify_resume(NULL, regs);
-	}
 
 	user_enter();
 }
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index e600764a926c..b93b87df499d 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -293,10 +293,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 		do_signal(current);
 	}
 
-	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
+	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-		rseq_handle_notify_resume(NULL, regs);
-	}
 }
 
 static unsigned long get_tm_stackpointer(struct task_struct *tsk)
diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
index 78ef53b29958..b307db26bf2d 100644
--- a/arch/s390/kernel/signal.c
+++ b/arch/s390/kernel/signal.c
@@ -537,5 +537,4 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
 void do_notify_resume(struct pt_regs *regs)
 {
 	tracehook_notify_resume(regs);
-	rseq_handle_notify_resume(NULL, regs);
 }
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 3e80c4bc66f7..2564b7434b4d 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -197,6 +197,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 
 	mem_cgroup_handle_over_high();
 	blkcg_maybe_throttle_current();
+
+	rseq_handle_notify_resume(NULL, regs);
 }
 
 /*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bf16395b9e13..d5a61d565ad5 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,10 +171,8 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);
 
-		if (ti_work & _TIF_NOTIFY_RESUME) {
+		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(regs);
-			rseq_handle_notify_resume(NULL, regs);
-		}
 
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 049fd06b4c3d..49972ee99aff 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -19,10 +19,8 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		if (ti_work & _TIF_NEED_RESCHED)
 			schedule();
 
-		if (ti_work & _TIF_NOTIFY_RESUME) {
+		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(NULL);
-			rseq_handle_notify_resume(NULL, NULL);
-		}
 
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now that the two functions are always called back-to-back by architectures that have rseq. The rseq helper is stubbed out for architectures that don't support rseq, i.e. this is a nop across the board.
Note, tracehook_notify_resume() is horribly named and arguably does not belong in tracehook.h as literally every line of code in it has nothing to do with tracing. But, that's been true since commit a42c6ded827d ("move key_replace_session_keyring() into tracehook_notify_resume()") first usurped tracehook_notify_resume() back in 2012. Punt cleaning that mess up to future patches.
No functional change intended.
This will make it harder to introduce new code paths which consume the NOTIFY_RESUME without calling the rseq callback, which introduces issues. Agreed.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Move unistd_{32,64}.h from x86/include/asm to x86/include/uapi/asm so that tools/selftests that install kernel headers, e.g. KVM selftests, can include non-uapi tools headers, e.g. to get 'struct list_head', without effectively overriding the installed non-tool uapi headers.
Swapping KVM's search order, e.g. to search the kernel headers before tools' headers, is not a viable option as doing so results in linux/types.h and other core headers getting pulled from the kernel headers, which do not have the kernel-internal typedefs that are used throughout tools, including many files outside of selftests/kvm's control.
Prior to commit cec07f53c398 ("perf tools: Move syscall number fallbacks from perf-sys.h to tools/arch/x86/include/asm/"), the handcoded numbers were actual fallbacks, i.e. overriding unistd_{32,64}.h from the kernel headers was unintentional.
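For context, the selftests' include order is roughly as follows (paraphrasing tools/testing/selftests/kvm/Makefile from memory; exact variable names and other flags may differ):

	LINUX_TOOL_INCLUDE = $(top_srcdir)/tools/include
	LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include
	LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include
	# Tools' headers are searched first, so anything present under tools/
	# shadows the corresponding installed kernel header.
	CFLAGS += -I$(LINUX_TOOL_INCLUDE) -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) ...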
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/arch/x86/include/{ => uapi}/asm/unistd_32.h | 0
 tools/arch/x86/include/{ => uapi}/asm/unistd_64.h | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (100%)
diff --git a/tools/arch/x86/include/asm/unistd_32.h b/tools/arch/x86/include/uapi/asm/unistd_32.h
similarity index 100%
rename from tools/arch/x86/include/asm/unistd_32.h
rename to tools/arch/x86/include/uapi/asm/unistd_32.h
diff --git a/tools/arch/x86/include/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
similarity index 100%
rename from tools/arch/x86/include/asm/unistd_64.h
rename to tools/arch/x86/include/uapi/asm/unistd_64.h
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/.gitignore  |   1 +
 tools/testing/selftests/kvm/Makefile    |   3 +
 tools/testing/selftests/kvm/rseq_test.c | 131 ++++++++++++++++++++++++
 3 files changed, 135 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/rseq_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 0709af0144c8..6d031ff6b68e 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -47,6 +47,7 @@
 /kvm_page_table_test
 /memslot_modification_stress_test
 /memslot_perf_test
+/rseq_test
 /set_memory_region_test
 /steal_time
 /kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 5832f510a16c..0756e79cb513 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += kvm_page_table_test
 TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86_64 += memslot_perf_test
+TEST_GEN_PROGS_x86_64 += rseq_test
 TEST_GEN_PROGS_x86_64 += set_memory_region_test
 TEST_GEN_PROGS_x86_64 += steal_time
 TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
@@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_aarch64 += kvm_page_table_test
+TEST_GEN_PROGS_aarch64 += rseq_test
 TEST_GEN_PROGS_aarch64 += set_memory_region_test
 TEST_GEN_PROGS_aarch64 += steal_time
 TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test
@@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
 TEST_GEN_PROGS_s390x += dirty_log_test
 TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
 TEST_GEN_PROGS_s390x += kvm_page_table_test
+TEST_GEN_PROGS_s390x += rseq_test
 TEST_GEN_PROGS_s390x += set_memory_region_test
 TEST_GEN_PROGS_s390x += kvm_binary_stats_test
diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c
new file mode 100644
index 000000000000..90ed535eded7
--- /dev/null
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <linux/rseq.h>
+#include <linux/unistd.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+#define VCPU_ID 0
+
+static __thread volatile struct rseq __rseq = {
+	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+#define RSEQ_SIG 0xdeadbeef
+
+static pthread_t migration_thread;
+static cpu_set_t possible_mask;
+static bool done;
+
+static void guest_code(void)
+{
+	for (;;)
+		GUEST_SYNC(0);
+}
+
+static void sys_rseq(int flags)
+{
+	int r;
+
+	r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG);
+	TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno));
+}
+
+static void *migration_worker(void *ign)
+{
+	cpu_set_t allowed_mask;
+	int r, i, nr_cpus, cpu;
+
+	CPU_ZERO(&allowed_mask);
+
+	nr_cpus = CPU_COUNT(&possible_mask);
+
+	for (i = 0; i < 20000; i++) {
+		cpu = i % nr_cpus;
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		CPU_SET(cpu, &allowed_mask);
+
+		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno,
+			    strerror(errno));
+
+		CPU_CLR(cpu, &allowed_mask);
+
+		usleep(10);
+	}
+	done = true;
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	u32 cpu, rseq_cpu;
+	int r;
+
+	/* Tell stdout not to buffer its content */
+	setbuf(stdout, NULL);
+
+	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
+		    strerror(errno));
+
+	if (CPU_COUNT(&possible_mask) < 2) {
+		print_skip("Only one CPU, task migration not possible\n");
+		exit(KSFT_SKIP);
+	}
+
+	sys_rseq(0);
+
+	/*
+	 * Create and run a dummy VM that immediately exits to userspace via
+	 * GUEST_SYNC, while concurrently migrating the process by setting its
+	 * CPU affinity.
+	 */
+	vm = vm_create_default(VCPU_ID, 0, guest_code);
+
+	pthread_create(&migration_thread, NULL, migration_worker, 0);
+
+	while (!done) {
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			    "Guest failed?");
+
+		cpu = sched_getcpu();
+		rseq_cpu = READ_ONCE(__rseq.cpu_id);
+
+		/*
+		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
+		 * is stable.  This doesn't handle the case where the task is
+		 * migrated between sched_getcpu() and reading rseq, and again
+		 * between reading rseq and sched_getcpu(), but in practice no
+		 * false positives have been observed, while on the other hand
+		 * blocking migration while this thread reads CPUs messes with
+		 * the timing and prevents hitting failures on a buggy kernel.
+		 */
+		TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(),
+			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
+	}
+
+	pthread_join(migration_thread, NULL);
+
+	kvm_vm_free(vm);
+
+	sys_rseq(RSEQ_FLAG_UNREGISTER);
+
+	return 0;
+}
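For reference, the test should build and run like the other KVM selftests (standard kselftest invocation; paths assume an in-tree build):

	$ make -C tools/testing/selftests TARGETS=kvm
	$ ./tools/testing/selftests/kvm/rseq_test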
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
Signed-off-by: Sean Christopherson seanjc@google.com
[...]
+static void *migration_worker(void *ign)
+{
+	cpu_set_t allowed_mask;
+	int r, i, nr_cpus, cpu;
+
+	CPU_ZERO(&allowed_mask);
+
+	nr_cpus = CPU_COUNT(&possible_mask);
+
+	for (i = 0; i < 20000; i++) {
+		cpu = i % nr_cpus;
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		CPU_SET(cpu, &allowed_mask);
+
+		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno,
+			    strerror(errno));
+
+		CPU_CLR(cpu, &allowed_mask);
+
+		usleep(10);
+	}
+	done = true;
+	return NULL;
+}
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	u32 cpu, rseq_cpu;
+	int r;
+
+	/* Tell stdout not to buffer its content */
+	setbuf(stdout, NULL);
+
+	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
+		    strerror(errno));
+
+	if (CPU_COUNT(&possible_mask) < 2) {
+		print_skip("Only one CPU, task migration not possible\n");
+		exit(KSFT_SKIP);
+	}
+
+	sys_rseq(0);
+
+	/*
+	 * Create and run a dummy VM that immediately exits to userspace via
+	 * GUEST_SYNC, while concurrently migrating the process by setting its
+	 * CPU affinity.
+	 */
+	vm = vm_create_default(VCPU_ID, 0, guest_code);
+
+	pthread_create(&migration_thread, NULL, migration_worker, 0);
+
+	while (!done) {
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			    "Guest failed?");
+
+		cpu = sched_getcpu();
+		rseq_cpu = READ_ONCE(__rseq.cpu_id);
+
+		/*
+		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
+		 * is stable.  This doesn't handle the case where the task is
+		 * migrated between sched_getcpu() and reading rseq, and again
+		 * between reading rseq and sched_getcpu(), but in practice no
+		 * false positives have been observed, while on the other hand
+		 * blocking migration while this thread reads CPUs messes with
+		 * the timing and prevents hitting failures on a buggy kernel.
+		 */
I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id if you add a pthread mutex to protect:
sched_getcpu and __rseq_abi.cpu_id reads
vs
sched_setaffinity calls within the migration thread.
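A hypothetical sketch of that locking (names invented here; untested):

	static pthread_mutex_t migration_lock = PTHREAD_MUTEX_INITIALIZER;

	/* vcpu thread: both reads observe the same affinity window. */
	pthread_mutex_lock(&migration_lock);
	cpu = sched_getcpu();
	rseq_cpu = READ_ONCE(__rseq.cpu_id);
	pthread_mutex_unlock(&migration_lock);

	/* migration thread: a migration cannot land between the two reads. */
	pthread_mutex_lock(&migration_lock);
	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
	pthread_mutex_unlock(&migration_lock);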
Thoughts ?
Thanks,
Mathieu
+		TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(),
+			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
+	}
+
+	pthread_join(migration_thread, NULL);
+
+	kvm_vm_free(vm);
+
+	sys_rseq(RSEQ_FLAG_UNREGISTER);
+
+	return 0;
+}
--
2.33.0.rc1.237.g0d66db33f3-goog
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
Signed-off-by: Sean Christopherson seanjc@google.com
[...]
+	while (!done) {
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			    "Guest failed?");
+
+		cpu = sched_getcpu();
+		rseq_cpu = READ_ONCE(__rseq.cpu_id);
+
+		/*
+		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
+		 * is stable.  This doesn't handle the case where the task is
+		 * migrated between sched_getcpu() and reading rseq, and again
+		 * between reading rseq and sched_getcpu(), but in practice no
+		 * false positives have been observed, while on the other hand
+		 * blocking migration while this thread reads CPUs messes with
+		 * the timing and prevents hitting failures on a buggy kernel.
+		 */
I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id if you add a pthread mutex to protect:
sched_getcpu and __rseq_abi.cpu_id reads
vs
sched_setaffinity calls within the migration thread.
Thoughts ?
I tried that and couldn't reproduce the bug. That's what I attempted to call out in the blurb "blocking migration while this thread reads CPUs ... prevents hitting failures on a buggy kernel".
I considered adding arbitrary delays around the mutex to try and hit the bug, but I was worried that even if I got it "working" for this bug, the test would be too tailored to this bug and potentially miss future regressions. Letting the two threads run wild seemed like it would provide the best coverage, at the cost of potentially causing false failures.
----- On Aug 19, 2021, at 7:33 PM, Sean Christopherson seanjc@google.com wrote:
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness.
Signed-off-by: Sean Christopherson seanjc@google.com
[...]
+	while (!done) {
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			    "Guest failed?");
+
+		cpu = sched_getcpu();
+		rseq_cpu = READ_ONCE(__rseq.cpu_id);
+
+		/*
+		 * Verify rseq's CPU matches sched's CPU, and that sched's CPU
+		 * is stable.  This doesn't handle the case where the task is
+		 * migrated between sched_getcpu() and reading rseq, and again
+		 * between reading rseq and sched_getcpu(), but in practice no
+		 * false positives have been observed, while on the other hand
+		 * blocking migration while this thread reads CPUs messes with
+		 * the timing and prevents hitting failures on a buggy kernel.
+		 */
I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id if you add a pthread mutex to protect:
sched_getcpu and __rseq_abi.cpu_id reads
vs
sched_setaffinity calls within the migration thread.
Thoughts ?
I tried that and couldn't reproduce the bug. That's what I attempted to call out in the blurb "blocking migration while this thread reads CPUs ... prevents hitting failures on a buggy kernel".
I considered adding arbitrary delays around the mutex to try and hit the bug, but I was worried that even if I got it "working" for this bug, the test would be too tailored to this bug and potentially miss future regressions. Letting the two threads run wild seemed like it would provide the best coverage, at the cost of potentially causing false failures.
OK, so your point is that using mutual exclusion to ensure stability of the cpu id changes the timings too much, to a point where the issues don't reproduce. I understand that this mutex ties the migration thread timings to the vcpu thread's use of the mutex, which will reduce timings randomness, which is unwanted here.
I still really hate flakiness in tests, because then people stop caring when they fail once in a while. And with the nature of rseq, a once-in-a-while failure is a big deal. Let's see if we can use other tricks to ensure stability of the cpu id without changing timings too much.
One idea would be to use a seqcount lock. But even if we use that, I'm concerned that the very long writer critical section calling sched_setaffinity would need to be alternated with a sleep to ensure the read-side progresses. The sleep delay could be relatively small compared to the duration of the sched_setaffinity call, e.g. ratio 1:10.
static volatile uint64_t seqcnt;
The thread responsible for setting the affinity would do something like:
	for (;;) {
		atomic_inc_seq_cst(&seqcnt);
		sched_setaffinity(..., n++ % nr_cpus);
		atomic_inc_seq_cst(&seqcnt);
		usleep(1);	/* this is where read-side is allowed to progress. */
	}
And the thread reading the rseq cpu id and calling sched_getcpu():
uint64_t snapshot;
	do {
		snapshot = atomic_load(&seqcnt) & ~1;	/* force retry if odd */
		smp_rmb();
		cpu = sched_getcpu();
		rseq_cpu = READ_ONCE(__rseq.cpu_id);
		smp_rmb();
	} while (snapshot != atomic_load(&seqcnt));
So the reader retries the cpu id reads whenever sched_setaffinity is being called by the migration thread, and whenever it is preempted for more than one migration thread loop.
That should achieve our goal of providing cpu id stability without significantly changing the timings of the migration thread, given that it never blocks waiting for the reader.
Thoughts ?
Thanks,
Mathieu
On Fri, Aug 20, 2021, Mathieu Desnoyers wrote:
I still really hate flakiness in tests, because then people stop caring when they fail once in a while. And with the nature of rseq, a once-in-a-while failure is a big deal. Let's see if we can use other tricks to ensure stability of the cpu id without changing timings too much.
Yeah, zero agrument regarding flaky tests.
One idea would be to use a seqcount lock.
A sequence counter did the trick! Thanks much!
But even if we use that, I'm concerned that the very long writer critical section calling sched_setaffinity would need to be alternated with a sleep to ensure the read-side progresses. The sleep delay could be relatively small compared to the duration of the sched_setaffinity call, e.g. ratio 1:10.
I already had an arbitrary usleep(10) to let the reader make progress between sched_setaffinity() calls. Dropping it down to 1us didn't affect reproducibility, so I went with that to shave those precious cycles :-) Eliminating the delay entirely did result in no repro, which was a nice confirmation that it's needed to let the reader get back into KVM_RUN.
Thanks again!
Revert the __NR_userfaultfd syscall fallback added for KVM selftests now that x86's unistd_{32,64}.h overrides are under uapi/ and thus not in KVM selftests' search path, i.e. now that KVM gets x86 syscall numbers from the installed kernel headers.
No functional change intended.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/arch/x86/include/uapi/asm/unistd_64.h | 3 ---
 1 file changed, 3 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
index 4205ed4158bf..cb52a3a8b8fc 100644
--- a/tools/arch/x86/include/uapi/asm/unistd_64.h
+++ b/tools/arch/x86/include/uapi/asm/unistd_64.h
@@ -1,7 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NR_userfaultfd
-#define __NR_userfaultfd 282
-#endif
 #ifndef __NR_perf_event_open
 # define __NR_perf_event_open 298
 #endif
On 18/08/21 02:12, Sean Christopherson wrote:
Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME, e.g. for task migration, clears the flag without informing rseq and leads to stale data in userspace's rseq struct.
Patch 2 is a cleanup to try and make future bugs less likely. It's also a baby step towards moving and renaming tracehook_notify_resume() since it has nothing to do with tracing. It kills me to not do the move/rename as part of this series, but having a dedicated series/discussion seems more appropriate given the sheer number of architectures that call tracehook_notify_resume() and the lack of an obvious home for the code.
Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when the include path (intentionally) omits tools' uapi headers. KVM's selftests do exactly that so that they can pick up the uapi headers from the installed kernel headers, and still use various tools/ headers that mirror kernel code, e.g. linux/types.h. This allows the new test in patch 4 to reference __NR_rseq without having to manually define it.
Patch 4 is a regression test for the KVM+rseq bug.
Patch 5 is a cleanup made possible by patch 3.
Queued v3, thanks. I'll send it in a separate pull request to Linus since it touches stuff outside my usual turf.
Thanks,
Paolo