On Thu, Jul 18, 2024, H. Peter Anvin wrote:
On July 12, 2024 8:12:51 AM PDT, Sean Christopherson <seanjc@google.com> wrote:
On Wed, Jul 10, 2024, Xin3 Li wrote:
On Wed, Feb 07, 2024, Xin Li wrote:
Switch MSR_IA32_FRED_RSP0 between host and guest in vmx_prepare_switch_to_{host,guest}().
MSR_IA32_FRED_RSP0 is used during ring 3 event delivery only, thus KVM, running on ring 0, can run safely with guest FRED RSP0, i.e., there is no need to switch between host/guest FRED RSP0 during VM entry and exit.
KVM should switch to host FRED RSP0 before returning to user level, and switch to guest FRED RSP0 before entering guest mode.
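As an illustrative sketch of that approach (not the posted patch; the vcpu_vmx field names below are assumptions):

/*
 * Illustrative sketch only: context switch MSR_IA32_FRED_RSP0 in KVM's
 * software switch paths.  The msr_host_fred_rsp0/msr_guest_fred_rsp0
 * fields are hypothetical names, not necessarily those in the patch.
 */
static void fred_switch_rsp0_to_guest(struct vcpu_vmx *vmx)
{
	/* Stash the host (per-task) value, then install the guest's. */
	rdmsrl(MSR_IA32_FRED_RSP0, vmx->msr_host_fred_rsp0);
	wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
}

static void fred_switch_rsp0_to_host(struct vcpu_vmx *vmx)
{
	/* Preserve the guest value for the next entry, restore the host's. */
	rdmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
	wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_host_fred_rsp0);
}

These would be called from vmx_prepare_switch_to_guest() and vmx_prepare_switch_to_host() respectively.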
Heh, if only KVM had a framework that was specifically designed for context switching MSRs on return to userspace. Translation: please use the user_return_msr() APIs.
IIUC the user return MSR framework works for MSRs that are per-CPU constants, but MSR_IA32_FRED_RSP0, like MSR_KERNEL_GS_BASE, is a per-*task* constant, thus we can't use it.
Ah, in that case, the changelog is very misleading and needs to be fixed. Alternatively, is the desired RSP0 value tracked anywhere other than the MSR? E.g. if it's somewhere in task_struct, then kvm_on_user_return() would restore the current task's desired RSP0. Even if we don't get fancy, avoiding the RDMSR to get the current task's value would be nice.
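For illustration, a hedged sketch of the no-RDMSR idea, assuming the desired RSP0 is always the top of the current task's kernel stack (the helper name is hypothetical):

#include <linux/sched/task_stack.h>
#include <asm/msr.h>

/*
 * Hypothetical helper: recompute the current task's FRED RSP0 instead
 * of reading it back with RDMSR, assuming the kernel always programs
 * MSR_IA32_FRED_RSP0 to the top of the task's kernel stack.
 */
static void fred_restore_task_rsp0(void)
{
	unsigned long rsp0 = (unsigned long)task_stack_page(current) + THREAD_SIZE;

	wrmsrl(MSR_IA32_FRED_RSP0, rsp0);
}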
Hm, perhaps the right thing to do is to always invoke this function before a context switch, if the context switch happens before the return to user space?
Actually, if the _TIF_NEED_RSP0_LOAD doesn't provide a meaningful benefit (or y'all just don't want it :-) ),
We want it 😊.
My concern was adding an extra check of (ti_work & _TIF_NEED_RSP0_LOAD) to a hot function, arch_exit_to_user_mode_prepare(). HPA looked at the function and suggested testing ti_work for zero first and only then processing its individual bits:
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index fb2809b20b0a..4c78b99060b5 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 						  unsigned long ti_work)
 {
-	if (ti_work & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
+	if (ti_work) {
+		if (ti_work & _TIF_USER_RETURN_NOTIFY)
+			fire_user_return_notifiers();
 
-	if (unlikely(ti_work & _TIF_IO_BITMAP))
-		tss_update_io_bitmap();
+		if (unlikely(ti_work & _TIF_IO_BITMAP))
+			tss_update_io_bitmap();
 
-	fpregs_assert_state_consistent();
-	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
-		switch_fpu_return();
+		fpregs_assert_state_consistent();
+		if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
+			switch_fpu_return();
+	}
 
 #ifdef CONFIG_COMPAT
 	/*
Based on that, I measured how many out of every one million ti_work values are zero: in kernel build tests it's over 99%, i.e., unlikely(ti_work) holds.

When booting a KVM guest it drops to 75%, which is expected. Once the guest is up and running a kernel build inside, it's 99% again.
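One way such a ratio could be sampled (hypothetical counters and hook placement, not necessarily how the numbers above were collected):

#include <linux/percpu.h>

/*
 * Hypothetical instrumentation: count zero vs. total ti_work values
 * seen on exit to user mode; the ratio tells how often the whole
 * ti_work block above would be skipped.
 */
static DEFINE_PER_CPU(u64, ti_work_total);
static DEFINE_PER_CPU(u64, ti_work_zero);

static __always_inline void account_ti_work(unsigned long ti_work)
{
	raw_cpu_inc(ti_work_total);
	if (!ti_work)
		raw_cpu_inc(ti_work_zero);
}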
So at least this patch looks like low-hanging fruit, and I have sent it to Intel 0day for broader performance tests.
As context switches are far less frequent than exits to user mode, I do NOT expect writing MSR_IA32_FRED_RSP0 on exit to user mode instead of on context switch to make a difference, especially on top of the above patch.
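If the exit-to-user-mode variant is pursued anyway, a hedged sketch of how a _TIF_NEED_RSP0_LOAD bit could slot into the ti_work-guarded block above (reusing the hypothetical helper sketched earlier):

/*
 * Sketch only: consume a hypothetical _TIF_NEED_RSP0_LOAD bit on exit
 * to user mode, mirroring the _TIF_NEED_FPU_LOAD handling above.
 */
static __always_inline void arch_handle_rsp0_load(unsigned long ti_work)
{
	if (unlikely(ti_work & _TIF_NEED_RSP0_LOAD))
		fred_restore_task_rsp0();
}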