On Thu, Jul 18, 2024, H. Peter Anvin wrote:
On July 12, 2024 8:12:51 AM PDT, Sean Christopherson <seanjc@google.com> wrote:
On Wed, Jul 10, 2024, Xin3 Li wrote:
On Wed, Feb 07, 2024, Xin Li wrote:
Switch MSR_IA32_FRED_RSP0 between host and guest in vmx_prepare_switch_to_{host,guest}().
MSR_IA32_FRED_RSP0 is used during ring 3 event delivery only, thus KVM, running on ring 0, can run safely with guest FRED RSP0, i.e., no need to switch between host/guest FRED RSP0 during VM entry and exit.
KVM should switch to host FRED RSP0 before returning to user level, and switch to guest FRED RSP0 before entering guest mode.
Heh, if only KVM had a framework that was specifically designed for context switching MSRs on return to userspace. Translation: please use the user_return_msr() APIs.
IIUC the user return MSR framework works for MSRs that are per CPU constants, but like MSR_KERNEL_GS_BASE, MSR_IA32_FRED_RSP0 is a per *task* constant, thus we can't use it.
Ah, in that case, the changelog is very misleading and needs to be fixed. Alternatively, is the desired RSP0 value tracked anywhere other than the MSR? E.g. if it's somewhere in task_struct, then kvm_on_user_return() would restore the current task's desired RSP0. Even if we don't get fancy, avoiding the RDMSR to get the current task's value would be nice.
Hm, perhaps the right thing to do is to always invoke this function on a context switch, in case the context switch happens before the return to user space?
Actually, if the _TIF_NEED_RSP0_LOAD doesn't provide a meaningful benefit (or y'all just don't want it :-) ), what KVM could do is restore MSR_IA32_FRED_RSP0 when putting the vCPU and the vCPU is not being scheduled out, i.e. if and only if KVM can't guarantee a context switch.
If the vCPU/task is being scheduled out, update_task_stack() is guaranteed to write MSR_IA32_FRED_RSP0 with the new task's value.
On top of kvm/next, which adds the necessary vcpu->scheduled_out:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5c6bb26463e8..4532ae943f2a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1338,15 +1338,9 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)

 		wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);

-		if (guest_can_use(vcpu, X86_FEATURE_FRED)) {
-			/*
-			 * MSR_IA32_FRED_RSP0 is top of task stack, which never changes.
-			 * Thus it should be initialized only once.
-			 */
-			if (unlikely(vmx->msr_host_fred_rsp0 == 0))
-				vmx->msr_host_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
-			wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
-		}
+		if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+		    guest_can_use(vcpu, X86_FEATURE_FRED))
+			wrmsrns(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
 #else
 	savesegment(fs, fs_sel);
 	savesegment(gs, gs_sel);
@@ -1392,9 +1386,13 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);

-	if (guest_can_use(&vmx->vcpu, X86_FEATURE_FRED)) {
+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    guest_can_use(&vmx->vcpu, X86_FEATURE_FRED)) {
 		vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
-		wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_host_fred_rsp0);
+
+		if (!vcpu->scheduled_out)
+			wrmsrns(MSR_IA32_FRED_RSP0,
+				(unsigned long)task_stack_page(task) + THREAD_SIZE);
 	}
 #endif
 	load_fixmap_gdt(raw_smp_processor_id());