On Wed, Nov 13, 2024 at 2:31 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
On Tue, Nov 12, 2024 at 9:44 PM Doug Covelli <doug.covelli@broadcom.com> wrote:
Split irqchip should be the best tradeoff. Without it, moves from cr8 stay in the kernel, but moves to cr8 always go to userspace with a KVM_EXIT_SET_TPR exit. You also won't be able to use Intel flexpriority (in-processor accelerated TPR) because KVM does not know which bits are set in IRR. So it will be *really* every move to cr8 that goes to userspace.
Sorry to hijack this thread but is there a technical reason not to allow CR8 based accesses to the TPR (not MMIO accesses) when the in-kernel local APIC is not in use?
No worries, you're not hijacking :) The only reason is that it would be more code for a seldom used feature and anyway with worse performance. (To be clear, CR8 based accesses are allowed, but stores cause an exit in order to check the new TPR against IRR. That's because KVM's API does not have an equivalent of the TPR threshold as you point out below).
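[For illustration, a minimal sketch of the userspace side of that exchange. The kvm_run fields (cr8, request_interrupt_window) and KVM_EXIT_SET_TPR are the real uAPI; apic_set_tpr() and apic_highest_irr() stand in for a hypothetical userspace APIC model.]

#include <linux/kvm.h>
#include <sys/ioctl.h>

extern void apic_set_tpr(unsigned int tpr);  /* hypothetical userspace APIC */
extern int apic_highest_irr(void);           /* hypothetical; -1 if none pending */

static void handle_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_SET_TPR: {
		/* The guest wrote CR8; KVM mirrors the new value into
		 * run->cr8.  TPR[7:4] == CR8, so scale before comparing
		 * against the highest vector pending in IRR, which is
		 * exactly the check KVM cannot do in-kernel here. */
		int irr = apic_highest_irr();

		apic_set_tpr(run->cr8 << 4);
		if (irr >= 0 && (unsigned int)(irr >> 4) > run->cr8)
			run->request_interrupt_window = 1;
		break;
	}
	/* ... other exit reasons ... */
	}
}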
I have not really looked at the code but it seems like it could also simplify things as CR8 would be handled more uniformly regardless of who is virtualizing the local APIC.
Also I could not find these documented anywhere but with MSFT's APIC our monitor relies on extensions for trapping certain events such as INIT/SIPI plus LINT0 and SVR writes:
UINT64 X64ApicInitSipiExitTrap : 1;   // WHvRunVpExitReasonX64ApicInitSipiTrap
UINT64 X64ApicWriteLint0ExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
UINT64 X64ApicWriteLint1ExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
UINT64 X64ApicWriteSvrExitTrap : 1;   // WHvRunVpExitReasonX64ApicWriteTrap
There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are handled in the hypervisor and you can get the current state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively, plus the interrupt shadow; so there's no need for userspace to know when LINT0/LINT1 themselves change. The spurious interrupt vector register is also handled completely in kernel.
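[A sketch of that injection flow from userspace; the ioctls and kvm_run fields are the documented ones, while pic_pending_vector() is a made-up stand-in for the userspace PIC.]

#include <linux/kvm.h>
#include <sys/ioctl.h>

extern int pic_pending_vector(void);  /* hypothetical userspace PIC; -1 if none */

static void inject_lint0(int vcpu_fd, struct kvm_run *run)
{
	int vector = pic_pending_vector();

	if (vector < 0)
		return;
	if (run->ready_for_interrupt_injection && run->if_flag) {
		struct kvm_interrupt intr = { .irq = vector };

		ioctl(vcpu_fd, KVM_INTERRUPT, &intr);  /* ExtINT (LINT0) path */
	} else {
		run->request_interrupt_window = 1;     /* exit when IF opens */
	}
}

static void inject_lint1(int vcpu_fd)
{
	ioctl(vcpu_fd, KVM_NMI, 0);  /* KVM itself applies NMI blocking */
}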
I realize that KVM can handle LINT0/SVR updates themselves but our interrupt subsystem relies on knowing the current values of these registers even when not virtualizing the local APIC. I suppose we could use KVM_GET_LAPIC to sync things up on demand but that seems like it might not be great from a performance point of view.
I did not see any similar functionality for KVM. Does anything like that exist? In any case, we would be happy to add support for handling CR8 accesses w/o exiting when the in-kernel APIC is not in use, along with some sort of a way to configure the TPR threshold, if folks are not opposed to that.
As far I know everybody who's using KVM (whether proprietary or open source) has had no need for that, so I don't think it's a good idea to make the API more complex. Performance of Windows guests is going to be bad anyway with userspace APIC.
From what I have seen, the exit cost with KVM is significantly lower than with WHP/Hyper-V. I don't think performance of Windows guests with userspace APIC emulation would be bad if CR8 exits could be avoided (Linux guest perf isn't bad from what I have observed, and the main difference is the astronomical number of CR8 exits). It seems like it would be pretty decent, although I agree that for the absolute best performance you would want to use the in-kernel APIC to speed up handling of ICR/EOI writes, but those are relatively infrequent compared to CR8 accesses.
Anyway, I just saw Sean's response while writing this, and it seems he is not in favor of avoiding CR8 exits w/o the in-kernel APIC either, so I suppose we will have to look into making use of the in-kernel APIC.
Doug
Paolo
Doug
For now I think it makes sense to handle BDOOR_CMD_GET_VCPU_INFO at userlevel like we do on Windows and macOS.
BDOOR_CMD_GETTIME/BDOOR_CMD_GETTIMEFULL are similar with the former being deprecated in favor of the latter. Both do essentially the same thing which is to return the host OS's time - on Linux this is obtained via gettimeofday. I believe this is mainly used by tools to fix up the VM's time when resuming from suspend. I think it is fine to continue handling these at userlevel.
As long as the TSC is not involved it should be okay.
Paolo
Anyway, one question apart from this: is the API the same for the I/O port and hypercall backdoors?
Yeah, the calls and arguments are the same. The hypercall-based interface is an attempt to modernize the backdoor since, as you pointed out, the I/O-based interface is kind of hacky: it bypasses the normal checks for an I/O port access at CPL3. It would be nice to get rid of it, but unfortunately I don't think that will happen in the foreseeable future, as there are a lot of existing VMs out there with older SW that still uses this interface.
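[For reference, a sketch of the classic port-based call as guest code issues it. The magic number and port are the well-known public constants; register roles beyond EAX/ECX/EDX vary per command.]

#define BDOOR_MAGIC 0x564D5868u  /* 'VMXh' */
#define BDOOR_PORT  0x5658       /* 'VX'   */

static unsigned int backdoor_call(unsigned int cmd,
				  unsigned int *ebx, unsigned int *ecx)
{
	unsigned int eax = BDOOR_MAGIC;

	*ecx = cmd;
	/* An IN at CPL3 would normally #GP unless IOPL or the I/O bitmap
	 * allows it; the backdoor deliberately bypasses that check. */
	asm volatile("inl %%dx, %%eax"
		     : "+a"(eax), "+b"(*ebx), "+c"(*ecx)
		     : "d"(BDOOR_PORT));
	return eax;
}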
Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can enable the hypercall but not the I/O port.
Paolo
On 11/13/24 17:24, Doug Covelli wrote:
No worries, you're not hijacking :) The only reason is that it would be more code for a seldom used feature and anyway with worse performance. (To be clear, CR8 based accesses are allowed, but stores cause an exit in order to check the new TPR against IRR. That's because KVM's API does not have an equivalent of the TPR threshold as you point out below).
I have not really looked at the code but it seems like it could also simplify things as CR8 would be handled more uniformly regardless of who is virtualizing the local APIC.
Not much because CR8 basically does not exist at all (it's just a byte in memory) with userspace APIC. So it's not easy to make it simpler, even though it's less uniform.
That said, there is an optimization: you only get KVM_EXIT_SET_TPR if CR8 decreases.
Also I could not find these documented anywhere but with MSFT's APIC our monitor relies on extensions for trapping certain events such as INIT/SIPI plus LINT0 and SVR writes:
UINT64 X64ApicInitSipiExitTrap : 1;   // WHvRunVpExitReasonX64ApicInitSipiTrap
UINT64 X64ApicWriteLint0ExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
UINT64 X64ApicWriteLint1ExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
UINT64 X64ApicWriteSvrExitTrap : 1;   // WHvRunVpExitReasonX64ApicWriteTrap
There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are handled in the hypervisor and you can get the current state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively, plus the interrupt shadow; so there's no need for userspace to know when LINT0/LINT1 themselves change. The spurious interrupt vector register is also handled completely in kernel.
I realize that KVM can handle LINT0/SVR updates themselves but our interrupt subsystem relies on knowing the current values of these registers even when not virtualizing the local APIC. I suppose we could use KVM_GET_LAPIC to sync things up on demand but that seems like it might not be great from a performance point of view.
Ah no, you're right---you want to track the CPU that has ExtINT enabled and send KVM_INTERRUPT to that one, I guess? And you need the spurious interrupt vector register because writes can set the mask bit in LINTx, but essentially you want to trap LINT0 changes.
Something like this (missing the KVM_ENABLE_CAP and KVM_CHECK_EXTENSION code) is good, feel free to include it in your v2 (Co-developed-by and Signed-off-by me):
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5fb29ca3263b..b7dd89c99613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -122,6 +122,7 @@
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
+#define KVM_REQ_REPORT_LINT0_ACCESS		KVM_ARCH_REQ(35)
 
 #define CR0_RESERVED_BITS \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -775,6 +776,7 @@ struct kvm_vcpu_arch {
 	u64 smi_count;
 	bool at_instruction_boundary;
 	bool tpr_access_reporting;
+	bool lint0_access_reporting;
 	bool xfd_no_write_intercept;
 	u64 ia32_xss;
 	u64 microcode_version;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 88dc43660d23..0e070f447aa2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1561,6 +1561,21 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
 			   apic->divide_count));
 }
 
+static void __report_lint0_access(struct kvm_lapic *apic, u32 value)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	struct kvm_run *run = vcpu->run;
+
+	kvm_make_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu);
+	run->lint0_access.value = value;
+}
+
+static inline void report_lint0_access(struct kvm_lapic *apic, u32 value)
+{
+	if (apic->vcpu->arch.lint0_access_reporting)
+		__report_lint0_access(apic, value);
+}
+
 static void __report_tpr_access(struct kvm_lapic *apic, bool write)
 {
 	struct kvm_vcpu *vcpu = apic->vcpu;
@@ -2312,8 +2327,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 		int i;
 
 		for (i = 0; i < apic->nr_lvt_entries; i++) {
-			kvm_lapic_set_reg(apic, APIC_LVTx(i),
-				kvm_lapic_get_reg(apic, APIC_LVTx(i)) | APIC_LVT_MASKED);
+			u32 old = kvm_lapic_get_reg(apic, APIC_LVTx(i));
+			kvm_lapic_set_reg(apic, APIC_LVTx(i), old | APIC_LVT_MASKED);
+			if (i == 0 && !(old & APIC_LVT_MASKED))
+				report_lint0_access(apic, old | APIC_LVT_MASKED);
 		}
 		apic_update_lvtt(apic);
 		atomic_set(&apic->lapic_timer.pending, 0);
@@ -2352,6 +2369,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 		if (!kvm_apic_sw_enabled(apic))
 			val |= APIC_LVT_MASKED;
 		val &= apic_lvt_mask[index];
+		if (index == 0 && val != kvm_lapic_get_reg(apic, reg))
+			report_lint0_access(apic, val);
 		kvm_lapic_set_reg(apic, reg, val);
 		break;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0d3dc3b7ef6..2b039b372c3f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10879,6 +10879,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_vcpu_flush_tlb_guest(vcpu);
 #endif
 
+		if (kvm_check_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_LINT0_ACCESS;
+			r = 0;
+			goto out;
+		}
 		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
 			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
 			r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..ec97727f9de4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -178,6 +178,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_LINT0_ACCESS     40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -283,6 +284,10 @@ struct kvm_run {
 			__u64 flags;
 		} hypercall;
+		/* KVM_EXIT_LINT0_ACCESS */
+		struct {
+			__u32 value;
+		} lint0_access;
 		/* KVM_EXIT_TPR_ACCESS */
 		struct {
 			__u64 rip;
For LINT1, it should be less performance critical; if it's possible to just go through all vCPUs, and do KVM_GET_LAPIC to check who you should send a KVM_NMI to, then I'd do that. I'd also accept a patch that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor if it's useful for you.
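[A sketch of that loop; the ioctls are real, and the LVT offset and bit layout follow the SDM xAPIC register map that struct kvm_lapic_state mirrors.]

#include <linux/kvm.h>
#include <sys/ioctl.h>

#define APIC_LVT1       0x360
#define APIC_LVT_MASKED (1u << 16)
#define APIC_DM_MASK    (7u << 8)   /* delivery mode, bits 10:8 */
#define APIC_DM_NMI     (4u << 8)

static void deliver_lint1(const int *vcpu_fds, int nr_vcpus)
{
	struct kvm_lapic_state lapic;
	int i;

	for (i = 0; i < nr_vcpus; i++) {
		if (ioctl(vcpu_fds[i], KVM_GET_LAPIC, &lapic) < 0)
			continue;
		__u32 lvt1 = *(__u32 *)&lapic.regs[APIC_LVT1];

		if (!(lvt1 & APIC_LVT_MASKED) &&
		    (lvt1 & APIC_DM_MASK) == APIC_DM_NMI)
			ioctl(vcpu_fds[i], KVM_NMI, 0);
	}
}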
And since I've been proven wrong already, what do you need INIT/SIPI for?
Paolo
On Wed, Nov 13, 2024 at 12:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
[...]
For LINT1, it should be less performance critical; if it's possible to just go through all vCPUs, and do KVM_GET_LAPIC to check who you should send a KVM_NMI to, then I'd do that. I'd also accept a patch that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor if it's useful for you.
Thanks for the patch - I'll give it a try but it might not be right away.
And since I've been proven wrong already, what do you need INIT/SIPI for?
I don't think this one is as critical. I believe the reason it was added was so that we can synchronize startup of the APs with execution of the BSP for guests that do not do a good job of that (Windows).
Doug
Paolo
On Thu, Nov 14, 2024 at 10:45 AM Doug Covelli <doug.covelli@broadcom.com> wrote:
[...]
We were able to get the in-kernel APIC working with our code using the split IRQ chip option with our virtual EFI FW, even w/o the traps for SVR and LVT0 writes. Performance of Windows VMs is greatly improved as expected. Unfortunately our ancient legacy BIOS will not work with > 1 VCPU due to lack of support for IPIs with the archaic remote read delivery mode, which it uses to discover APs by attempting to read their APIC ID register. MSFT WHP supports this functionality via an option, WHvPartitionPropertyCodeApicRemoteReadSupport.
Changing our legacy BIOS is not an option, so in order to support Windows VMs with the legacy BIOS with decent performance we would either need to add support for remote reads of the APIC ID register to KVM, or support CR8 accesses w/o exiting when the in-kernel APIC is not in use. Do you have a preference?
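[For context, roughly the sequence such a BIOS performs, per the xAPIC register layout. The exact handshake (status polling, and how the ICR vector field selects which remote register to read) is the archaic, under-documented part, so treat this as a sketch.]

#define APIC_ICR_LO   0x300
#define APIC_ICR_HI   0x310
#define APIC_RRR      0x0C0        /* Remote Read Register */
#define APIC_DM_REMRD (3u << 8)    /* delivery mode 011b: remote read */

static unsigned int remote_read_apic_reg(volatile char *apic, unsigned int dest)
{
	*(volatile unsigned int *)(apic + APIC_ICR_HI) = dest << 24;
	*(volatile unsigned int *)(apic + APIC_ICR_LO) = APIC_DM_REMRD;
	/* ... poll remote read status, then fetch the target's value: */
	return *(volatile unsigned int *)(apic + APIC_RRR);
}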
Thanks, Doug
On Thu, Dec 12, 2024, Doug Covelli wrote:
[...]
We were able to get the in-kernel APIC working with our code using the split IRQ chip option with our virtual EFI FW, even w/o the traps for SVR and LVT0 writes. Performance of Windows VMs is greatly improved as expected. Unfortunately our ancient legacy BIOS will not work with > 1 VCPU due to lack of support for IPIs with the archaic remote read delivery mode, which it uses to discover APs by attempting to read their APIC ID register. MSFT WHP supports this functionality via an option, WHvPartitionPropertyCodeApicRemoteReadSupport.
Changing our legacy BIOS is not an option, so in order to support Windows VMs with the legacy BIOS with decent performance we would either need to add support for remote reads of the APIC ID register to KVM, or support CR8 accesses w/o exiting when the in-kernel APIC is not in use. Do you have a preference?
I didn't quite follow the CR8 access thing. If the choice is between emulating Remote Read IPIs and using a userspace local APIC, then I vote with both hands for emulating Remote Reads, especially if we can do a half-assed version that provides only what your crazy BIOS needs :-)
The biggest wrinkle I can think of is that KVM uses the Remote Read IPI encoding for a paravirt vCPU kick feature, but I doubt that's used by Windows guests and so can be sacrificed on the Altar of Ancient BIOS.
On Wed, Dec 18, 2024 at 4:44 AM Sean Christopherson <seanjc@google.com> wrote:
Changing our legacy BIOS is not an option, so in order to support Windows VMs with the legacy BIOS with decent performance we would either need to add support for remote reads of the APIC ID register to KVM, or support CR8 accesses w/o exiting when the in-kernel APIC is not in use. Do you have a preference?
I didn't quite follow the CR8 access thing. If the choice is between emulating Remote Read IPIs and using a userspace local APIC, then I vote with both hands for emulating Remote Reads, especially if we can do a half-assed version that provides only what your crazy BIOS needs :-)
Absolutely. Not quite userspace local APIC - VMware only needs userspace traps on CR8 access but yeah, it would not be great to have that. Remote read support is totally acceptable and hopefully win-win for VMware too.
The biggest wrinkle I can think of is that KVM uses the Remote Read IPI encoding for a paravirt vCPU kick feature, but I doubt that's used by Windows guests and so can be sacrificed on the Altar of Ancient BIOS.
That's easy, the existing code can be wrapped with
if (guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT))
The remote-read hack is not even supposed to be used by the guest (it's used internally by kvm_pv_kick_cpu_op).
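[In context, the wrap would sit roughly here in __apic_accept_irq() in arch/x86/kvm/lapic.c; the kick path below paraphrases the existing code, and kvm_apic_remote_read() is a hypothetical helper for the new behavior.]

	case APIC_DM_REMRD:
		if (guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT)) {
			/* existing paravirt vCPU-kick semantics */
			result = 1;
			vcpu->arch.pv.pv_unhalted = 1;
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
		} else {
			/* hypothetical: emulate an actual remote read */
			result = kvm_apic_remote_read(apic, vector);
		}
		break;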
Paolo
On Tue, Jan 7, 2025 at 12:10 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
[...]
OK. It seems like fully embracing the in-kernel APIC is the way to go, especially considering it really simplifies using KVM's support for nested virtualization. Speaking of nested virtualization, we have been working on adding support for that and would like to propose a couple of changes:

- Add an option for L0 to handle backdoor accesses from CPL3 code running in L2. On a #GP, nested_vmx_l0_wants_exit can check if this option is enabled, and KVM can handle the #GP like it would if it had been from L1 (exit to userlevel iff it is a backdoor access, otherwise deliver the fault to L2); a rough sketch follows after this list. When combined with enable_vmware_backdoor this will allow L0 to optionally handle backdoor accesses from CPL3 code running in L2. This is needed for cases such as running VMware Tools in a Windows VM with VBS enabled. For other cases, such as running tools in a Windows VM in an ESX VM, we still want L1 to handle the backdoor accesses from L2.

- Extend KVM_EXIT_MEMORY_FAULT to cover permission faults (e.g. the guest attempting to write to a page that has been protected by userlevel calling mprotect). This is useful for cases where we want synchronous detection of guest writes, such as lazy snapshots (dirty page tracking is no good for this case). Currently permission faults result in KVM_RUN returning EFAULT, which we handle by interpreting the instruction, as we do not know the guest physical address associated with the fault. That works fine for normal VMs, but emulating L2 instructions presents a number of challenges, so it would be best to avoid it when nested virtualization is enabled. This is the only case I have found where our userlevel code has to interpret instructions from L2.
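[For the first item, a rough sketch of the nested.c side, assuming the option lands as a per-VM flag set via KVM_ENABLE_CAP; l2_backdoor_reporting is a made-up name, while is_gp_fault() and enable_vmware_backdoor exist in KVM today.]

	/* in nested_vmx_l0_wants_exit() */
	case EXIT_REASON_EXCEPTION_NMI:
		intr_info = vmx_get_intr_info(vcpu);
		/* ... existing NMI/#PF/#DB checks ... */
		if (is_gp_fault(intr_info) && enable_vmware_backdoor &&
		    vcpu->kvm->arch.l2_backdoor_reporting)  /* hypothetical */
			return true;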
Any thoughts on these proposed changes?
Doug
On Mon, Feb 3, 2025 at 5:35 PM Doug Covelli <doug.covelli@broadcom.com> wrote:
OK. It seems like fully embracing the in-kernel APIC is the way to go especially considering it really simplifies using KVM's support for nested virtualization. Speaking of nested virtualization we have been working on adding support for that and would like to propose a couple of changes:
- Add an option for L0 to handle backdoor accesses from CPL3 code running in L2.
On a #GP nested_vmx_l0_wants_exit can check if this option is enabled and KVM can handle the #GP like it would if it had been from L1 (exit to userlevel iff it is a backdoor access, otherwise deliver the fault to L2). When combined with enable_vmware_backdoor this will allow L0 to optionally handle backdoor accesses from CPL3 code running in L2. This is needed for cases such as running VMware Tools in a Windows VM with VBS enabled. For other cases such as running tools in a Windows VM in an ESX VM we still want L1 to handle the backdoor accesses from L2.
I think this makes sense and could be an argument to KVM_ENABLE_CAP.
- Extend KVM_EXIT_MEMORY_FAULT for permission faults (e.g. the guest attempting
to write to a page that has been protected by userlevel calling mprotect). This is useful for cases where we want synchronous detection of guest writes such as lazy snapshots (dirty page tracking is no good for this case). Currently permission faults result in KVM_RUN returning EFAULT which we handle by interpreting the instruction as we do not know the guest physical address associated with the fault.
Yes, this makes sense too, though you might want to look into userfaultfd as well.
We had something planned using attributes, but I don't see any issue extending it to EFAULT. Maybe it would have to be yet another KVM_ENABLE_CAP; considering that it would break your existing code, there might be someone else in the wild doing it.
Paolo
On Mon, Feb 3, 2025 at 1:22 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
[...]
We had something planned using attributes, but I don't see any issue extending it to EFAULT. Maybe it would have to be yet another KVM_ENABLE_CAP; considering that it would break your existing code, there might be someone else in the wild doing it.
It looks like KVM_EXIT_MEMORY_FAULT was implemented in such a way that it won't break existing code:
Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it accompanies a return code of ‘-1’, not ‘0’! errno will always be set to EFAULT or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume kvm_run.exit_reason is stale/undefined for all other error numbers.
That being said we could certainly make this opt-in if that is preferable.
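[Concretely, the handling on our side would reduce to something like this; the kvm_run.memory_fault layout is as documented, and handle_protected_write() is a hypothetical resolver.]

#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

extern void handle_protected_write(__u64 gpa, __u64 size);  /* hypothetical */

static int run_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* gpa/size identify the access; no instruction decoding */
		handle_protected_write(run->memory_fault.gpa,
				       run->memory_fault.size);
		return 0;
	}
	return ret;
}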
Doug
On Mon, Feb 03, 2025, Doug Covelli wrote:
[...]
It looks like KVM_EXIT_MEMORY_FAULT was implemented in such a way that it won't break existing code:
Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it accompanies a return code of ‘-1’, not ‘0’! errno will always be set to EFAULT or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume kvm_run.exit_reason is stale/undefined for all other error numbers.
That being said we could certainly make this opt-in if that is preferable.
-EFAULT isn't the problem, KVM not being able to return useful information in all situations is the issue. Specifically, "guest" accesses that are emulated by KVM are problematic, because the -EFAULT from e.g. __kvm_write_guest_page() is disconnected from the code that actually kicks out to userspace. In that case, userspace will get KVM_EXIT_MMIO, not -EFAULT. There are more problems beyond KVM_EXIT_MMIO vs. -EFAULT, e.g. instructions that perform multiple memory accesses, "failures" that are squashed and never propagated to userspace (PV features tend to do this), page splits, etc.
In general, I don't expect most KVM access to guest memory to Just Work, as I doubt KVM will behave as you want.
We spent a lot of time trying to sort out a viable approach in the context of the USERFAULT_ON_MISSING series[1], and ultimately gave up (ignoring that we postponed the entire series)[2], because we decided that fully solving KVM accesses would require an absurd amount of effort and churn, and wasn't at all necessary for the userfault use case.
What exactly needs to happen on "synchronous detection of guest writes"? One idea (which may be horribly flawed as I have put *very* little thought into it) would be to implement a module (or KVM extension) that utilizes KVM's "external" write-tracking APIs to get the synchronous notifications (see arch/x86/include/asm/kvm_page_track.h).
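[For reference, a sketch of hooking those write-tracking APIs, with signatures per arch/x86/include/asm/kvm_page_track.h in recent kernels. Note this is an in-kernel interface, usable from a module but not from a userspace VMM as-is.]

#include <asm/kvm_page_track.h>

static void vmm_track_write(gpa_t gpa, const u8 *new, int bytes,
			    struct kvm_page_track_notifier_node *node)
{
	/* called synchronously when KVM emulates a write to a tracked gfn */
}

static struct kvm_page_track_notifier_node vmm_tracker = {
	.track_write = vmm_track_write,
};

static int vmm_start_tracking(struct kvm *kvm, gfn_t gfn)
{
	int r = kvm_page_track_register_notifier(kvm, &vmm_tracker);

	if (r)
		return r;
	return kvm_write_track_add_gfn(kvm, gfn);  /* write-protects the gfn */
}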
[1] https://lore.kernel.org/all/ZIn6VQSebTRN1jtX@google.com
[2] https://lore.kernel.org/all/ZR88w9W62qsZDro-@google.com
On 2/3/25 20:41, Sean Christopherson wrote:
-EFAULT isn't the problem, KVM not being able to return useful information in all situations is the issue.
Yes, that's why I don't want it to be an automatically opted-in API. If incremental improvements are possible, it may be useful to allow interested userspace to enable it early. For example...
Specifically, "guest" accesses that are emulated by KVM are problematic, because the -EFAULT from e.g. __kvm_write_guest_page() is disconnected from the code that actually kicks out to userspace. In that case, userspace will get KVM_EXIT_MMIO, not -EFAULT. There are more problems beyond KVM_EXIT_MMIO vs. -EFAULT, e.g. instructions that perform multiple memory accesses,
those are obviously synchronous and I expect VMware to handle them already.
That said, my preferred solution is to just use userfaultfd, which is synchronous by definition.
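[A minimal sketch of the write-protect flavor of userfaultfd, using the real uAPI with error handling collapsed: register the guest memory in WP mode, arm protection, then consume WP fault events synchronously.]

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int wp_protect(void *addr, unsigned long len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,  /* arm write protection */
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;
	/* read(uffd, ...) now yields UFFD_PAGEFAULT_FLAG_WP events; resolve
	 * one by re-issuing UFFDIO_WRITEPROTECT with .mode = 0. */
	return uffd;
}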
Paolo
"failures" that are squashed and never propagated to userspace (PV features tend to do this), page splits, etc.
On Mon, Feb 03, 2025, Paolo Bonzini wrote:
[...]
That said, my preferred solution is to just use userfaultfd, which is synchronous by definition.
Oh, right, userfaultfd would be far better than piggybacking write-tracking.
On Mon, Feb 3, 2025 at 2:53 PM Sean Christopherson <seanjc@google.com> wrote:
[...]
Oh, right, userfaultfd would be far better than piggybacking write-tracking.
Thanks. We will look into using userfaultfd.
Doug