Currently, the situation when the guest accesses MMIO during event delivery is handled differently on VMX and SVM: on VMX, KVM returns an internal error with suberror = KVM_INTERNAL_ERROR_DELIVERY_EV, whereas SVM simply goes into an infinite loop trying to deliver the event again and again.
This patch series eliminates the difference by returning a KVM internal error with suberror = KVM_INTERNAL_ERROR_DELIVERY_EV when the guest performs MMIO during event delivery, for both VMX and SVM.
It also introduces a selftest test case which covers the error handling for MMIO during event delivery.
Ivan Orlov (3):
  KVM: x86, vmx: Add function for event delivery error generation
  KVM: vmx, svm, mmu: Process MMIO during event delivery
  selftests: KVM: Add test case for MMIO during event delivery
 arch/x86/include/asm/kvm_host.h             |  8 ++++
 arch/x86/kvm/mmu/mmu.c                      | 15 +++++-
 arch/x86/kvm/svm/svm.c                      |  4 ++
 arch/x86/kvm/vmx/vmx.c                      | 32 ++++---------
 arch/x86/kvm/x86.c                          | 22 +++++++++
 .../selftests/kvm/set_memory_region_test.c  | 46 +++++++++++++++++++
 6 files changed, 104 insertions(+), 23 deletions(-)
Extract KVM_INTERNAL_ERROR_DELIVERY_EV internal error generation into the SVM/VMX-agnostic 'kvm_prepare_ev_delivery_failure_exit' function, as it is done for KVM_INTERNAL_ERROR_EMULATION. The order of internal.data array entries is preserved as is, so it is going to be the same on VMX platforms (vectoring info, full exit reason, exit qualification, GPA if error happened due to MMIO and last_vmentry_cpu of the vcpu).
Having it as a separate function will help us to avoid code duplication when handling the MMIO during event delivery error on SVM.
No functional change intended.
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx/vmx.c          | 15 +++------------
 arch/x86/kvm/x86.c              | 22 ++++++++++++++++++++++
 3 files changed, 27 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..348daba424dd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2060,6 +2060,8 @@ void __kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu, u64 *data,
 				      u8 ndata);
 void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
 
+void kvm_prepare_ev_delivery_failure_exit(struct kvm_vcpu *vcpu, gpa_t gpa, bool is_mmio);
+
 void kvm_enable_efer_bits(u64);
 bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
 int kvm_get_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 *data);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c67e448c6ebd..afd785e7f3a3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6550,19 +6550,10 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
 	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
-		int ndata = 3;
+		gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+		bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
 
-		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-		vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV;
-		vcpu->run->internal.data[0] = vectoring_info;
-		vcpu->run->internal.data[1] = exit_reason.full;
-		vcpu->run->internal.data[2] = vmx_get_exit_qual(vcpu);
-		if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG) {
-			vcpu->run->internal.data[ndata++] =
-				vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-		}
-		vcpu->run->internal.data[ndata++] = vcpu->arch.last_vmentry_cpu;
-		vcpu->run->internal.ndata = ndata;
+		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
 		return 0;
 	}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..8ee67fc23e5d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8828,6 +8828,28 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_prepare_emulation_failure_exit);
 
+void kvm_prepare_ev_delivery_failure_exit(struct kvm_vcpu *vcpu, gpa_t gpa, bool is_mmio)
+{
+	struct kvm_run *run = vcpu->run;
+	int ndata = 0;
+	u32 reason, intr_info, error_code;
+	u64 info1, info2;
+
+	kvm_x86_call(get_exit_info)(vcpu, &reason, &info1, &info2, &intr_info, &error_code);
+
+	run->internal.data[ndata++] = info2;
+	run->internal.data[ndata++] = reason;
+	run->internal.data[ndata++] = info1;
+	if (is_mmio)
+		run->internal.data[ndata++] = (u64)gpa;
+	run->internal.data[ndata++] = vcpu->arch.last_vmentry_cpu;
+
+	run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+	run->internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV;
+	run->internal.ndata = ndata;
+}
+EXPORT_SYMBOL_GPL(kvm_prepare_ev_delivery_failure_exit);
+
 static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type)
 {
 	struct kvm *kvm = vcpu->kvm;
"KVM: VMX:" for the scope. See "Shortlog" in Documentation/process/maintainer-kvm-x86.rst
On Fri, Sep 27, 2024, Ivan Orlov wrote:
Extract KVM_INTERNAL_ERROR_DELIVERY_EV internal error generation into the SVM/VMX-agnostic 'kvm_prepare_ev_delivery_failure_exit' function, as it is done for KVM_INTERNAL_ERROR_EMULATION.
Use the changelog to provide a human readable summary of the change. There are definitely situations where calling out functions, variables, defines, etc. by name is necessary, but this isn't one such situation.
The order of internal.data array entries is preserved as is, so it is going to be the same on VMX platforms (vectoring info, full exit reason, exit qualification, GPA if error happened due to MMIO and last_vmentry_cpu of the vcpu).
Similar to the above, let the code speak. The "No functional change intended" makes it clear that the intent is to preserve the order and behavior.
Having it as a separate function will help us to avoid code duplication
Avoid pronouns as much as possible, and no "we" or "us" as a hard rule. E.g. this can all be distilled down to:
-- Extract VMX's code for reporting an unhandleable VM-Exit during event delivery to userspace, so that the boilerplate code can be shared by SVM.
No functional change intended. --
when handling the MMIO during event delivery error on SVM.
No functional change intended.
Signed-off-by: Ivan Orlov iorlov@amazon.com
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx/vmx.c          | 15 +++------------
 arch/x86/kvm/x86.c              | 22 ++++++++++++++++++++++
 3 files changed, 27 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..348daba424dd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2060,6 +2060,8 @@ void __kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu, u64 *data,
 				      u8 ndata);
 void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
+void kvm_prepare_ev_delivery_failure_exit(struct kvm_vcpu *vcpu, gpa_t gpa, bool is_mmio);
Please wrap at 80 columns. While checkpatch doesn't complain until 100, my preference is to default to wrapping at 80, and poke past 80 only when it yields more readable code (which is obviously subjective, but it shouldn't be too hard to figure out KVM x86's preferred style).
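E.g. purely for illustration, the declaration wrapped at 80 columns would look something like:

	void kvm_prepare_ev_delivery_failure_exit(struct kvm_vcpu *vcpu, gpa_t gpa,
						  bool is_mmio);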
 void kvm_enable_efer_bits(u64);
 bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
 int kvm_get_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 *data);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c67e448c6ebd..afd785e7f3a3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6550,19 +6550,10 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
 	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
int ndata = 3;
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
There's no need for is_mmio, just pass INVALID_GPA when the GPA isn't known.
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV;
vcpu->run->internal.data[0] = vectoring_info;
vcpu->run->internal.data[1] = exit_reason.full;
vcpu->run->internal.data[2] = vmx_get_exit_qual(vcpu);
if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG) {
vcpu->run->internal.data[ndata++] =
vmcs_read64(GUEST_PHYSICAL_ADDRESS);
}
vcpu->run->internal.data[ndata++] = vcpu->arch.last_vmentry_cpu;
vcpu->run->internal.ndata = ndata;
		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
		return 0;
	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..8ee67fc23e5d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8828,6 +8828,28 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_prepare_emulation_failure_exit);
+void kvm_prepare_ev_delivery_failure_exit(struct kvm_vcpu *vcpu, gpa_t gpa, bool is_mmio)
Hmm, I don't love the name. I really don't like that event is abbreviated, and I suspect many readers will misinterpret "event delivery failure" to mean that _KVM_ failed to deliver an event. Which is kinda sorta true, but it's more accurate to say that the CPU triggered a VM-Exit when vectoring/delivering an event, and KVM doesn't have code to robustly handle the situation.
Maybe kvm_prepare_event_vectoring_exit()? Vectoring is quite specific in Intel terminology.
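E.g. something along these lines (a rough sketch only, exact signature TBD), with the is_mmio flag dropped per the above:

	void kvm_prepare_event_vectoring_exit(struct kvm_vcpu *vcpu, gpa_t gpa);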
+{
- struct kvm_run *run = vcpu->run;
- int ndata = 0;
- u32 reason, intr_info, error_code;
- u64 info1, info2;
Reverse fir/x-mas tree for variables. See "Coding Style" in Documentation/process/maintainer-kvm-x86.rst (which will redirect you to Documentation/process/maintainer-tip.rst, specifically "Variable declarations").
- kvm_x86_call(get_exit_info)(vcpu, &reason, &info1, &info2, &intr_info, &error_code);
Wrap. Though calling back into vendor code is silly. Pass the necessary info as parameters. E.g. error_code and intr_info are unused, so the above is wasteful and weird.
- run->internal.data[ndata++] = info2;
- run->internal.data[ndata++] = reason;
- run->internal.data[ndata++] = info1;
- if (is_mmio)
And this is where keying off MMIO gets weird.
run->internal.data[ndata++] = (u64)gpa;
- run->internal.data[ndata++] = vcpu->arch.last_vmentry_cpu;
- run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
- run->internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV;
- run->internal.ndata = ndata;
+} +EXPORT_SYMBOL_GPL(kvm_prepare_ev_delivery_failure_exit);
static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type) { struct kvm *kvm = vcpu->kvm; -- 2.43.0
On Fri, Oct 11, 2024, Sean Christopherson wrote:
- kvm_x86_call(get_exit_info)(vcpu, &reason, &info1, &info2, &intr_info, &error_code);
Wrap. Though calling back into vendor code is silly. Pass the necessary info as parameters. E.g. error_code and intr_info are unused, so the above is wasteful and weird.
Ah, but the next patch invokes this from common code, i.e. can't pass in the necessary info. _That_ is definitely worth calling out in the changelog.
Hi Sean,
On Fri, Oct 11, 2024 at 04:20:46PM -0700, Sean Christopherson wrote:
"KVM: VMX:" for the scope. See "Shortlog" in Documentation/process/maintainer-kvm-x86.rst
Ah, will update in the next version, thanks!
On Fri, Sep 27, 2024, Ivan Orlov wrote:
Extract KVM_INTERNAL_ERROR_DELIVERY_EV internal error generation into the SVM/VMX-agnostic 'kvm_prepare_ev_delivery_failure_exit' function, as it is done for KVM_INTERNAL_ERROR_EMULATION.
Use the changelog to provide a human readable summary of the change. There are definitely situations where calling out functions, variables, defines, etc. by name is necessary, but this isn't one such situation.
The order of internal.data array entries is preserved as is, so it is going to be the same on VMX platforms (vectoring info, full exit reason, exit qualification, GPA if error happened due to MMIO and last_vmentry_cpu of the vcpu).
Similar to the above, let the code speak. The "No functional change intended" makes it clear that the intent is to preserve the order and behavior.
Having it as a separate function will help us to avoid code duplication
Avoid pronouns as much as possible, and no "we" or "us" as a hard rule. E.g. this can all be distilled down to:
Yeah, makes sense. Will reformulate the message in the next version to consider all of the changes you suggested.
-- Extract VMX's code for reporting an unhandleable VM-Exit during event delivery to userspace, so that the boilerplate code can be shared by SVM.
No functional change intended.
Awesome, thanks for the example!
Please wrap at 80 columns. While checkpatch doesn't complain until 100, my preference is to default to wrapping at 80, and poke past 80 only when it yields more readable code (which is obviously subjective, but it shouldn't be too hard to figure out KVM x86's preferred style).
Alright, will do, thanks! These rules vary from one subsystem to another, and I'll try to keep the style consistent in the future.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c67e448c6ebd..afd785e7f3a3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6550,19 +6550,10 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) exit_reason.basic != EXIT_REASON_APIC_ACCESS && exit_reason.basic != EXIT_REASON_TASK_SWITCH && exit_reason.basic != EXIT_REASON_NOTIFY)) {
int ndata = 3;
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
There's no need for is_mmio, just pass INVALID_GPA when the GPA isn't known.
Ah alright, then we definitely don't need an is_mmio field. I assume we can't do MMIO at GPA=0, right?
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 83fe0a78146f..8ee67fc23e5d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8828,6 +8828,28 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_prepare_emulation_failure_exit); +void kvm_prepare_ev_delivery_failure_exit(struct kvm_vcpu *vcpu, gpa_t gpa, bool is_mmio)
Hmm, I don't love the name. I really don't like that event is abbreviated, and I suspect many readers will be misinterpret "event delivery failure" to mean that _KVM_ failed to deliver an event. Which is kinda sorta true, but it's more accurate to say that the CPU triggered a VM-Exit when vectoring/delivery an event, and KVM doesn't have code to robustly handle the situation.
Maybe kvm_prepare_event_vectoring_exit()? Vectoring is quite specific in Intel terminology.
Yep, sounds good, I like that the name you suggested doesn't contain 'failure' part as essentially it is not a failure but an MMIO exit. Will update in V2.
+{
- struct kvm_run *run = vcpu->run;
- int ndata = 0;
- u32 reason, intr_info, error_code;
- u64 info1, info2;
Reverse fir/x-mas tree for variables. See "Coding Style" in Documentation/process/maintainer-kvm-x86.rst (which will redirect you to Documentation/process/maintainer-tip.rst, specifically "Variable declarations").
Great, didn't know about that, thanks!
- kvm_x86_call(get_exit_info)(vcpu, &reason, &info1, &info2, &intr_info, &error_code);
Wrap. Though calling back into vendor code is silly. Pass the necessary info as parameters. E.g. error_code and intr_info are unused, so the above is wasteful and weird.
I use it here as this function gets called from the common for svm/vmx code in the next patch, but as I can see from the next email you've already noticed that :)
- run->internal.data[ndata++] = info2;
- run->internal.data[ndata++] = reason;
- run->internal.data[ndata++] = info1;
- if (is_mmio)
And this is where keying off MMIO gets weird.
We still need to exclude one of the data elements when GPA is not known to be backwards compatible, so we can get rid of the `is_mmio` argument, but not from this `if` (unfortunately).
Thank you so much for the review!
Kind regards, Ivan Orlov
On Tue, Oct 15, 2024, Ivan Orlov wrote:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c67e448c6ebd..afd785e7f3a3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6550,19 +6550,10 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) exit_reason.basic != EXIT_REASON_APIC_ACCESS && exit_reason.basic != EXIT_REASON_TASK_SWITCH && exit_reason.basic != EXIT_REASON_NOTIFY)) {
int ndata = 3;
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
There's no need for is_mmio, just pass INVALID_GPA when the GPA isn't known.
Ah alright, then we definitely don't need an is_mmio field. I assume we can't do MMIO at GPA=0, right?
Wrong :-)
From an architectural perspective, GPA=0 is not special in any way. E.g. prior to L1TF, Linux would happily use the page with PFN=0.
On 10/16/24 22:05, Sean Christopherson wrote:
On Tue, Oct 15, 2024, Ivan Orlov wrote:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c67e448c6ebd..afd785e7f3a3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6550,19 +6550,10 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) exit_reason.basic != EXIT_REASON_APIC_ACCESS && exit_reason.basic != EXIT_REASON_TASK_SWITCH && exit_reason.basic != EXIT_REASON_NOTIFY)) {
int ndata = 3;
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
There's no need for is_mmio, just pass INVALID_GPA when the GPA isn't known.
Ah alright, then we definitely don't need an is_mmio field. I assume we can't do MMIO at GPA=0, right?
Wrong :-)
Then getting rid of `is_mmio` will make distinguishing between a vectoring error due to MMIO with GPA=0 and a non-MMIO vectoring error quite hard for the error reporting.
Passing INVALID_GPA into the userspace due to non-mmio vectoring error will change the existing internal.data order, but I can do it if it's fine. Sorry for nitpicking :)
From an architectural perspective, GPA=0 is not special in any way. E.g. prior to L1TF, Linux would happily use the page with PFN=0.
Cool, didn't know about this vulnerability... Thanks for the explanation!
On Wed, Oct 16, 2024, Ivan Orlov wrote:
On 10/16/24 22:05, Sean Christopherson wrote:
On Tue, Oct 15, 2024, Ivan Orlov wrote:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c67e448c6ebd..afd785e7f3a3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6550,19 +6550,10 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) exit_reason.basic != EXIT_REASON_APIC_ACCESS && exit_reason.basic != EXIT_REASON_TASK_SWITCH && exit_reason.basic != EXIT_REASON_NOTIFY)) {
int ndata = 3;
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
There's no need for is_mmio, just pass INVALID_GPA when the GPA isn't known.
Ah alright, then we definitely don't need an is_mmio field. I assume we can't do MMIO at GPA=0, right?
Wrong :-)
Then getting rid of `is_mmio` will make distinguishing between vectoring error due to MMIO with GPA=0 and non-mmio vectoring error quite hard for the error reporti
Passing INVALID_GPA into the userspace due to non-mmio vectoring error will change the existing internal.data order, but I can do it if it's fine. Sorry for nitpicking :)
KVM's existing ABI is rather awful, though arguably the intent was that there is no ABI, i.e. that KVM is dumping info to try to be helpful. E.g. the existing behavior is that data[3] contains a GPA only for EPT_MISCONFIG, but for everything else, data[3] contains last_vmentry_cpu.
And because it's so awful, I doubt any userspace actually has code that acts on the layout of data[]. So, I suspect we can do the simple and sane thing, and fill data[3] with -1ull if the GPA is invalid, and then document that that's the behavior (if we're feeling generous).
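E.g. the documented layout could read something like this (a sketch only, following the current VMX ordering and the -1ull idea above; the exact SVM mapping is still TBD):

	data[0] = vectoring info
	data[1] = full exit reason
	data[2] = exit qualification
	data[3] = GPA of the access, or -1ull if the GPA is not known
	data[4] = last_vmentry_cpu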
Currently, the situation when the guest accesses MMIO during event delivery is handled differently on VMX and SVM: on VMX, KVM returns an internal error with suberror = KVM_INTERNAL_ERROR_DELIVERY_EV, whereas SVM simply goes into an infinite loop trying to deliver the event again and again.
Such a situation could happen when the exception occurs with guest IDTR (or GDTR) descriptor base pointing to an MMIO address.
Even with fixes for infinite loops on TDP failures applied, the problem still exists on SVM.
Eliminate the SVM/VMX difference by returning a KVM internal error with suberror = KVM_INTERNAL_ERROR_DELIVERY_EV for both SVM and VMX. As we don't have a reliable way to detect MMIO operation on SVM before actually looking at the GPA, move the problem detection into the common KVM x86 layer (into the kvm_mmu_page_fault function) and add the PFERR_EVT_DELIVERY flag which gets set in the SVM/VMX specific vmexit handler to signal that we are in the middle of the event delivery.
This way we won't introduce much overhead for VMX platform either, as the situation when the guest accesses MMIO during event delivery is quite rare and shouldn't happen frequently.
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++++
 arch/x86/kvm/mmu/mmu.c          | 15 ++++++++++++++-
 arch/x86/kvm/svm/svm.c          |  4 ++++
 arch/x86/kvm/vmx/vmx.c          | 21 +++++++++------------
 4 files changed, 33 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 348daba424dd..a1088239c9f5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -282,6 +282,12 @@ enum x86_intercept_stage;
 #define PFERR_PRIVATE_ACCESS   BIT_ULL(49)
 #define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
 
+/*
+ * EVT_DELIVERY is a KVM-defined flag used to indicate that a fault occurred
+ * during event delivery.
+ */
+#define PFERR_EVT_DELIVERY     BIT_ULL(50)
+
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
 /*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e081f785fb23..36e25a6ba364 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6120,8 +6120,21 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 			return -EFAULT;
 
 		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
-		if (r == RET_PF_EMULATE)
+		if (r == RET_PF_EMULATE) {
+			/*
+			 * Check if the guest is accessing MMIO during event delivery. For
+			 * instance, it could happen if the guest sets IDT / GDT descriptor
+			 * base to point to an MMIO address. We can't deliver such an event
+			 * without VMM intervention, so return a corresponding internal error
+			 * instead (otherwise, vCPU will fall into infinite loop trying to
+			 * deliver the event again and again).
+			 */
+			if (error_code & PFERR_EVT_DELIVERY) {
+				kvm_prepare_ev_delivery_failure_exit(vcpu, cr2_or_gpa, true);
+				return 0;
+			}
 			goto emulate;
+		}
 	}
 
 	if (r == RET_PF_INVALID) {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..93ce8c3d02f3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2059,6 +2059,10 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
+	/* Check if we have events awaiting delivery */
+	if (svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK)
+		error_code |= PFERR_EVT_DELIVERY;
+
 	/*
 	 * WARN if hardware generates a fault with an error code that collides
 	 * with KVM-defined sythentic flags. Clear the flags and continue on,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index afd785e7f3a3..bbe1126c3c7d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5828,6 +5828,11 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
+	u64 error_code = PFERR_RSVD_MASK;
+
+	/* Do we have events awaiting delivery? */
+	error_code |= (to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK)
+		      ? PFERR_EVT_DELIVERY : 0;
 
 	if (vmx_check_emulate_instruction(vcpu, EMULTYPE_PF, NULL, 0))
 		return 1;
@@ -5843,7 +5848,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}
 
-	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
 static int handle_nmi_window(struct kvm_vcpu *vcpu)
@@ -6536,24 +6541,16 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		return 0;
 	}
 
-	/*
-	 * Note:
-	 * Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
-	 * delivery event since it indicates guest is accessing MMIO.
-	 * The vm-exit can be triggered again after return to guest that
-	 * will cause infinite loop.
-	 */
 	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
 	    (exit_reason.basic != EXIT_REASON_EXCEPTION_NMI &&
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
 	     exit_reason.basic != EXIT_REASON_PML_FULL &&
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
-	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
+	     exit_reason.basic != EXIT_REASON_NOTIFY &&
+	     exit_reason.basic != EXIT_REASON_EPT_MISCONFIG)) {
 		gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-		bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
-
-		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
+		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, false);
 		return 0;
 	}
Same complaints on the shortlog scope.
On Fri, Sep 27, 2024, Ivan Orlov wrote:
Currently, the situation when guest accesses MMIO during event delivery is handled differently in VMX and SVM: on VMX KVM returns internal error with suberror = KVM_INTERNAL_ERROR_DELIVERY_EV, when SVM simply goes into infinite loop trying to deliver an event again and again.
Such a situation could happen when the exception occurs with guest IDTR (or GDTR) descriptor base pointing to an MMIO address.
Even with fixes for infinite loops on TDP failures applied, the problem still exists on SVM.
Eliminate the SVM/VMX difference by returning a KVM internal error with suberror = KVM_INTERNAL_ERROR_DELIVERY_EV for both SVM and VMX. As we don't have a reliable way to detect MMIO operation on SVM before actually looking at the GPA, move the problem detection into the common KVM x86 layer (into the kvm_mmu_page_fault function) and add the PFERR_EVT_DELIVERY flag which gets set in the SVM/VMX specific vmexit handler to signal that we are in the middle of the event delivery.
This way we won't introduce much overhead for VMX platform either, as the situation when the guest accesses MMIO during event delivery is quite rare and shouldn't happen frequently.
Signed-off-by: Ivan Orlov iorlov@amazon.com
arch/x86/include/asm/kvm_host.h | 6 ++++++ arch/x86/kvm/mmu/mmu.c | 15 ++++++++++++++- arch/x86/kvm/svm/svm.c | 4 ++++ arch/x86/kvm/vmx/vmx.c | 21 +++++++++------------ 4 files changed, 33 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 348daba424dd..a1088239c9f5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -282,6 +282,12 @@ enum x86_intercept_stage; #define PFERR_PRIVATE_ACCESS BIT_ULL(49) #define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS) +/*
- EVT_DELIVERY is a KVM-defined flag used to indicate that a fault occurred
- during event delivery.
- */
+#define PFERR_EVT_DELIVERY BIT_ULL(50)
This is very clearly a synthetic flag. See the mask above, and the warnings that fire. I think it's a moot point though (see below).
/* apic attention bits */ #define KVM_APIC_CHECK_VAPIC 0 /* diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e081f785fb23..36e25a6ba364 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6120,8 +6120,21 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err return -EFAULT; r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
if (r == RET_PF_EMULATE)
if (r == RET_PF_EMULATE) {
/*
* Check if the guest is accessing MMIO during event delivery. For
* instance, it could happen if the guest sets IDT / GDT descriptor
* base to point to an MMIO address. We can't deliver such an event
Too many pronouns.
* without VMM intervention, so return a corresponding internal error
* instead (otherwise, vCPU will fall into infinite loop trying to
* deliver the event again and again).
*/
if (error_code & PFERR_EVT_DELIVERY) {
Hmm, I'm 99% certain handling error in this location is wrong, and I'm also pretty sure it's unnecessary. Or rather, the synthetic error code is unnecessary.
It's wrong because this path specifically handles "cached" MMIO, i.e. emulated MMIO that is triggered by a special MMIO SPTE. KVM should punt to userspace on _any_ MMIO emulation. KVM has gotten away with the flaw because SVM is completely broken, and VMX can always generate reserved EPTEs. But with SVM, on CPUs with MAXPHYADDR=52, KVM can't generate a reserved #PF, i.e. can't do cached MMIO, and so I'm pretty sure your test would fail on those CPUs since they'll never come down this path.
Heh, though I bet the introduction of RET_PF_WRITE_PROTECTED has regressed shadow paging on CPUs with PA52.
Anyways, the synthetic PFERR flag is unnecessary because the information is readily available to {vmx,svm}_check_emulate_instruction(). Ha! And EMULTYPE_WRITE_PF_TO_SP means vendor code can even precisely identify MMIO.
I think another X86EMUL_* return type is needed, but that's better than a synthetic #PF error code flag.
kvm_prepare_ev_delivery_failure_exit(vcpu, cr2_or_gpa, true);
return 0;
			}
			goto emulate;
		}
	}
if (r == RET_PF_INVALID) { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9df3e1e5ae81..93ce8c3d02f3 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2059,6 +2059,10 @@ static int npf_interception(struct kvm_vcpu *vcpu) u64 fault_address = svm->vmcb->control.exit_info_2; u64 error_code = svm->vmcb->control.exit_info_1;
- /* Check if we have events awaiting delivery */
No "we". And this is inaccurate. The event isn't _awaiting_ delivery, the CPU was smack dab in the middle of delivering the event.
- if (svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK)
error_code |= PFERR_EVT_DELIVERY;
- /*
- WARN if hardware generates a fault with an error code that collides
- with KVM-defined sythentic flags. Clear the flags and continue on,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index afd785e7f3a3..bbe1126c3c7d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5828,6 +5828,11 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) static int handle_ept_misconfig(struct kvm_vcpu *vcpu) { gpa_t gpa;
- u64 error_code = PFERR_RSVD_MASK;
- /* Do we have events awaiting delivery? */
- error_code |= (to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK)
? PFERR_EVT_DELIVERY : 0;
if (vmx_check_emulate_instruction(vcpu, EMULTYPE_PF, NULL, 0)) return 1; @@ -5843,7 +5848,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) return kvm_skip_emulated_instruction(vcpu); }
- return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
- return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
} static int handle_nmi_window(struct kvm_vcpu *vcpu) @@ -6536,24 +6541,16 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) return 0; }
-	/*
-	 * Note:
-	 * Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
-	 * delivery event since it indicates guest is accessing MMIO.
-	 * The vm-exit can be triggered again after return to guest that
-	 * will cause infinite loop.
-	 */
 	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
 	    (exit_reason.basic != EXIT_REASON_EXCEPTION_NMI &&
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
 	     exit_reason.basic != EXIT_REASON_PML_FULL &&
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
-	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
+	     exit_reason.basic != EXIT_REASON_NOTIFY &&
+	     exit_reason.basic != EXIT_REASON_EPT_MISCONFIG)) {
Changing the behavior of EPT_MISCONFIG belongs in a separate commit.
Huh, and that's technically a bug fix. If userspace _creates_ a memslot, KVM doesn't eagerly zap MMIO SPTEs and instead relies on vcpu_match_mmio_gen() to force kvm_mmu_page_fault() down the actual page fault path. If the guest somehow manages to generate an access to the new page while vectoring an event, KVM will spuriously exit to userspace instead of trying to fault-in the new page.
It's _ridiculously_ contrived, but technically a bug.
Ugh, and the manual call to vmx_check_emulate_instruction() in handle_ept_misconfig() is similarly flawed, though encountering that is even more contrived as that only affects accesses from SGX enclaves.
Hmm, and looking at all of this, SVM doesn't take advantage of KVM_FAST_MMIO_BUS. Unless I'm forgetting some fundamental incompatibility, SVM can do fast MMIO so long as next_rip is valid.
Anyways, no need to deal with vmx_check_emulate_instruction() or fast MMIO, I'll tackle that in a separate series. But for this series, please do the EPT misconfig in a separate patch from fixing SVM. E.g. extract the helper, convert VMX to the new flow, and then teach SVM to do the same.
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
Blank newline after variable declarations.
kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, false);
		return 0;
	}
All in all, I think this is the basic gist? Definitely feel free to propose a better name than X86EMUL_UNHANDLEABLE_VECTORING.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..7a96bf5af015 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4800,6 +4800,10 @@ static int svm_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
 	bool smep, smap, is_user;
 	u64 error_code;
 
+	if (svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK &&
+	    is_emul_type_mmio(emul_type))
+		return X86EMUL_UNHANDLEABLE_VECTORING;
+
 	/* Emulation is always possible when KVM has access to all guest state. */
 	if (!sev_guest(vcpu->kvm))
 		return X86EMUL_CONTINUE;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1b9cc1032010..43bc73b60c2b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1693,6 +1693,10 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
 int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
 				  void *insn, int insn_len)
 {
+	if (to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK &&
+	    is_emul_type_mmio(emul_type))
+		return X86EMUL_UNHANDLEABLE_VECTORING;
+
 	/*
 	 * Emulation of instructions in SGX enclaves is impossible as RIP does
 	 * not point at the failing instruction, and even if it did, the code
@@ -6536,13 +6540,6 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		return 0;
 	}
 
-	/*
-	 * Note:
-	 * Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
-	 * delivery event since it indicates guest is accessing MMIO.
-	 * The vm-exit can be triggered again after return to guest that
-	 * will cause infinite loop.
-	 */
 	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
 	    (exit_reason.basic != EXIT_REASON_EXCEPTION_NMI &&
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
@@ -6550,10 +6547,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
 	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
-		gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-		bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
-
-		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
+		kvm_prepare_ev_delivery_failure_exit(vcpu, INVALID_GPA);
 		return 0;
 	}
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ee67fc23e5d..d5cd254aaca7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9122,6 +9122,11 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	if (r == X86EMUL_RETRY_INSTR || r == X86EMUL_PROPAGATE_FAULT)
 		return 1;
 
+	if (r == X86EMUL_UNHANDLEABLE_EVENT_VECTORING) {
+		kvm_prepare_event_vectoring_exit(vcpu, cr2_or_gpa);
+		return 0;
+	}
+
 	WARN_ON_ONCE(r != X86EMUL_UNHANDLEABLE);
 	return handle_emulation_failure(vcpu, emulation_type);
 }
On 10/12/24 01:05, Sean Christopherson wrote:
* without VMM intervention, so return a corresponding internal error
* instead (otherwise, vCPU will fall into infinite loop trying to
* deliver the event again and again).
*/
if (error_code & PFERR_EVT_DELIVERY) {
Hmm, I'm 99% certain handling error in this location is wrong, and I'm also pretty sure it's unnecessary. Or rather, the synthetic error code is unnecessary.
It's wrong because this path specifically handles "cached" MMIO, i.e. emulated MMIO that is triggered by a special MMIO SPTE. KVM should punt to userspace on _any_ MMIO emulation. KVM has gotten away with the flaw because SVM is completely broken, and VMX can always generate reserved EPTEs. But with SVM, on CPUs with MAXPHYADDR=52, KVM can't generate a reserved #PF, i.e. can't do cached MMIO, and so I'm pretty sure your test would fail on those CPUs since they'll never come down this path.
Ah, alright, I see... Probably, I need to test the next version with enable_mmio_caching=false as well.
Heh, though I bet the introduction of RET_PF_WRITE_PROTECTED has regressed shadow paging on CPUs with PA52.
Is it because it doesn't process write-protected gfn correctly if it is in MMIO range when mmio caching is disabled?
Anyways, the synthetic PFERR flag is unnecessary because the information is readily available to {vmx,svm}_check_emulate_instruction(). Ha! And EMULTYPE_WRITE_PF_TO_SP means vendor code can even precisely identify MMIO.
Hmm, do you mean EMULTYPE_PF? It looks like EMULTYPE_WRITE_PF_TO_SP has nothing to do with MMIO...
I thought about processing the error in check_emulate_instruction as it seems logical, however I hadn't found a way to detect MMIO without page walking on SVM. I'll validate that EMULTYPE_PF gets set in all of the MMIO cases and move the handling into this function in V2 if it works.
I think another X86EMUL_* return type is needed, but that's better than a synthetic #PF error code flag.
If I understand correctly, you suggest returning this new X86EMUL_<something> code from {svm,vmx}_check_emulate_instruction and process it in the common code, right? I agree that it's much better than handling the error in MMU code. We are gonna return this return type from vendor code and handle it in the common code this way, which is neat!
-	/*
-	 * Note:
-	 * Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
-	 * delivery event since it indicates guest is accessing MMIO.
-	 * The vm-exit can be triggered again after return to guest that
-	 * will cause infinite loop.
-	 */
 	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
 	    (exit_reason.basic != EXIT_REASON_EXCEPTION_NMI &&
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
 	     exit_reason.basic != EXIT_REASON_PML_FULL &&
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
-	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
+	     exit_reason.basic != EXIT_REASON_NOTIFY &&
+	     exit_reason.basic != EXIT_REASON_EPT_MISCONFIG)) {
Changing the behavior of EPT_MISCONFIG belongs in a separate commit.
I will extract the vmx-specific changes into separate commit in V2, thanks!
Huh, and that's technically a bug fix. If userspace _creates_ a memslot, KVM doesn't eagerly zap MMIO SPTEs and instead relies on vcpu_match_mmio_gen() to force kvm_mmu_page_fault() down the actual page fault path. If the guest somehow manages to generate an access to the new page while vectoring an event, KVM will spuriously exit to userspace instead of trying to fault-in the new page.
It's _ridiculously_ contrived, but technically a bug.
That's amazing, I finally introduced an unintentional bugfix (usually it's other way around) :D
Ugh, and the manual call to vmx_check_emulate_instruction() in handle_ept_misconfig() is similarly flawed, though encountering that is even more contrived as that only affects accesses from SGX enclaves.
Hmm, and looking at all of this, SVM doesn't take advantage of KVM_FAST_MMIO_BUS. Unless I'm forgetting some fundamental incompatibility, SVM can do fast MMIO so long as next_rip is valid.
Anyways, no need to deal with vmx_check_emulate_instruction() or fast MMIO, I'll tackle that in a separate series. But for this series, please do the EPT misconfig in a separate patch from fixing SVM. E.g. extract the helper, convert VMX to the new flow, and then teach SVM to do the same.
Hmm, implementing KVM_FAST_MMIO_BUS for SVM sounds like an interesting thing to do, please let me know if I could help. By the way, why can't we move the call to kvm_io_bus_write into the common code (e.g. MMU)? It would remove the need of implementing KVM_FAST_MMIO_BUS specifically for each vendor.
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
Blank newline after variable declarations.
kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, false);
		return 0;
	}
All in all, I think this is the basic gist? Definitely feel free to propose a better name than X86EMUL_UNHANDLEABLE_VECTORING.
It sounds OK, but maybe something more precise would work, like X86EMUL_VECTORING_IO_NEEDED (by analogy with X86EMUL_IO_NEEDED)?
Thanks a lot for the review.
On Wed, Oct 16, 2024, Ivan Orlov wrote:
On 10/12/24 01:05, Sean Christopherson wrote:
* without VMM intervention, so return a corresponding internal error
* instead (otherwise, vCPU will fall into infinite loop trying to
* deliver the event again and again).
*/
if (error_code & PFERR_EVT_DELIVERY) {
Hmm, I'm 99% certain handling error in this location is wrong, and I'm also pretty sure it's unnecessary. Or rather, the synthetic error code is unnecessary.
It's wrong because this path specifically handles "cached" MMIO, i.e. emulated MMIO that is triggered by a special MMIO SPTE. KVM should punt to userspace on _any_ MMIO emulation. KVM has gotten away with the flaw because SVM is completely broken, and VMX can always generate reserved EPTEs. But with SVM, on CPUs with MAXPHYADDR=52, KVM can't generate a reserved #PF, i.e. can't do cached MMIO, and so I'm pretty sure your test would fail on those CPUs since they'll never come down this path.
Ah, alright, I see... Probably, I need to test the next version with enable_mmio_caching=false as well.
Heh, though I bet the introduction of RET_PF_WRITE_PROTECTED has regressed shadow paging on CPUs with PA52.
Is it because it doesn't process write-protected gfn correctly if it is in MMIO range when mmio caching is disabled?
Ignore this, I was thinking lack of cached MMIO SPTEs would result in no EPT Misconfig, but it's shadow paging, there is no EPT.
Anyways, the synthetic PFERR flag is unnecessary because the information is readily available to {vmx,svm}_check_emulate_instruction(). Ha! And EMULTYPE_WRITE_PF_TO_SP means vendor code can even precisely identify MMIO.
Hmm, do you mean EMULTYPE_PF? It looks like EMULTYPE_WRITE_PF_TO_SP has nothing to do with MMIO...
Nope. Well, both. EMULTYPE_PF is set if *any* page fault triggers emulation. EMULTYPE_WRITE_PF_TO_SP is set if emulation was triggered by a write protection violation due to write-tracking, and write-tracking requires an underlying memslot. I.e. if EMULTYPE_WRITE_PF_TO_SP is set, then emulation *can't* be for emulated MMIO.
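FWIW, the is_emul_type_mmio() used in the sketch from my previous mail could be as simple as (illustrative only, name and placement TBD):

	static inline bool is_emul_type_mmio(int emul_type)
	{
		/*
		 * Emulation triggered by a page fault, but not by a write-protection
		 * violation on a write-tracked page (which requires a memslot and
		 * thus can't be MMIO), is treated as emulated MMIO.
		 */
		return (emul_type & EMULTYPE_PF) &&
		       !(emul_type & EMULTYPE_WRITE_PF_TO_SP);
	}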
I thought about processing the error in check_emulate_instruction as it seems logical, however I hadn't found a way to detect MMIO without page walking on SVM. I'll validate that EMULTYPE_PF gets set in all of the MMIO cases and move the handling into this function in V2 if it works.
I think another X86EMUL_* return type is needed, but that's better than a synthetic #PF error code flag.
If I understand correctly, you suggest returning this new X86EMUL_<something> code from {svm,vmx}_check_emulate_instruction and process it in the common code, right? I agree that it's much better than handling the error in MMU code. We are gonna return this return type from vendor code and handle it in the common code this way, which is neat!
Yep, exactly.
Ugh, and the manual call to vmx_check_emulate_instruction() in handle_ept_misconfig() is similarly flawed, though encountering that is even more contrived as that only affects accesses from SGX enclaves.
Hmm, and looking at all of this, SVM doesn't take advantage of KVM_FAST_MMIO_BUS. Unless I'm forgetting some fundamental incompatibility, SVM can do fast MMIO so long as next_rip is valid.
Anyways, no need to deal with vmx_check_emulate_instruction() or fast MMIO, I'll tackle that in a separate series. But for this series, please do the EPT misconfig in a separate patch from fixing SVM. E.g. extract the helper, convert VMX to the new flow, and then teach SVM to do the same.
Hmm, implementing KVM_FAST_MMIO_BUS for SVM sounds like an interesting thing to do, please let me know if I could help. By the way, why can't we move the call to kvm_io_bus_write into the common code (e.g. MMU)? It would remove the need of implementing KVM_FAST_MMIO_BUS specifically for each vendor.
That would work too, but vendor code needs to be aware of "fast" MMIO no matter what, because there are vendor specific conditions that make fast MMIO impossible (KVM needs the CPU to provide the instruction length, otherwise KVM needs to emulate the instruction in order to decode it, which makes fast MMIO not-fast).
gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
bool is_mmio = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG;
Blank newline after variable declarations.
kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, is_mmio);
		kvm_prepare_ev_delivery_failure_exit(vcpu, gpa, false);
		return 0;
	}
All in all, I think this is the basic gist? Definitely feel free to propose a better name than X86EMUL_UNHANDLEABLE_VECTORING.
It sounds OK, but maybe something more precise would work, like X86EMUL_VECTORING_IO_NEEDED (by analogy with X86EMUL_IO_NEEDED)?
Hmm, I definitely want X86EMUL_UNHANDLEABLE_XXX, so that it's super clear that emulation failed. Maybe X86EMUL_UNHANDLEABLE_IO_VECTORING? Sadly, I think we need VECTORING (or similar) in the name, because it will be morphed to KVM_INTERNAL_ERROR_DELIVERY_EV. E.g. simply X86EMUL_UNHANDLEABLE_IO would be nice, but is wildly misleading as there are other scenarios where KVM can't handled emulated MMIO during instruction emulation. :-/
Extend the 'set_memory_region_test' with a test case which covers the MMIO during event delivery error handling. The test case

1) Tries to set an IDT descriptor base to point to an MMIO address
2) Generates a #GP
3) Verifies that we got a correct exit reason (KVM_EXIT_INTERNAL_ERROR) and
   suberror code (KVM_INTERNAL_ERROR_DELIVERY_EV)
4) Verifies that we got a correct "faulty" GPA in internal.data[3]
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
---
 .../selftests/kvm/set_memory_region_test.c | 46 +++++++++++++++++++
 1 file changed, 46 insertions(+)
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index a8267628e9ed..e9e97346edf1 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -553,6 +553,51 @@ static void test_add_overlapping_private_memory_regions(void)
 	close(memfd);
 	kvm_vm_free(vm);
 }
+
+static const struct desc_ptr faulty_idt_desc = {
+	.address = MEM_REGION_GPA,
+	.size = 0xFFF,
+};
+
+static void guest_code_faulty_idt_desc(void)
+{
+	__asm__ __volatile__("lidt %0"::"m"(faulty_idt_desc));
+
+	/* Generate a #GP by dereferencing a non-canonical address */
+	*((uint8_t *)0xDEADBEEFDEADBEEFULL) = 0x1;
+
+	/* We should never reach this point */
+	GUEST_ASSERT(0);
+}
+
+/*
+ * This test tries to point the IDT descriptor base to an MMIO address. This action
+ * should cause a KVM internal error, so the VMM could handle such situations gracefully.
+ */
+static void test_faulty_idt_desc(void)
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+
+	pr_info("Testing a faulty IDT descriptor pointing to an MMIO address\n");
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code_faulty_idt_desc);
+	virt_map(vm, MEM_REGION_GPA, MEM_REGION_GPA, 1);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_INTERNAL_ERROR);
+	TEST_ASSERT(vcpu->run->internal.suberror == KVM_INTERNAL_ERROR_DELIVERY_EV,
+		    "Unexpected suberror = %d", vcpu->run->internal.suberror);
+	TEST_ASSERT(vcpu->run->internal.ndata > 4, "Unexpected internal error data array size = %d",
+		    vcpu->run->internal.ndata);
+
+	/* The "faulty" GPA address should be = IDT base + offset of the GP vector */
+	TEST_ASSERT(vcpu->run->internal.data[3] == MEM_REGION_GPA +
+		    GP_VECTOR * sizeof(struct idt_entry),
+		    "Unexpected GPA = %llx", vcpu->run->internal.data[3]);
+
+	kvm_vm_free(vm);
+}
 #endif
 
 int main(int argc, char *argv[])
@@ -568,6 +613,7 @@ int main(int argc, char *argv[])
 	 * KVM_RUN fails with ENOEXEC or EFAULT.
 	 */
 	test_zero_memory_regions();
+	test_faulty_idt_desc();
 #endif
 
 	test_invalid_memory_region_flags();
On Fri, Sep 27, 2024, Ivan Orlov wrote:
Extend the 'set_memory_region_test' with a test case which covers the MMIO during event delivery error handling. The test case
- Tries to set an IDT descriptor base to point to an MMIO address
- Generates a #GP
- Verifies that we got a correct exit reason (KVM_EXIT_INTERNAL_ERROR) and suberror code (KVM_INTERNAL_ERROR_DELIVERY_EV)
- Verifies that we got a corrent "faulty" GPA in internal.data[3]
Signed-off-by: Ivan Orlov iorlov@amazon.com
.../selftests/kvm/set_memory_region_test.c | 46 +++++++++++++++++++ 1 file changed, 46 insertions(+)
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c index a8267628e9ed..e9e97346edf1 100644 --- a/tools/testing/selftests/kvm/set_memory_region_test.c +++ b/tools/testing/selftests/kvm/set_memory_region_test.c @@ -553,6 +553,51 @@ static void test_add_overlapping_private_memory_regions(void) close(memfd); kvm_vm_free(vm); }
+static const struct desc_ptr faulty_idt_desc = {
- .address = MEM_REGION_GPA,
- .size = 0xFFF,
+};
There's no reason this needs to be global, i.e. declare it in the function, on the stack.
+static void guest_code_faulty_idt_desc(void) +{
- __asm__ __volatile__("lidt %0"::"m"(faulty_idt_desc));
It's not "faulty". It specifically points at MMIO. That is _very_ different than a "faulty" address, because an actual fault when vectoring an event would lead to triple fault shutdown. And a benefit of declaring the descriptor locally is that you don't need to come up with a descriptive name :-) E.g.
	const struct desc_ptr idt_desc = {
		.address = MEM_REGION_GPA,
		.size = 0xfff,
	};
And it's probably worth adding a lidt() helper in processor.h (in a separate commit, because there's two other users that can be converted when it's added).
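E.g. something along these lines for processor.h (a sketch only; the final form and location are up to you):

	static inline void lidt(const struct desc_ptr *ptr)
	{
		/* Load the IDT register from the provided descriptor. */
		__asm__ __volatile__("lidt %0" :: "m"(*ptr));
	}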
- /* Generate a #GP by dereferencing a non-canonical address */
- *((uint8_t *)0xDEADBEEFDEADBEEFULL) = 0x1;
Hmm, I could have sworn KVM-Unit-Tests' NONCANONICAL got pulled into selftests. Please do that as part of the test, e.g. add this to processor.h
#define NONCANONICAL 0xaaaaaaaaaaaaaaaaull
- /* We should never reach this point */
No pronouns. Yes, it's nitpicky, but "we" gets _very_ ambiguous when "we" could mean the admin, the user, the VMM, KVM, the guest, etc.
- GUEST_ASSERT(0);
+}
+/*
- This test tries to point the IDT descriptor base to an MMIO address.
There is no try. Do, or do not :-)
Translation: just state what the code does, don't hedge.
This action
Wrap at 89.
- should cause a KVM internal error, so the VMM could handle such situations gracefully.
Heh, don't editorialize what a VMM might do in comments. For changelogs it's often helpful, as it provides justification and context for _why_ that is the behavior. But for a selftest, just state what KVM's ABI is. E.g. I guarantee there are plenty of VMMs that don't handle this situation gracefully :-)
- */
+static void test_faulty_idt_desc(void) +{
- struct kvm_vm *vm;
- struct kvm_vcpu *vcpu;
- pr_info("Testing a faulty IDT descriptor pointing to an MMIO address\n");
- vm = vm_create_with_one_vcpu(&vcpu, guest_code_faulty_idt_desc);
- virt_map(vm, MEM_REGION_GPA, MEM_REGION_GPA, 1);
- vcpu_run(vcpu);
- TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_INTERNAL_ERROR);
- TEST_ASSERT(vcpu->run->internal.suberror == KVM_INTERNAL_ERROR_DELIVERY_EV,
"Unexpected suberror = %d", vcpu->run->internal.suberror);
- TEST_ASSERT(vcpu->run->internal.ndata > 4, "Unexpected internal error data array size = %d",
vcpu->run->internal.ndata);
Capture "run", or maybe event "internal" in a local variable. Doing so will shorten these lines and make the code easier to read. I'd probably vote for grabbing "internal" since TEST_ASSERT_KVM_EXIT_REASON() takes care of asserting on the bits outside of "internal".
- /* The "faulty" GPA address should be = IDT base + offset of the GP vector */
GPA address is redundant. GPA is Guest Physical Address.
Again, avoid "faulty". "reported" works nicely. And try not to mix code with human language (though it's ok for math, e.g. the '+' is totally fine and preferred). The '=' is hard to read because it looks like a typo. And in this case, there's no need to actually say "equal to". And similar to writing changelogs for humans instead of giving a play-by-play of the code, do the same for comments, e.g.
/* The reported GPA should be the address of the #GP entry in the IDT. */
- TEST_ASSERT(vcpu->run->internal.data[3] == MEM_REGION_GPA +
GP_VECTOR * sizeof(struct idt_entry),
Put the math on one line, i.e.
vcpu->run->internal.data[3] == MEM_REGION_GPA + GP_VECTOR * sizeof(struct idt_entry),
"Unexpected GPA = %llx", vcpu->run->internal.data[3]);
Print what GPA was expected, so that the user doesn't have to manually figure that out.
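E.g. a sketch, with a hypothetical local 'expected_gpa' for illustration:

	uint64_t expected_gpa = MEM_REGION_GPA + GP_VECTOR * sizeof(struct idt_entry);

	/* The reported GPA should be the address of the #GP entry in the IDT. */
	TEST_ASSERT(vcpu->run->internal.data[3] == expected_gpa,
		    "Expected GPA = 0x%lx, got GPA = 0x%llx",
		    expected_gpa, vcpu->run->internal.data[3]);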
- kvm_vm_free(vm);
+} #endif int main(int argc, char *argv[]) @@ -568,6 +613,7 @@ int main(int argc, char *argv[]) * KVM_RUN fails with ENOEXEC or EFAULT. */ test_zero_memory_regions();
- test_faulty_idt_desc();
#endif test_invalid_memory_region_flags(); -- 2.43.0
On Fri, 2024-10-11 at 17:21 -0700, Sean Christopherson wrote:
+ /* We should never reach this point */
No pronouns. Yes, it's nitpicky, but "we" gets _very_ ambiguous when "we" could mean the admin, the user, the VMM, KVM, the guest, etc.
+ GUEST_ASSERT(0);
Is there really *any* way that can be interpreted as anything other than "the CPU executing this code will never get to this point and that's why there's an ASSERT(0) right after this comment"?
I don't believe there's *any* way that particular pronoun can be ambiguous, and now we've got to the point of fetishising the bizarre "no pronouns" rule just for the sake of it.
I get it, especially for some individuals it *can* be difficult to take context into account, and the wilful use of pronouns instead of spelling things out explicitly *every* *single* *time* can sometimes help. But at a cost of conciseness and brevity.
On Thu, Oct 17, 2024, David Woodhouse wrote:
On Fri, 2024-10-11 at 17:21 -0700, Sean Christopherson wrote:
+ /* We should never reach this point */
No pronouns. Yes, it's nitpicky, but "we" gets _very_ ambiguous when "we" could mean the admin, the user, the VMM, KVM, the guest, etc.
+ GUEST_ASSERT(0);
Is there really *any* way that can be interpreted as anything other than "the CPU executing this code will never get to this point and that's why there's an ASSERT(0) right after this comment"?
I don't believe there's *any* way that particular pronoun can be ambiguous, and now we've got to the point of fetishising the bizarre "no pronouns" rule just for the sake of it.
No, it's not just for the sake of it. In this case, "we" isn't all that ambiguous, (though my interpretation of it is "the test", not "the CPU"), but only because the comment is utterly useless. The GUEST_ASSERT(0) communicates very clearly that it's supposed to be unreachable.
And if the comment were rewritten to explain _why_ the code is unreachable, then "we" is all but guaranteed to become ambiguous, because explaining "why" likely means precisely describing the behavior of the userspace side, the guest side, and/or KVM. In other words, using "we" or "us" is often a hint that either the statement is likely ambiguous or doesn't add value.
And irrespective of whether or not you agree with the above, having a hard rule of "no we, no us" eliminates all subjectivity, and for me that is sufficient reason to enforce the rule.
I get it, especially for some individuals it *can* be difficult to take context into account, and the wilful use of pronouns instead of spelling things out explicitly *every* *single* *time* can sometimes help. But at a cost of conciseness and brevity.
In this particular case, I am more than willing to sacrifice brevity. I 100% agree that there is value in having to-the-point comments and changelogs, but I can't recall a single time where avoiding a "we" or "us" made a statement meaningfully harder to read and understand. On the other hand, I can recall many, many changelogs I had to re-read multiple times because I struggled to figure out how the author _intended_ "we" or "us" to be interpreted.