Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA) during a stage-2 guest abort, KVM today directly injects an asynchronous SError into the VCPU and resumes it. The injected SError usually results in an unpleasant guest kernel panic.
One of the major sources of guest SEA is a VCPU consuming a recoverable uncorrected memory error (UER), which is not uncommon at all in modern datacenter servers with large amounts of physical memory. Although SError and guest panic are sufficient to stop the propagation of corrupted memory, there is still room to recover from memory UER in a more graceful manner.
Proposed Solution
=================
Alternatively, KVM can replay the SEA to the faulting VCPU via the existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault that caused the SEA did not come from the guest kernel, the blast radius can be limited to the consuming or faulting guest userspace process, so the VM can keep running.
In addition, instead of doing this under the hood without involving userspace, there are benefits to redirecting the SEA to the VMM:
- VM customers care about the disruptions caused by memory errors, and the VMM usually has the responsibility to start the process of notifying customers of memory error events in their VMs. For example, some cloud providers emit a critical log in their observability UI [1] and provide a playbook for customers on how to mitigate disruptions to their workloads.
- VMM can protect against future memory error consumption or faults by unmapping the poisoned pages from the stage-2 page table with KVM userfault [2], which is more performant than splitting the memslot that contains the poisoned guest pages.
- VMM can keep track of SEA events in the VM. When the VMM decides the status of the host or the VM is bad enough, e.g. the number of distinct SEAs exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with the x86 architecture. When a machine check exception (MCE) is caused by a VCPU, the kernel or KVM signals SIGBUS to userspace, letting the VMM either recover from the MCE or terminate itself together with the VM. The prior RFC proposed to implement SIGBUS on arm64 as well, but Marc preferred a VCPU exit over a signal [3]. Implementation aside, returning SEA to the VMM is on par with returning MCE to the VMM.
Once the SEA is redirected to the VMM, among other actions, the VMM is encouraged to inject an external abort into the faulting VCPU. KVM on arm64 already supports this, although it is not fully covered by KVM_SET_VCPU_EVENTS; that gap is closed in this patchset.
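For illustration only (not part of the series), a minimal userspace sketch of replaying an external data abort with the existing KVM_SET_VCPU_EVENTS API; vcpu_fd is a hypothetical file descriptor of the faulting VCPU:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Replay an external data abort into the faulting VCPU. */
static int replay_ext_dabt(int vcpu_fd)
{
        struct kvm_vcpu_events events;

        memset(&events, 0, sizeof(events));
        events.exception.ext_dabt_pending = 1;

        /* KVM delivers the abort before the next KVM_RUN resumes the VCPU. */
        return ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
}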
New UAPIs
=========
This patchset introduces the following userspace-visible changes to empower the VMM to control what happens next for a guest SEA:
- KVM_CAP_ARM_SEA_TO_USER. If userspace enables this new capability at VM creation, KVM will not inject an SError when taking an SEA, but will instead exit to userspace (see the sketch after this list).
- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. As many details about the SEA as possible are provided in arm_sea, including the ESR value at EL2, whether the guest virtual and physical addresses (GVA and GPA) are available, and their values when available.
- KVM_CAP_ARM_INJECT_EXT_IABT. The VMM today can inject an external data abort into the VCPU via the KVM_SET_VCPU_EVENTS API, but it cannot inject an instruction abort that way. KVM_CAP_ARM_INJECT_EXT_IABT is a natural extension of KVM_CAP_ARM_INJECT_EXT_DABT that tells the VMM that KVM_SET_VCPU_EVENTS now also supports external instruction aborts.
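To make the opt-in flow concrete, here is a minimal sketch, assuming the usual /dev/kvm and VM file descriptors (kvm_fd, vm_fd) and the uAPI headers from this series:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Opt in to SEA exits at VM creation time; returns 0 on success. */
static int enable_sea_to_user(int kvm_fd, int vm_fd)
{
        struct kvm_enable_cap cap;

        /* Probe the capability first; kernels without this series report 0. */
        if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_SEA_TO_USER) <= 0)
                return -1;

        memset(&cap, 0, sizeof(cap));
        cap.cap = KVM_CAP_ARM_SEA_TO_USER;
        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}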
This patchset utilizes commit 26fbdf369227 ("KVM: arm64: Don't translate FAR if invalid/unsafe") from [4], already available in kvmarm/next. [4] makes KVM safely perform address translation for HPFAR_EL2, including in the event of an SEA, and indicates whether HPFAR_EL2 is valid via the NS bit. This patchset depends on [4] to tell userspace whether the GPA is valid and its value when valid.
Patchset is based on commit 68ec8b4e84446 ("Merge branch kvm-arm64/pkvm-6.16 into kvmarm-master/next")
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20...
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/all/174369514508.3034362.13165690020799838042.b4-ty@...
Jiaqi Yan (5):
  KVM: arm64: VM exit to userspace to handle SEA
  KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid
  KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER
  KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT
  Documentation: kvm: new uAPI for handling SEA

Raghavendra Rao Ananta (1):
  KVM: arm64: Allow userspace to inject external instruction aborts
 Documentation/virt/kvm/api.rst                | 120 ++++++-
 arch/arm64/include/asm/kvm_emulate.h          |  12 +
 arch/arm64/include/asm/kvm_host.h             |   8 +
 arch/arm64/include/asm/kvm_ras.h              |  21 +-
 arch/arm64/include/uapi/asm/kvm.h             |   3 +-
 arch/arm64/kvm/Makefile                       |   3 +-
 arch/arm64/kvm/arm.c                          |   6 +
 arch/arm64/kvm/guest.c                        |  13 +-
 arch/arm64/kvm/inject_fault.c                 |   3 +
 arch/arm64/kvm/kvm_ras.c                      |  54 +++
 arch/arm64/kvm/mmu.c                          |  12 +-
 include/uapi/linux/kvm.h                      |  12 +
 tools/arch/arm64/include/uapi/asm/kvm.h       |   3 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../testing/selftests/kvm/arm64/inject_iabt.c | 100 ++++++
 .../testing/selftests/kvm/arm64/sea_to_user.c | 324 ++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 17 files changed, 654 insertions(+), 43 deletions(-)
 create mode 100644 arch/arm64/kvm/kvm_ras.c
 create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
 create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
When APEI fails to handle a stage-2 abort that is a synchronous external abort (SEA), KVM today directly injects an asynchronous SError into the VCPU and resumes it, which usually results in an unpleasant guest kernel panic.
One major source of guest SEA is a vCPU consuming a recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stop the propagation of corrupted memory, there is still room to recover from memory UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to the VMM to

- Reduce the blast radius if possible. The VMM can inject an SEA into the VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from the guest kernel, the blast radius can be limited to the triggering thread in guest userspace, so the VM can keep running.

- Protect against future memory poison consumption by unmapping the page from stage-2 with KVM userfault [1]. The VMM can also track SEA events that the VM customer cares about, restart the VM when a certain number of distinct poison events have happened, and provide observability to customers [2].
Introduce the following userspace-visible features to let the VMM handle SEA (a brief userspace sketch follows the list):

- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior when host APEI fails to claim an SEA, userspace can opt in to this new capability to let KVM exit to userspace during a synchronous abort.

- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this, and KVM fills kvm_run.arm_sea with as much information about the SEA as possible, including
  - ESR_EL2.
  - Whether the faulting guest virtual and physical addresses are available.
  - The faulting guest virtual address, if available.
  - The faulting guest physical address, if available.
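A hedged sketch of the userspace side, assuming the updated <linux/kvm.h> from this series; run is the mmap'ed kvm_run of the faulting vCPU:

#include <linux/kvm.h>
#include <stdio.h>

/* Dump what KVM reported for a KVM_EXIT_ARM_SEA exit. */
static void dump_arm_sea(const struct kvm_run *run)
{
        fprintf(stderr, "SEA: esr=%#llx flags=%#llx\n",
                (unsigned long long)run->arm_sea.esr,
                (unsigned long long)run->arm_sea.flags);

        if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GVA_VALID)
                fprintf(stderr, "  faulting GVA: %#llx\n",
                        (unsigned long long)run->arm_sea.gva);

        if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID)
                fprintf(stderr, "  faulting GPA: %#llx\n",
                        (unsigned long long)run->arm_sea.gpa);
}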
[1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20...
[2] https://cloud.google.com/solutions/sap/docs/manage-host-errors
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/include/asm/kvm_emulate.h | 12 +++++++
 arch/arm64/include/asm/kvm_host.h    |  8 +++++
 arch/arm64/include/asm/kvm_ras.h     | 21 ++++-------
 arch/arm64/kvm/Makefile              |  3 +-
 arch/arm64/kvm/arm.c                 |  5 +++
 arch/arm64/kvm/kvm_ras.c             | 54 ++++++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c                 | 12 ++-----
 include/uapi/linux/kvm.h             | 11 ++++++
 8 files changed, 101 insertions(+), 25 deletions(-)
 create mode 100644 arch/arm64/kvm/kvm_ras.c
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index bd020fc28aa9c..a9de30478a088 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -429,6 +429,18 @@ static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu) } }
+/* Return true if FAR holds valid faulting guest virtual address. */ +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) +{ + return !(kvm_vcpu_get_esr(vcpu) & ESR_ELx_FnV); +} + +/* Return true if HPFAR_EL2 holds valid faulting guest physical address. */ +static inline bool kvm_vcpu_sea_ipa_valid(const struct kvm_vcpu *vcpu) +{ + return vcpu->arch.fault.hpfar_el2 & HPFAR_EL2_NS; +} + static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu) { u64 esr = kvm_vcpu_get_esr(vcpu); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 73b7762b0e7d1..e0129f9799f80 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -342,6 +342,14 @@ struct kvm_arch { #define KVM_ARCH_FLAG_GUEST_HAS_SVE 9 /* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */ #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10 + /* + * When APEI failed to claim stage-2 synchronous external abort + * (SEA) return to userspace with fault information. Userspace + * can opt in this feature if KVM_CAP_ARM_SEA_TO_USER is + * supported. Userspace is encouraged to handle this VM exit + * by injecting a SEA to VCPU before resume the VCPU. + */ +#define KVM_ARCH_FLAG_RETURN_SEA_TO_USER 11 unsigned long flags;
/* VM-wide vCPU feature set */ diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h index 9398ade632aaf..a2fd91af8f97e 100644 --- a/arch/arm64/include/asm/kvm_ras.h +++ b/arch/arm64/include/asm/kvm_ras.h @@ -4,22 +4,15 @@ #ifndef __ARM64_KVM_RAS_H__ #define __ARM64_KVM_RAS_H__
-#include <linux/acpi.h> -#include <linux/errno.h> -#include <linux/types.h> - -#include <asm/acpi.h> +#include <linux/kvm_host.h>
/* - * Was this synchronous external abort a RAS notification? - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. + * Handle stage2 synchronous external abort (SEA) in the following order: + * 1. Delegate to APEI/GHES and if they can claim SEA, resume guest. + * 2. If userspace opt-ed in KVM_CAP_ARM_SEA_TO_USER, exit to userspace + * with details about the SEA. + * 3. Otherwise, inject async SError into the VCPU and resume guest. */ -static inline int kvm_handle_guest_sea(void) -{ - /* apei_claim_sea(NULL) expects to mask interrupts itself */ - lockdep_assert_irqs_enabled(); - - return apei_claim_sea(NULL); -} +int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
#endif /* __ARM64_KVM_RAS_H__ */ diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile index 209bc76263f10..785d568411e88 100644 --- a/arch/arm64/kvm/Makefile +++ b/arch/arm64/kvm/Makefile @@ -23,7 +23,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ vgic/vgic-v3.o vgic/vgic-v4.o \ vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ - vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o + vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \ + kvm_ras.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 19ca57def6292..47544945fba45 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -133,6 +133,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } mutex_unlock(&kvm->lock); break; + case KVM_CAP_ARM_SEA_TO_USER: + r = 0; + set_bit(KVM_ARCH_FLAG_RETURN_SEA_TO_USER, &kvm->arch.flags); + break; default: break; } @@ -322,6 +326,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_IRQFD_RESAMPLE: case KVM_CAP_COUNTER_OFFSET: case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS: + case KVM_CAP_ARM_SEA_TO_USER: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c new file mode 100644 index 0000000000000..83f2731c95d77 --- /dev/null +++ b/arch/arm64/kvm/kvm_ras.c @@ -0,0 +1,54 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include <linux/acpi.h> +#include <linux/types.h> +#include <asm/acpi.h> +#include <asm/kvm_emulate.h> +#include <asm/kvm_ras.h> +#include <asm/system_misc.h> + +/* + * Was this synchronous external abort a RAS notification? + * Returns 0 for errors handled by some RAS subsystem, or -ENOENT. + */ +static int kvm_delegate_guest_sea(void) +{ + /* apei_claim_sea(NULL) expects to mask interrupts itself. */ + lockdep_assert_irqs_enabled(); + return apei_claim_sea(NULL); +} + +int kvm_handle_guest_sea(struct kvm_vcpu *vcpu) +{ + struct kvm_run *run = vcpu->run; + bool exit = test_bit(KVM_ARCH_FLAG_RETURN_SEA_TO_USER, + &vcpu->kvm->arch.flags); + + /* For RAS the host kernel may handle this abort. */ + if (kvm_delegate_guest_sea() == 0) + return 1; + + if (!exit) { + /* Fallback behavior prior to KVM_EXIT_ARM_SEA. */ + kvm_inject_vabt(vcpu); + return 1; + } + + run->exit_reason = KVM_EXIT_ARM_SEA; + run->arm_sea.esr = kvm_vcpu_get_esr(vcpu); + run->arm_sea.flags = 0ULL; + run->arm_sea.gva = 0ULL; + run->arm_sea.gpa = 0ULL; + + if (kvm_vcpu_sea_far_valid(vcpu)) { + run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GVA_VALID; + run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu); + } + + if (kvm_vcpu_sea_ipa_valid(vcpu)) { + run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID; + run->arm_sea.gpa = kvm_vcpu_get_fault_ipa(vcpu); + } + + return 0; +} diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 754f2fe0cc673..a605ee56fa150 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1795,16 +1795,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) int ret, idx;
/* Synchronous External Abort? */ - if (kvm_vcpu_abt_issea(vcpu)) { - /* - * For RAS the host kernel may handle this abort. - * There is no need to pass the error into the guest. - */ - if (kvm_handle_guest_sea()) - kvm_inject_vabt(vcpu); - - return 1; - } + if (kvm_vcpu_abt_issea(vcpu)) + return kvm_handle_guest_sea(vcpu);
esr = kvm_vcpu_get_esr(vcpu);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index b6ae8ad8934b5..79dc4676ff74b 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -178,6 +178,7 @@ struct kvm_xen_exit { #define KVM_EXIT_NOTIFY 37 #define KVM_EXIT_LOONGARCH_IOCSR 38 #define KVM_EXIT_MEMORY_FAULT 39 +#define KVM_EXIT_ARM_SEA 40
/* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. */ @@ -446,6 +447,15 @@ struct kvm_run { __u64 gpa; __u64 size; } memory_fault; + /* KVM_EXIT_ARM_SEA */ + struct { + __u64 esr; +#define KVM_EXIT_ARM_SEA_FLAG_GVA_VALID (1ULL << 0) +#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 1) + __u64 flags; + __u64 gva; + __u64 gpa; + } arm_sea; /* Fix the size of the union. */ char padding[256]; }; @@ -930,6 +940,7 @@ struct kvm_enable_cap { #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237 #define KVM_CAP_X86_GUEST_MODE 238 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239 +#define KVM_CAP_ARM_SEA_TO_USER 240
struct kvm_irq_routing_irqchip { __u32 irqchip;
Certain microarchitectures (e.g. Neoverse V2) do not keep track of the faulting address for a memory load that consumes poisoned data and results in a synchronous external abort (SEA). This means the faulting guest virtual address is unavailable when KVM handles such SEA in EL2, and FAR_EL2 just holds a garbage value.

In case the VMM later asks KVM to synchronously inject an SEA into the guest, KVM should set the FnV bit
- in the VCPU's ESR_EL1, to let the guest kernel know that FAR_EL1 is invalid and holds a garbage value
- in the VCPU's ESR_EL2, to let nested virtualization know that FAR_EL2 is invalid and holds a garbage value
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/kvm/inject_fault.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index a640e839848e6..b4f9a09952ead 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -81,6 +81,9 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
 	if (!is_iabt)
 		esr |= ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT;
 
+	if (!kvm_vcpu_sea_far_valid(vcpu))
+		esr |= ESR_ELx_FnV;
+
 	esr |= ESR_ELx_FSC_EXTABT;
if (match_target_el(vcpu, unpack_vcpu_flag(EXCEPT_AA64_EL1_SYNC))) {
From: Raghavendra Rao Ananta <rananta@google.com>
When KVM returns to userspace with KVM_EXIT_ARM_SEA, userspace is encouraged to inject the abort into the guest via KVM_SET_VCPU_EVENTS.
KVM_SET_VCPU_EVENTS currently only allows injecting external data aborts. However, the synchronous external abort that caused KVM_EXIT_ARM_SEA may well be an instruction abort. Userspace is already able to tell whether an abort is a data or instruction abort via kvm_run.arm_sea.esr, by checking its Exception Class value.
Extend the KVM_SET_VCPU_EVENTS ioctl to allow injecting an external instruction abort into the guest.
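As an illustration only (the EC encodings below are from the Arm architecture; the ext_iabt_pending field requires the uAPI header change in this patch), userspace could pick the abort type to replay like this:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

#define ESR_EC_SHIFT    26
#define ESR_EC_MASK     0x3fULL
#define ESR_EC_IABT_LOW 0x20    /* instruction abort from a lower EL */
#define ESR_EC_DABT_LOW 0x24    /* data abort from a lower EL */

/* Replay the same class of abort that kvm_run.arm_sea.esr reported. */
static int replay_sea(int vcpu_fd, unsigned long long esr)
{
        struct kvm_vcpu_events events;
        unsigned long long ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;

        memset(&events, 0, sizeof(events));
        if (ec == ESR_EC_IABT_LOW)
                events.exception.ext_iabt_pending = 1;
        else if (ec == ESR_EC_DABT_LOW)
                events.exception.ext_dabt_pending = 1;
        else
                return -1;      /* not an abort this sketch handles */

        return ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
}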
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
---
 arch/arm64/include/uapi/asm/kvm.h |  3 ++-
 arch/arm64/kvm/arm.c              |  1 +
 arch/arm64/kvm/guest.c            | 13 ++++++++++---
 include/uapi/linux/kvm.h          |  1 +
 4 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index ed5f3892674c7..643e8c4825451 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -184,8 +184,9 @@ struct kvm_vcpu_events { __u8 serror_pending; __u8 serror_has_esr; __u8 ext_dabt_pending; + __u8 ext_iabt_pending; /* Align it to 8 bytes */ - __u8 pad[5]; + __u8 pad[4]; __u64 serror_esr; } exception; __u32 reserved[12]; diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 47544945fba45..dc2efb627f450 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -319,6 +319,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ARM_IRQ_LINE_LAYOUT_2: case KVM_CAP_ARM_NISV_TO_USER: case KVM_CAP_ARM_INJECT_EXT_DABT: + case KVM_CAP_ARM_INJECT_EXT_IABT: case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index 2196979a24a32..4917361ecf5cb 100644 --- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -825,9 +825,9 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu, events->exception.serror_esr = vcpu_get_vsesr(vcpu);
/* - * We never return a pending ext_dabt here because we deliver it to - * the virtual CPU directly when setting the event and it's no longer - * 'pending' at this point. + * We never return a pending ext_dabt or ext_iabt here because we + * deliver it to the virtual CPU directly when setting the event + * and it's no longer 'pending' at this point. */
return 0; @@ -839,6 +839,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, bool serror_pending = events->exception.serror_pending; bool has_esr = events->exception.serror_has_esr; bool ext_dabt_pending = events->exception.ext_dabt_pending; + bool ext_iabt_pending = events->exception.ext_iabt_pending;
if (serror_pending && has_esr) { if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN)) @@ -852,8 +853,14 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, kvm_inject_vabt(vcpu); }
+ /* DABT and IABT cannot happen at the same time. */ + if (ext_dabt_pending && ext_iabt_pending) + return -EINVAL; + if (ext_dabt_pending) kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu)); + else if (ext_iabt_pending) + kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
return 0; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 79dc4676ff74b..bcf2b95b79123 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -941,6 +941,7 @@ struct kvm_enable_cap { #define KVM_CAP_X86_GUEST_MODE 238 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239 #define KVM_CAP_ARM_SEA_TO_USER 240 +#define KVM_CAP_ARM_INJECT_EXT_IABT 241
struct kvm_irq_routing_irqchip { __u32 irqchip;
Test how KVM handles guest stage-2 SEA when APEI is unable to claim it. The behavior is triggered by consuming a recoverable uncorrected memory error (UER) injected via EINJ. The test asserts two major things:
1. KVM returns to userspace with the KVM_EXIT_ARM_SEA exit reason and provides correct fault information, e.g. esr, flags, gva, gpa.
2. Userspace is able to handle KVM_EXIT_ARM_SEA by injecting an SEA into the guest, and KVM injects the expected SEA into the VCPU.
Tested on a data center server running the Siryn AmpereOne processor. Several things to notice before attempting to run this selftest:
- The test relies on EINJ support in both firmware and kernel to inject UER. Otherwise the test will be skipped.
- The APEI of the platform under test should be unable to claim the SEA. Otherwise the test will be skipped.
- Some platforms don't support notrigger in EINJ, which may cause APEI and GHES to offline the memory before the guest can consume the injected UER, making the test unable to trigger SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/arm64/sea_to_user.c | 324 ++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 3 files changed, 326 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index f62b0a5aba35a..16d2e9f32619f 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -151,6 +151,7 @@ TEST_GEN_PROGS_arm64 += arm64/hypercalls TEST_GEN_PROGS_arm64 += arm64/mmio_abort TEST_GEN_PROGS_arm64 += arm64/page_fault_test TEST_GEN_PROGS_arm64 += arm64/psci_test +TEST_GEN_PROGS_arm64 += arm64/sea_to_user TEST_GEN_PROGS_arm64 += arm64/set_id_regs TEST_GEN_PROGS_arm64 += arm64/smccc_filter TEST_GEN_PROGS_arm64 += arm64/vcpu_width_config diff --git a/tools/testing/selftests/kvm/arm64/sea_to_user.c b/tools/testing/selftests/kvm/arm64/sea_to_user.c new file mode 100644 index 0000000000000..9490cdbad3466 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/sea_to_user.c @@ -0,0 +1,324 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Test KVM returns to userspace with KVM_EXIT_ARM_SEA if host APEI fails + * to handle SEA and userspace has opt-ed in KVM_CAP_ARM_SEA_TO_USER. + * + * After reaching userspace with expected arm_sea info, also test userspace + * injecting a synchronous external data abort into the guest. + * + * This test utilizes EINJ to generate a REAL synchronous external data + * abort by consuming a recoverable uncorrectable memory error. Therefore + * the device under test must support EINJ in both firmware and host kernel, + * including the notrigger feature. Otherwise the test will be skipped. + * The under-test platform's APEI should be unable to claim SEA. Otherwise + * the test will also be skipped. + */ + +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "guest_modes.h" + +#define PAGE_PRESENT (1ULL << 63) +#define PAGE_PHYSICAL 0x007fffffffffffffULL +#define PAGE_ADDR_MASK (~(0xfffULL)) + +/* Value for "Recoverable state (UER)". */ +#define ESR_ELx_SET_UER 0U + +#define EINJ_ETYPE "/sys/kernel/debug/apei/einj/error_type" +#define EINJ_ADDR "/sys/kernel/debug/apei/einj/param1" +#define EINJ_MASK "/sys/kernel/debug/apei/einj/param2" +#define EINJ_FLAGS "/sys/kernel/debug/apei/einj/flags" +#define EINJ_NOTRIGGER "/sys/kernel/debug/apei/einj/notrigger" +#define EINJ_DOIT "/sys/kernel/debug/apei/einj/error_inject" +/* Memory Uncorrectable non-fatal. */ +#define ERROR_TYPE_MEMORY_UER 0x10 +/* Memory address and mask valid (param1 and param2). */ +#define MASK_MEMORY_UER 0b10 + +/* Guest virtual address region = [2G, 3G). */ +#define START_GVA 0x80000000UL +#define VM_MEM_SIZE 0x40000000UL +/* Note: EINJ_OFFSET must < VM_MEM_SIZE. 
*/ +#define EINJ_OFFSET 0x05234badUL +#define EINJ_GVA ((START_GVA) + (EINJ_OFFSET)) + +static vm_paddr_t einj_gpa; +static void *einj_hva; +static uint64_t einj_hpa; +static bool far_invalid; + +static uint64_t translate_to_host_paddr(unsigned long vaddr) +{ + uint64_t pinfo; + int64_t offset = vaddr / getpagesize() * sizeof(pinfo); + int fd; + uint64_t page_addr; + uint64_t paddr; + + fd = open("/proc/self/pagemap", O_RDONLY); + if (fd < 0) + ksft_exit_fail_perror("Failed to open /proc/self/pagemap"); + if (pread(fd, &pinfo, sizeof(pinfo), offset) != sizeof(pinfo)) { + close(fd); + ksft_exit_fail_perror("Failed to read /proc/self/pagemap"); + } + + close(fd); + + if ((pinfo & PAGE_PRESENT) == 0) + ksft_exit_fail_perror("Page not present"); + + page_addr = (pinfo & PAGE_PHYSICAL) << MIN_PAGE_SHIFT; + paddr = page_addr + (vaddr & (getpagesize() - 1)); + return paddr; +} + +static void write_einj_entry(const char *einj_path, uint64_t val) +{ + char cmd[256] = {0}; + FILE *cmdfile = NULL; + + sprintf(cmd, "echo %#lx > %s", val, einj_path); + cmdfile = popen(cmd, "r"); + + if (pclose(cmdfile) == 0) + ksft_print_msg("echo %#lx > %s - done\n", val, einj_path); + else + ksft_exit_fail_perror("Failed to write EINJ entry"); +} + +static void inject_uer(uint64_t paddr) +{ + if (access("/sys/firmware/acpi/tables/EINJ", R_OK) == -1) + ksft_test_result_skip("EINJ table no available in firmware"); + + if (access(EINJ_ETYPE, R_OK | W_OK) == -1) + ksft_test_result_skip("EINJ module probably not loaded?"); + + write_einj_entry(EINJ_ETYPE, ERROR_TYPE_MEMORY_UER); + write_einj_entry(EINJ_FLAGS, MASK_MEMORY_UER); + write_einj_entry(EINJ_ADDR, paddr); + write_einj_entry(EINJ_MASK, ~0x0UL); + write_einj_entry(EINJ_NOTRIGGER, 1); + write_einj_entry(EINJ_DOIT, 1); +} + +/* + * When host APEI successfully claims the SEA caused by guest_code, kernel + * will send SIGBUS signal with BUS_MCEERR_AR to test thread. + * + * We set up this SIGBUS handler to skip the test for that case. + */ +static void sigbus_signal_handler(int sig, siginfo_t *si, void *v) +{ + ksft_print_msg("SIGBUS (%d) received, dumping siginfo...\n", sig); + ksft_print_msg("si_signo=%d, si_errno=%d, si_code=%d, si_addr=%p\n", + si->si_signo, si->si_errno, si->si_code, si->si_addr); + if (si->si_code == BUS_MCEERR_AR) + ksft_test_result_skip("SEA is claimed by host APEI\n"); + else + ksft_test_result_fail("Exit with signal unhandled\n"); + + exit(0); +} + +static void setup_sigbus_handler(void) +{ + struct sigaction act; + + memset(&act, 0, sizeof(act)); + sigemptyset(&act.sa_mask); + act.sa_sigaction = sigbus_signal_handler; + act.sa_flags = SA_SIGINFO; + TEST_ASSERT(sigaction(SIGBUS, &act, NULL) == 0, + "Failed to setup SIGBUS handler"); +} + +static void guest_code(void) +{ + uint64_t guest_data; + + /* Consumes error will cause a SEA. 
*/ + guest_data = *(uint64_t *)EINJ_GVA; + + GUEST_FAIL("Data corruption not prevented by SEA: gva=%#lx, data=%#lx", + EINJ_GVA, guest_data); +} + +static void expect_sea_handler(struct ex_regs *regs) +{ + u64 esr = read_sysreg(esr_el1); + u64 far = read_sysreg(far_el1); + bool expect_far_invalid = far_invalid; + + GUEST_PRINTF("Guest SEA esr_el1=%#lx, far_el1=%#lx\n", esr, far); + + GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_CUR); + GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); + + if (expect_far_invalid) { + GUEST_ASSERT(esr & ESR_ELx_FnV); + GUEST_PRINTF("Guest observed garbage value in FAR\n"); + } else { + GUEST_ASSERT(!(esr & ESR_ELx_FnV)); + GUEST_ASSERT_EQ(far, EINJ_GVA); + } + + GUEST_DONE(); +} + +static void vcpu_inject_sea(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_events events = {}; + + events.exception.ext_dabt_pending = true; + vcpu_events_set(vcpu, &events); +} + +static void run_vm(struct kvm_vm *vm, struct kvm_vcpu *vcpu) +{ + struct ucall uc; + bool guest_done = false; + struct kvm_run *run = vcpu->run; + + /* Resume the vCPU after error injection to consume the error. */ + vcpu_run(vcpu); + + ksft_print_msg("Dump kvm_run info about KVM_EXIT_%s\n", + exit_reason_str(run->exit_reason)); + ksft_print_msg("kvm_run.arm_sea: esr=%#llx, flags=%#llx\n", + run->arm_sea.esr, run->arm_sea.flags); + ksft_print_msg("kvm_run.arm_sea: gva=%#llx, gpa=%#llx\n", + run->arm_sea.gva, run->arm_sea.gpa); + + /* Validate the KVM_EXIT. */ + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_ARM_SEA); + TEST_ASSERT_EQ(ESR_ELx_EC(run->arm_sea.esr), ESR_ELx_EC_DABT_LOW); + TEST_ASSERT_EQ(run->arm_sea.esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); + TEST_ASSERT_EQ(run->arm_sea.esr & ESR_ELx_SET_MASK, ESR_ELx_SET_UER); + + if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GVA_VALID) + TEST_ASSERT_EQ(run->arm_sea.gva, EINJ_GVA); + + if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID) + TEST_ASSERT_EQ(run->arm_sea.gpa, einj_gpa & PAGE_ADDR_MASK); + + far_invalid = run->arm_sea.esr & ESR_ELx_FnV; + + /* Inject a SEA into guest and expect handled in SEA handler. */ + vcpu_inject_sea(vcpu); + + /* Expect the guest to reach GUEST_DONE gracefully. 
*/ + do { + vcpu_run(vcpu); + switch (get_ucall(vcpu, &uc)) { + case UCALL_PRINTF: + ksft_print_msg("From guest: %s", uc.buffer); + break; + case UCALL_DONE: + ksft_print_msg("Guest done gracefully!\n"); + guest_done = 1; + break; + case UCALL_ABORT: + ksft_print_msg("Guest aborted!\n"); + guest_done = 1; + REPORT_GUEST_ASSERT(uc); + break; + default: + TEST_FAIL("Unexpected ucall: %lu\n", uc.cmd); + } + } while (!guest_done); +} + +static struct kvm_vm *vm_create_with_sea_handler(struct kvm_vcpu **vcpu) +{ + size_t backing_page_size; + size_t guest_page_size; + size_t alignment; + uint64_t num_guest_pages; + vm_paddr_t start_gpa; + enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB; + struct kvm_vm *vm; + + backing_page_size = get_backing_src_pagesz(src_type); + guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size; + alignment = max(backing_page_size, guest_page_size); + num_guest_pages = VM_MEM_SIZE / guest_page_size; + + vm = __vm_create_with_one_vcpu(vcpu, num_guest_pages, guest_code); + vm_init_descriptor_tables(vm); + vcpu_init_descriptor_tables(*vcpu); + + vm_install_sync_handler(vm, + /*vector=*/VECTOR_SYNC_CURRENT, + /*ec=*/ESR_ELx_EC_DABT_CUR, + /*handler=*/expect_sea_handler); + + start_gpa = (vm->max_gfn - num_guest_pages) * guest_page_size; + start_gpa = align_down(start_gpa, alignment); + + vm_userspace_mem_region_add( + /*vm=*/vm, + /*src_type=*/src_type, + /*guest_paddr=*/start_gpa, + /*slot=*/1, + /*npages=*/num_guest_pages, + /*flags=*/0); + + virt_map(vm, START_GVA, start_gpa, num_guest_pages); + + ksft_print_msg("Mapped %#lx pages: gva=%#lx to gpa=%#lx\n", + num_guest_pages, START_GVA, start_gpa); + return vm; +} + +static void vm_inject_memory_uer(struct kvm_vm *vm) +{ + uint64_t guest_data; + + einj_gpa = addr_gva2gpa(vm, EINJ_GVA); + einj_hva = addr_gva2hva(vm, EINJ_GVA); + + /* Populate certain data before injecting UER. */ + *(uint64_t *)einj_hva = 0xBAADCAFE; + guest_data = *(uint64_t *)einj_hva; + ksft_print_msg("Before EINJect: data=%#lx\n", + guest_data); + + einj_hpa = translate_to_host_paddr((unsigned long)einj_hva); + + ksft_print_msg("EINJ_GVA=%#lx, einj_gpa=%#lx, einj_hva=%p, einj_hpa=%#lx\n", + EINJ_GVA, einj_gpa, einj_hva, einj_hpa); + + inject_uer(einj_hpa); + ksft_print_msg("Memory UER EINJected\n"); +} + +int main(int argc, char *argv[]) +{ + struct kvm_vm *vm; + struct kvm_vcpu *vcpu; + + TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_SEA_TO_USER)); + + setup_sigbus_handler(); + + vm = vm_create_with_sea_handler(&vcpu); + + vm_enable_cap(vm, KVM_CAP_ARM_SEA_TO_USER, 0); + + vm_inject_memory_uer(vm); + + run_vm(vm, vcpu); + + kvm_vm_free(vm); + + return 0; +} diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index 815bc45dd8dc6..bc9fcf6c3295a 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -2021,6 +2021,7 @@ static struct exit_reason { KVM_EXIT_STRING(NOTIFY), KVM_EXIT_STRING(LOONGARCH_IOCSR), KVM_EXIT_STRING(MEMORY_FAULT), + KVM_EXIT_STRING(ARM_SEA), };
/*
Test that userspace can use KVM_SET_VCPU_EVENTS to inject an external instruction abort into the guest. The test injects the instruction abort at an arbitrary time, without a real SEA happening in the guest VCPU, so only certain ESR_EL1 values can be expected; the same is not true for FAR_EL1.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 tools/arch/arm64/include/uapi/asm/kvm.h       |   3 ++-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/arm64/inject_iabt.c | 100 ++++++++++++++++++
 3 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h index af9d9acaf9975..d3a4530846311 100644 --- a/tools/arch/arm64/include/uapi/asm/kvm.h +++ b/tools/arch/arm64/include/uapi/asm/kvm.h @@ -184,8 +184,9 @@ struct kvm_vcpu_events { __u8 serror_pending; __u8 serror_has_esr; __u8 ext_dabt_pending; + __u8 ext_iabt_pending; /* Align it to 8 bytes */ - __u8 pad[5]; + __u8 pad[4]; __u64 serror_esr; } exception; __u32 reserved[12]; diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 16d2e9f32619f..708fd126a36dd 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -148,6 +148,7 @@ TEST_GEN_PROGS_arm64 += arm64/aarch32_id_regs TEST_GEN_PROGS_arm64 += arm64/arch_timer_edge_cases TEST_GEN_PROGS_arm64 += arm64/debug-exceptions TEST_GEN_PROGS_arm64 += arm64/hypercalls +TEST_GEN_PROGS_arm64 += arm64/inject_iabt TEST_GEN_PROGS_arm64 += arm64/mmio_abort TEST_GEN_PROGS_arm64 += arm64/page_fault_test TEST_GEN_PROGS_arm64 += arm64/psci_test diff --git a/tools/testing/selftests/kvm/arm64/inject_iabt.c b/tools/testing/selftests/kvm/arm64/inject_iabt.c new file mode 100644 index 0000000000000..43b701e9143c2 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/inject_iabt.c @@ -0,0 +1,100 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * inject_iabt.c - Tests for injecting instruction aborts into guest. + */ + +#include "processor.h" +#include "test_util.h" + +static void expect_iabt_handler(struct ex_regs *regs) +{ + u64 esr = read_sysreg(esr_el1); + + GUEST_PRINTF("Guest SEA esr_el1=%#lx\n", esr); + GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_IABT_CUR); + GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); + /* + * We inject IABT but there is no SEA in guest at all, + * so guest should see FnV == 1, which is set by KVM. 
+ */ + GUEST_ASSERT(esr & ESR_ELx_FnV); + + GUEST_DONE(); +} + +static void guest_code(void) +{ + GUEST_FAIL("Guest should only run SEA handler"); +} + +static void vcpu_run_expect_done(struct kvm_vcpu *vcpu) +{ + struct ucall uc; + bool guest_done = false; + + do { + vcpu_run(vcpu); + switch (get_ucall(vcpu, &uc)) { + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + break; + case UCALL_PRINTF: + ksft_print_msg("From guest: %s", uc.buffer); + case UCALL_DONE: + ksft_print_msg("Guest done gracefully!\n"); + guest_done = true; + break; + default: + TEST_FAIL("Unexpected ucall: %lu", uc.cmd); + } + } while (!guest_done); +} + +static void vcpu_inject_ext_iabt(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_events events = {}; + + events.exception.ext_iabt_pending = true; + vcpu_events_set(vcpu, &events); +} + +static void vcpu_inject_invalid_abt(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_events events = {}; + int r; + + events.exception.ext_iabt_pending = true; + events.exception.ext_dabt_pending = true; + + ksft_print_msg("Injecting invalid external abort events\n"); + r = __vcpu_ioctl(vcpu, KVM_SET_VCPU_EVENTS, &events); + TEST_ASSERT(r && errno == EINVAL, + KVM_IOCTL_ERROR(KVM_SET_VCPU_EVENTS, r)); +} + +static void test_inject_iabt(void) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + + vm = vm_create_with_one_vcpu(&vcpu, guest_code); + + vm_init_descriptor_tables(vm); + vcpu_init_descriptor_tables(vcpu); + + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, + ESR_ELx_EC_IABT_CUR, expect_iabt_handler); + + vcpu_inject_invalid_abt(vcpu); + + vcpu_inject_ext_iabt(vcpu); + vcpu_run_expect_done(vcpu); + + kvm_vm_free(vm); +} + +int main(void) +{ + test_inject_iabt(); + return 0; +}
Document the new userspace-visible features and APIs for handling synchronous external abort (SEA):
- KVM_CAP_ARM_SEA_TO_USER: how userspace enables the new feature.
- KVM_EXIT_ARM_SEA: when userspace needs to handle SEA and what userspace gets while taking the SEA.
- KVM_CAP_ARM_INJECT_EXT_(D|I)ABT: how userspace injects SEA into the guest while taking the SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 Documentation/virt/kvm/api.rst | 120 +++++++++++++++++++++++++++++----
 1 file changed, 107 insertions(+), 13 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 47c7c3f92314e..fa91a123e1b88 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1236,8 +1236,9 @@ directly to the virtual CPU).
 		__u8 serror_pending;
 		__u8 serror_has_esr;
 		__u8 ext_dabt_pending;
+		__u8 ext_iabt_pending;
 		/* Align it to 8 bytes */
-		__u8 pad[5];
+		__u8 pad[4];
 		__u64 serror_esr;
 	} exception;
 	__u32 reserved[12];
@@ -1292,20 +1293,52 @@ ARM64:
 
 User space may need to inject several types of events to the guest.
+Inject SError
+~~~~~~~~~~~~~
+
 Set the pending SError exception state for this VCPU. It is not possible to
 'cancel' an Serror that has been made pending.
-If the guest performed an access to I/O memory which could not be handled by
-userspace, for example because of missing instruction syndrome decode
-information or because there is no device mapped at the accessed IPA, then
-userspace can ask the kernel to inject an external abort using the address
-from the exiting fault on the VCPU. It is a programming error to set
-ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or
-KVM_EXIT_ARM_NISV. This feature is only available if the system supports
-KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in
-how userspace reports accesses for the above cases to guests, across different
-userspace implementations. Nevertheless, userspace can still emulate all Arm
-exceptions by manipulating individual registers using the KVM_SET_ONE_REG API.
+Inject SEA (synchronous external abort)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- If the guest performed an access to I/O memory which could not be handled by
+  userspace, for example because of missing instruction syndrome decode
+  information or because there is no device mapped at the accessed IPA.
+
+- If the guest consumed an uncorrected memory error, and the RAS extension in
+  the Trusted Firmware chooses to notify the PE with SEA, KVM has to handle it
+  when host APEI is unable to claim the SEA. For the following types of faults,
+  if userspace enabled KVM_CAP_ARM_SEA_TO_USER, KVM returns to userspace with
+  KVM_EXIT_ARM_SEA:
+
+  - Synchronous external abort, not on translation table walk or hardware
+    update of translation table.
+
+  - Synchronous external abort on translation table walk or hardware update of
+    translation table, including all levels.
+
+  - Synchronous parity or ECC error on memory access, not on translation table
+    walk.
+
+  - Synchronous parity or ECC error on memory access on translation table walk
+    or hardware update of translation table, including all levels.
+
+For the cases above, userspace can ask the kernel to replay either an external
+data abort (by setting ext_dabt_pending) or an external instruction abort
+(by setting ext_iabt_pending) into the faulting VCPU. KVM will use the address
+from the exiting fault on the VCPU. Setting both ext_dabt_pending and
+ext_iabt_pending at the same time will return -EINVAL.
+
+It is a programming error to set ext_dabt_pending or ext_iabt_pending after an
+exit which was not KVM_EXIT_MMIO, KVM_EXIT_ARM_NISV or KVM_EXIT_ARM_SEA.
+Injecting SEA for data and instruction abort is only available if KVM supports
+KVM_CAP_ARM_INJECT_EXT_DABT and KVM_CAP_ARM_INJECT_EXT_IABT respectively.
+
+This is a helper which provides commonality in how userspace reports accesses
+for the above cases to guests, across different userspace implementations.
+Nevertheless, userspace can still emulate all Arm exceptions by manipulating
+individual registers using the KVM_SET_ONE_REG API.
See KVM_GET_VCPU_EVENTS for the data structure.
@@ -7151,6 +7184,55 @@ The valid value for 'flags' is:
   - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
     in VMCS. It would run into unknown result if resume the target VM.
+::
+
+		/* KVM_EXIT_ARM_SEA */
+		struct {
+			__u64 esr;
+  #define KVM_EXIT_ARM_SEA_FLAG_GVA_VALID	(1ULL << 0)
+  #define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID	(1ULL << 1)
+			__u64 flags;
+			__u64 gva;
+			__u64 gpa;
+		} arm_sea;
+
+Used on arm64 systems. When the VM capability KVM_CAP_ARM_SEA_TO_USER is
+enabled, a VM exit is generated if the guest caused a synchronous external
+abort (SEA) and the host APEI fails to handle the SEA.
+
+Historically KVM handles SEA by first delegating the SEA to host APEI, as
+there is a high chance that the SEA is caused by consuming an uncorrected
+memory error. However, not all platforms support SEA handling in APEI, and
+KVM's fallback handling is to inject an async SError into the guest, which
+usually panics the guest kernel unpleasantly. As an alternative, userspace
+can participate in the SEA handling by enabling KVM_CAP_ARM_SEA_TO_USER at
+VM creation, after querying the capability. Once enabled, when KVM has to
+handle the guest-caused SEA, it returns to userspace with KVM_EXIT_ARM_SEA,
+with details about the SEA available in 'arm_sea'.
+
+The 'esr' field holds the value of the exception syndrome register (ESR)
+while KVM is taking the SEA, which tells userspace the character of the
+current SEA, such as its Exception Class, Synchronous Error Type, Fault
+Specific Code and so on. For more details on ESR, check the Arm Architecture
+Registers documentation.
+
+The 'flags' field indicates if the faulting addresses are available while
+taking the SEA:
+
+  - KVM_EXIT_ARM_SEA_FLAG_GVA_VALID -- the faulting guest virtual address
+    is valid and userspace can get its value in the 'gva' field.
+  - KVM_EXIT_ARM_SEA_FLAG_GPA_VALID -- the faulting guest physical address
+    is valid and userspace can get its value in the 'gpa' field.
+
+Userspace needs to take actions to handle guest SEA synchronously, namely in
+the same thread that runs KVM_RUN and receives KVM_EXIT_ARM_SEA. One of the
+encouraged approaches is to utilize KVM_SET_VCPU_EVENTS to inject the SEA
+into the faulting VCPU. This way, the guest has the opportunity to keep
+running and limit the blast radius of the SEA to the particular guest
+application that caused the SEA. If the Exception Class indicated by the
+'esr' field in 'arm_sea' is data abort, userspace should inject a data
+abort. If the Exception Class is instruction abort, userspace should inject
+an instruction abort.
+
 ::
 	/* Fix the size of the union. */
@@ -8478,7 +8560,7 @@ ENOSYS for the others.
 When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of type
 KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
 
-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
 -------------------------------------
 
 :Architectures: arm64
@@ -8496,6 +8578,18 @@ aforementioned registers before the first KVM_RUN. These registers are VM
 scoped, meaning that the same set of values are presented on all vCPUs in a
 given VM.
 
+7.43 KVM_CAP_ARM_SEA_TO_USER
+----------------------------
+
+:Architecture: arm64
+:Target: VM
+:Parameters: none
+:Returns: 0 on success, -EINVAL if unsupported.
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
+that KVM has an implementation that allows userspace to participate in
+handling the synchronous external abort caused by the VM, via an exit of
+KVM_EXIT_ARM_SEA.
+
 8. Other capabilities.
 ======================