Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA) taken during a stage-2 guest abort, today KVM directly injects an asynchronous SError into the VCPU and then resumes it. The injected SError usually results in an unpleasant guest kernel panic.
One of the major situations that causes guest SEA is a VCPU consuming a recoverable uncorrected memory error (UER), which is not uncommon at all in modern datacenter servers with large amounts of physical memory. Although SError plus guest panic is sufficient to stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.
Proposed Solution
=================
Alternatively, KVM can replay the SEA to the faulting VCPU via the existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault that caused the SEA did not come from the guest kernel, the blast radius can be limited to the consuming or faulting guest userspace process, so the VM can keep running.
In addition, instead of handling this under the hood without involving userspace, there are benefits to redirecting the SEA to the VMM:
- VM customers care about the disruptions caused by memory errors, and the VMM usually has the responsibility to start the process of notifying customers of memory error events in their VMs. For example, some cloud providers emit a critical log in their observability UI [1], and provide playbooks for customers on how to mitigate disruptions to their workloads.
- The VMM can protect against future memory error consumption by unmapping the poisoned pages from the stage-2 page table with KVM userfault, or by splitting the memslot that contains the poisoned guest pages [2].
- The VMM can keep track of SEA events in the VM. When the VMM thinks the status of the host or the VM is bad enough, e.g. the number of distinct SEAs exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with the x86 architecture. When a machine check exception (MCE) is caused by a VCPU, the kernel or KVM signals SIGBUS to userspace to let the VMM either recover from the MCE or terminate itself together with the VM. The prior RFC proposed implementing SIGBUS on arm64 as well, but Marc preferred a VCPU exit over a signal [3]. Implementation aside, returning a SEA to the VMM is on par with returning an MCE to the VMM.
Once the SEA is redirected to the VMM, among other actions, the VMM is encouraged to inject external aborts into the faulting VCPU, which is already supported by KVM on arm64. We noticed that injecting an instruction abort is not fully supported by KVM_SET_VCPU_EVENTS; this patchset complements it.
New UAPIs
=========
This patchset introduces the following userspace-visible changes to empower the VMM to control what happens next for a SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. If userspace has enabled this new capability at VM creation, and the SEA taken is not caused by memory allocated for the stage-2 translation table, KVM returns KVM_EXIT_ARM_SEA to userspace instead of injecting an SError.
- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. Details about the SEA are provided in arm_sea as much as possible, including the sanitized ESR value at EL2, whether the faulting guest virtual and physical addresses (GVA and GPA) are available, and their values when available.
- KVM_CAP_ARM_INJECT_EXT_IABT. Today the VMM can inject an external data abort into a VCPU via the KVM_SET_VCPU_EVENTS API, but it cannot inject an external instruction abort. KVM_CAP_ARM_INJECT_EXT_IABT is a natural extension of KVM_CAP_ARM_INJECT_EXT_DABT that tells the VMM KVM_SET_VCPU_EVENTS now supports external instruction aborts.
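To make the exit payload concrete, here is a hedged userspace-side sketch of decoding the proposed kvm_run.arm_sea payload. The struct arm_sea_exit mirror and the decode helpers are illustrative assumptions (the uAPI header carrying this exit is introduced by this series); the flag values and field layout come from the patches, and the ESR_ELx exception-class encodings are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed kvm_run.arm_sea payload. */
struct arm_sea_exit {
	uint64_t esr;
	uint64_t flags;
	uint64_t gva;
	uint64_t gpa;
};

/* Flag values proposed by this series. */
#define KVM_EXIT_ARM_SEA_FLAG_GVA_VALID (1ULL << 0)
#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 1)

/* Architectural ESR_ELx exception-class encodings. */
#define ESR_ELx_EC_SHIFT    26
#define ESR_ELx_EC_MASK     (0x3FULL << ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC_IABT_LOW 0x20 /* instruction abort from a lower EL */
#define ESR_ELx_EC_DABT_LOW 0x24 /* data abort from a lower EL */

/* True if the sanitized ESR reports a data (not instruction) abort. */
static bool sea_is_data_abort(const struct arm_sea_exit *sea)
{
	return ((sea->esr & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT) ==
	       ESR_ELx_EC_DABT_LOW;
}

/* True if the exit carries a valid faulting guest physical address. */
static bool sea_has_gpa(const struct arm_sea_exit *sea)
{
	return sea->flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
}
```

A VMM run loop would reach such helpers after KVM_RUN returns with run->exit_reason == KVM_EXIT_ARM_SEA, and could then inject a matching external data or instruction abort via KVM_SET_VCPU_EVENTS.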
From v1 [4]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (5):
  KVM: arm64: VM exit to userspace to handle SEA
  KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid
  KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER
  KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT
  Documentation: kvm: new uAPI for handling SEA

Raghavendra Rao Ananta (1):
  KVM: arm64: Allow userspace to inject external instruction aborts
 Documentation/virt/kvm/api.rst                | 128 ++++++-
 arch/arm64/include/asm/kvm_emulate.h          |  67 ++++
 arch/arm64/include/asm/kvm_host.h             |   8 +
 arch/arm64/include/asm/kvm_ras.h              |   2 +-
 arch/arm64/include/uapi/asm/kvm.h             |   3 +-
 arch/arm64/kvm/arm.c                          |   6 +
 arch/arm64/kvm/guest.c                        |  13 +-
 arch/arm64/kvm/inject_fault.c                 |   3 +
 arch/arm64/kvm/mmu.c                          |  59 ++-
 include/uapi/linux/kvm.h                      |  12 +
 tools/arch/arm64/include/asm/esr.h            |   2 +
 tools/arch/arm64/include/uapi/asm/kvm.h       |   3 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../testing/selftests/kvm/arm64/inject_iabt.c |  98 +++++
 .../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 16 files changed, 718 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
 create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
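For illustration, a VMM could opt in at VM creation roughly as sketched below. This is a minimal, hedged sketch: the struct kvm_enable_cap layout mirrors the existing KVM uAPI, and capability number 242 is the value proposed in this series (it may change when merged). On a real arm64 host the filled struct would be passed as ioctl(vm_fd, KVM_ENABLE_CAP, &cap) right after KVM_CREATE_VM.

```c
#include <stdint.h>
#include <string.h>

/* Capability number proposed in this series (may change when merged). */
#define KVM_CAP_ARM_SEA_TO_USER 242

/* Local mirror of struct kvm_enable_cap from the KVM uAPI. */
struct kvm_enable_cap {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

/* Fill the KVM_ENABLE_CAP argument; this capability takes no extra args. */
static void prepare_sea_to_user(struct kvm_enable_cap *cap)
{
	memset(cap, 0, sizeof(*cap));
	cap->cap = KVM_CAP_ARM_SEA_TO_USER;
}
```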
When APEI fails to handle a stage-2 synchronous external abort (SEA), today KVM directly injects an asynchronous SError into the VCPU and then resumes it, which usually results in an unpleasant guest kernel panic.
One major situation of guest SEA is when a vCPU consumes a recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to the VMM to:
- Reduce the blast radius if possible. The VMM can inject a SEA into the VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from the guest kernel, the blast radius can be limited to the triggering thread in guest userspace, so the VM can keep running.
- Protect against future memory poison consumption by unmapping the page from stage-2, or notify the guest about the poisoned guest page so the guest kernel can unmap it from stage-1.
- Track SEA events that VM customers care about, restart the VM when a certain number of distinct poison events have happened, and provide observability to customers in a log management UI.
Introduce a userspace-visible feature to enable the VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative to the fallback behavior when host APEI fails to claim a SEA, userspace can opt in to this new capability to let KVM exit to userspace during SEA if it is not caused by an access on memory of the stage-2 translation table.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this. KVM fills kvm_run.arm_sea with as much information about the SEA as possible, enabling the VMM to emulate the SEA to the guest by itself:
  - Sanitized ESR_EL2. The general rule is to keep only the bits useful for userspace and relevant to guest memory. See code comments for why bits are hidden/reported.
  - Whether the faulting guest virtual and physical addresses are available.
  - The faulting guest virtual address, if available.
  - The faulting guest physical address, if available.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/include/asm/kvm_emulate.h | 67 ++++++++++++++++++++++++++++
 arch/arm64/include/asm/kvm_host.h    |  8 ++++
 arch/arm64/include/asm/kvm_ras.h     |  2 +-
 arch/arm64/kvm/arm.c                 |  5 +++
 arch/arm64/kvm/mmu.c                 | 59 +++++++++++++++++++-----
 include/uapi/linux/kvm.h             | 11 +++++
 6 files changed, 141 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index bd020fc28aa9c..ac602f8503622 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -429,6 +429,73 @@ static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
 	}
 }
 
+/*
+ * Return true if SEA is on an access made for stage-2 translation table walk.
+ */
+static inline bool kvm_vcpu_sea_iss2ttw(const struct kvm_vcpu *vcpu)
+{
+	u64 esr = kvm_vcpu_get_esr(vcpu);
+
+	if (!esr_fsc_is_sea_ttw(esr) && !esr_fsc_is_secc_ttw(esr))
+		return false;
+
+	return !(esr & ESR_ELx_S1PTW);
+}
+
+/*
+ * Sanitize ESR_EL2 before KVM_EXIT_ARM_SEA. The general rule is to keep
+ * only the SEA-relevant bits that are useful for userspace and relevant to
+ * guest memory.
+ */
+static inline u64 kvm_vcpu_sea_esr_sanitized(const struct kvm_vcpu *vcpu)
+{
+	u64 esr = kvm_vcpu_get_esr(vcpu);
+	/*
+	 * Starting with zero to hide the following bits:
+	 * - HDBSSF: hardware dirty state is not guest memory.
+	 * - TnD, TagAccess, AssuredOnly, Overlay, DirtyBit: they are
+	 *   for permission fault.
+	 * - GCS: not guest memory.
+	 * - Xs: it is for translation/access flag/permission fault.
+	 * - ISV: it is 1 mostly for Translation fault, Access flag fault,
+	 *   or Permission fault. Only when FEAT_RAS is not implemented,
+	 *   it may be set to 1 (implementation defined) for S2PTW,
+	 *   which is not worth returning to userspace anyway.
+	 * - ISS[23:14]: because ISV is already hidden.
+	 * - VNCR: VNCR_EL2 is not guest memory.
+	 */
+	u64 sanitized = 0ULL;
+
+	/*
+	 * Reasons to make these bits visible to userspace:
+	 * - EC: tell if abort on instruction or data.
+	 * - IL: useful if userspace decides to retire the instruction.
+	 * - FSC: tell if abort on translation table walk.
+	 * - SET: tell if abort is recoverable, uncontainable, or
+	 *   restartable.
+	 * - S1PTW: userspace can tell guest its stage-1 has problem.
+	 * - FnV: userspace should avoid writing FAR_EL1 if FnV=1.
+	 * - CM and WnR: make ESR "authentic" in general.
+	 */
+	sanitized |= esr & (ESR_ELx_EC_MASK | ESR_ELx_IL | ESR_ELx_FSC |
+			    ESR_ELx_SET_MASK | ESR_ELx_S1PTW | ESR_ELx_FnV |
+			    ESR_ELx_CM | ESR_ELx_WNR);
+
+	return sanitized;
+}
+
+/* Return true if faulting guest virtual address during SEA is valid. */
+static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
+{
+	return !(kvm_vcpu_get_esr(vcpu) & ESR_ELx_FnV);
+}
+
+/* Return true if faulting guest physical address during SEA is valid. */
+static inline bool kvm_vcpu_sea_ipa_valid(const struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.fault.hpfar_el2 & HPFAR_EL2_NS;
+}
+
 static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
 {
 	u64 esr = kvm_vcpu_get_esr(vcpu);
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index d941abc6b5eef..4b27e988ec768 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -349,6 +349,14 @@ struct kvm_arch {
 #define KVM_ARCH_FLAG_GUEST_HAS_SVE			9
 	/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
 #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS		10
+	/*
+	 * When APEI failed to claim a stage-2 synchronous external abort
+	 * (SEA), return to userspace with fault information. Userspace
+	 * can opt in this feature if KVM_CAP_ARM_SEA_TO_USER is
+	 * supported. Userspace is encouraged to handle this VM exit
+	 * by injecting a SEA to VCPU before resuming the VCPU.
+	 */
+#define KVM_ARCH_FLAG_RETURN_SEA_TO_USER		11
 	unsigned long flags;
 
 	/* VM-wide vCPU feature set */
diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
index 9398ade632aaf..760a5e34489b1 100644
--- a/arch/arm64/include/asm/kvm_ras.h
+++ b/arch/arm64/include/asm/kvm_ras.h
@@ -14,7 +14,7 @@
  * Was this synchronous external abort a RAS notification?
  * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
  */
-static inline int kvm_handle_guest_sea(void)
+static inline int kvm_delegate_guest_sea(void)
 {
 	/* apei_claim_sea(NULL) expects to mask interrupts itself */
 	lockdep_assert_irqs_enabled();
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 505d504b52b53..99e0c6c16e437 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -133,6 +133,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_ARM_SEA_TO_USER:
+		r = 0;
+		set_bit(KVM_ARCH_FLAG_RETURN_SEA_TO_USER, &kvm->arch.flags);
+		break;
 	default:
 		break;
 	}
@@ -322,6 +326,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
 	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+	case KVM_CAP_ARM_SEA_TO_USER:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e445db2cb4a43..5a50d0ed76a68 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1775,6 +1775,53 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 	read_unlock(&vcpu->kvm->mmu_lock);
 }
 
+/* Handle stage-2 synchronous external abort (SEA). */
+static int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+
+	/* Delegate to APEI for RAS and if it can claim SEA, resume guest. */
+	if (kvm_delegate_guest_sea() == 0)
+		return 1;
+
+	/*
+	 * In addition to userspace opting out of
+	 * KVM_ARCH_FLAG_RETURN_SEA_TO_USER, when the SEA is caused on memory
+	 * for stage-2 page table, returning to userspace doesn't bring any
+	 * benefit: eventually an EL2 exception will crash the host kernel.
+	 */
+	if (!test_bit(KVM_ARCH_FLAG_RETURN_SEA_TO_USER,
+		      &vcpu->kvm->arch.flags) ||
+	    kvm_vcpu_sea_iss2ttw(vcpu)) {
+		/* Fallback behavior prior to KVM_EXIT_ARM_SEA. */
+		kvm_inject_vabt(vcpu);
+		return 1;
+	}
+
+	/*
+	 * Exit to userspace, and provide faulting guest virtual and physical
+	 * addresses in case userspace wants to emulate SEA to guest by
+	 * writing to FAR_EL1 and HPFAR_EL1 registers.
+	 */
+	run->exit_reason = KVM_EXIT_ARM_SEA;
+	run->arm_sea.esr = kvm_vcpu_sea_esr_sanitized(vcpu);
+	run->arm_sea.flags = 0ULL;
+	run->arm_sea.gva = 0ULL;
+	run->arm_sea.gpa = 0ULL;
+
+	if (kvm_vcpu_sea_far_valid(vcpu)) {
+		run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GVA_VALID;
+		run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
+	}
+
+	if (kvm_vcpu_sea_ipa_valid(vcpu)) {
+		run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
+		run->arm_sea.gpa = kvm_vcpu_get_fault_ipa(vcpu);
+	}
+
+	return 0;
+}
+
 /**
  * kvm_handle_guest_abort - handles all 2nd stage aborts
  * @vcpu:	the VCPU pointer
@@ -1799,16 +1846,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	int ret, idx;
 
 	/* Synchronous External Abort? */
-	if (kvm_vcpu_abt_issea(vcpu)) {
-		/*
-		 * For RAS the host kernel may handle this abort.
-		 * There is no need to pass the error into the guest.
-		 */
-		if (kvm_handle_guest_sea())
-			kvm_inject_vabt(vcpu);
-
-		return 1;
-	}
+	if (kvm_vcpu_abt_issea(vcpu))
+		return kvm_handle_guest_sea(vcpu);
 
 	esr = kvm_vcpu_get_esr(vcpu);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c9d4a908976e8..4fed3fdfb13d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -178,6 +178,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_ARM_SEA          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -446,6 +447,15 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_ARM_SEA */
+		struct {
+			__u64 esr;
+#define KVM_EXIT_ARM_SEA_FLAG_GVA_VALID	(1ULL << 0)
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID	(1ULL << 1)
+			__u64 flags;
+			__u64 gva;
+			__u64 gpa;
+		} arm_sea;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -932,6 +942,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
+#define KVM_CAP_ARM_SEA_TO_USER 242
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
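To help review the keep-mask in kvm_vcpu_sea_esr_sanitized(), a userspace model of the sanitizer is sketched below. The helper applies the same keep-mask to a raw ESR value; the bit positions are the architectural ESR_ELx encodings, and the function name is an illustrative assumption.

```c
#include <stdint.h>

/* Architectural ESR_ELx bit positions for the fields kept by the patch. */
#define ESR_ELx_EC_MASK   (0x3FULL << 26)
#define ESR_ELx_IL        (1ULL << 25)
#define ESR_ELx_SET_MASK  (3ULL << 11)
#define ESR_ELx_FnV       (1ULL << 10)
#define ESR_ELx_CM        (1ULL << 8)
#define ESR_ELx_S1PTW     (1ULL << 7)
#define ESR_ELx_WNR       (1ULL << 6)
#define ESR_ELx_FSC       0x3FULL

/* Userspace model of kvm_vcpu_sea_esr_sanitized(): keep only these bits. */
static uint64_t sea_esr_sanitize(uint64_t esr)
{
	return esr & (ESR_ELx_EC_MASK | ESR_ELx_IL | ESR_ELx_FSC |
		      ESR_ELx_SET_MASK | ESR_ELx_S1PTW | ESR_ELx_FnV |
		      ESR_ELx_CM | ESR_ELx_WNR);
}
```

For instance, an ESR with ISV (bit 24) set comes out with that bit cleared while the EC and FSC fields survive, matching the hide/report lists in the patch's comments.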
On Tue, Jun 3, 2025 at 10:09 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
Humbly ping for reviews / comments
Hi Jiaqi,
On Wed, Jun 04, 2025 at 05:08:56AM +0000, Jiaqi Yan wrote:
I was reviewing this locally and wound up making enough changes where it just made more sense to share the diff. General comments:
- Avoid adding helpers to headers when they're used in a single callsite / compilation unit
- Add some detail about FEAT_RAS where we may still exit to userspace for host-controlled memory, as we cannot differentiate between a stage-1 or stage-2 TTW SEA when taken on the descriptor PA
- Explicitly handle SEAs due to VNCR (I have a separate prereq patch)
From aac0bb8f90c43b5b17c3b4e50379cb8ca828812c Mon Sep 17 00:00:00 2001
From: Jiaqi Yan <jiaqiyan@google.com>
Date: Wed, 4 Jun 2025 05:08:56 +0000
Subject: [PATCH] KVM: arm64: VM exit to userspace to handle SEA
When APEI fails to handle a stage-2 synchronous external abort (SEA), today KVM directly injects an asynchronous SError into the VCPU and then resumes it, which usually results in an unpleasant guest kernel panic.

One major situation of guest SEA is when a vCPU consumes a recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.

Alternatively KVM can redirect the synchronous SEA event to the VMM to:
- Reduce the blast radius if possible. The VMM can inject a SEA into the VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from the guest kernel, the blast radius can be limited to the triggering thread in guest userspace, so the VM can keep running.
- Protect against future memory poison consumption by unmapping the page from stage-2, or notify the guest about the poisoned guest page so the guest kernel can unmap it from stage-1.
- Track SEA events that VM customers care about, restart the VM when a certain number of distinct poison events have happened, and provide observability to customers in a log management UI.

Introduce a userspace-visible feature to enable the VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative to the fallback behavior when host APEI fails to claim a SEA, userspace can opt in to this new capability to let KVM exit to userspace during SEA if it is not caused by an access on memory of the stage-2 translation table.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this. KVM fills kvm_run.arm_sea with as much information about the SEA as possible, enabling the VMM to emulate the SEA to the guest by itself:
  - Sanitized ESR_EL2. The general rule is to keep only the bits useful for userspace and relevant to guest memory. See code comments for why bits are hidden/reported.
  - Whether the faulting guest virtual and physical addresses are available.
  - The faulting guest virtual address, if available.
  - The faulting guest physical address, if available.
Signed-off-by: Jiaqi Yan jiaqiyan@google.com Link: https://lore.kernel.org/r/20250604050902.3944054-2-jiaqiyan@google.com Signed-off-by: Oliver Upton oliver.upton@linux.dev --- arch/arm64/include/asm/kvm_host.h | 2 + arch/arm64/kvm/arm.c | 5 +++ arch/arm64/kvm/mmu.c | 67 ++++++++++++++++++++++++++++++- include/uapi/linux/kvm.h | 10 +++++ 4 files changed, 83 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e54d29feb469..98ce2d58ac8d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -349,6 +349,8 @@ struct kvm_arch {
 #define KVM_ARCH_FLAG_GUEST_HAS_SVE		9
 	/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
 #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS	10
+	/* Unhandled SEAs are taken to userspace */
+#define KVM_ARCH_FLAG_EXIT_SEA			11
 	unsigned long flags;
 
 	/* VM-wide vCPU feature set */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7a1a8210ff91..aec6034db1e7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -133,6 +133,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_ARM_SEA_TO_USER:
+		r = 0;
+		set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
+		break;
 	default:
 		break;
 	}
@@ -322,6 +326,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
 	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+	case KVM_CAP_ARM_SEA_TO_USER:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a34924d75069..26b2e71994be 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1813,8 +1813,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 	read_unlock(&vcpu->kvm->mmu_lock);
 }
 
+/*
+ * Returns true if the SEA should be handled locally within KVM, i.e. the
+ * abort was caused by an access to a kernel memory allocation (e.g. stage-2
+ * table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
+	/*
+	 * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
+	 * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
+	 * stage-2 PTW).
+	 */
+	if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
+		return true;
+
+	/* KVM owns the VNCR when the vCPU isn't in a nested context. */
+	if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
+		return true;
+
+	/*
+	 * Determining if an external abort during a table walk happened at
+	 * stage-2 is only possible when S1PTW is set. Otherwise, since KVM
+	 * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the PA
+	 * of the stage-1 descriptor) can reach here and are reported with a
+	 * TTW ESR value.
+	 */
+	return esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW);
+}
+
 int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
 {
+	u64 esr = kvm_vcpu_get_esr(vcpu);
+	struct kvm_run *run = vcpu->run;
+	struct kvm *kvm = vcpu->kvm;
+	u64 esr_mask = ESR_ELx_EC_MASK |
+		       ESR_ELx_FnV |
+		       ESR_ELx_EA |
+		       ESR_ELx_CM |
+		       ESR_ELx_WNR |
+		       ESR_ELx_FSC;
+	u64 ipa;
+
 	/*
 	 * Give APEI the opportunity to claim the abort before handling it
 	 * within KVM. apei_claim_sea() expects to be called with IRQs
@@ -1824,7 +1864,32 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
 	if (apei_claim_sea(NULL) == 0)
 		return 1;
 
-	return kvm_inject_serror(vcpu);
+	if (host_owns_sea(vcpu, esr) || !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags))
+		return kvm_inject_serror(vcpu);
+
+	/* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
+	if (kvm_has_ras(kvm))
+		esr_mask |= ESR_ELx_SET_MASK;
+
+	/*
+	 * Exit to userspace, and provide faulting guest virtual and physical
+	 * addresses in case userspace wants to emulate SEA to guest by
+	 * writing to FAR_EL1 and HPFAR_EL1 registers.
+	 */
+	memset(&run->arm_sea, 0, sizeof(run->arm_sea));
+	run->exit_reason = KVM_EXIT_ARM_SEA;
+	run->arm_sea.esr = esr & esr_mask;
+
+	if (!(esr & ESR_ELx_FnV))
+		run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
+
+	ipa = kvm_vcpu_get_fault_ipa(vcpu);
+	if (ipa != INVALID_GPA) {
+		run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
+		run->arm_sea.gpa = ipa;
+	}
+
+	return 0;
 }
 
 /**
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e4e566ff348b..b2cc3d74d769 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -179,6 +179,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
 #define KVM_EXIT_TDX              40
+#define KVM_EXIT_ARM_SEA          41
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -469,6 +470,14 @@ struct kvm_run {
 			} get_tdvmcall_info;
 		};
 	} tdx;
+		/* KVM_EXIT_ARM_SEA */
+		struct {
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID	(1ULL << 0)
+			__u64 flags;
+			__u64 esr;
+			__u64 gva;
+			__u64 gpa;
+		} arm_sea;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -957,6 +966,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
+#define KVM_CAP_ARM_SEA_TO_USER 244
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
On Fri, Jul 11, 2025 at 12:40 PM Oliver Upton oliver.upton@linux.dev wrote:
Hi Jiaqi,
On Wed, Jun 04, 2025 at 05:08:56AM +0000, Jiaqi Yan wrote:
When APEI fails to handle a stage-2 synchronous external abort (SEA), today KVM directly injects an async SError to the VCPU then resumes it, which usually results in unpleasant guest kernel panic.
One major situation of guest SEA is when a vCPU consumes a recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to VMM to
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from guest kernel, blast radius can be limited to the triggering thread in guest userspace, so VM can keep running.
- VMM can protect against future memory poison consumption by unmapping the page from stage-2, or by notifying the guest of the poisoned page so the guest kernel can unmap it from stage-1.
- VMM can also track SEA events that VM customers care about, restart the VM when a certain number of distinct poison events have happened, and provide observability to customers in a log management UI.
Introduce a userspace-visible feature to enable VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior when host APEI fails to claim a SEA, userspace can opt in to this new capability to let KVM exit to userspace during a SEA, as long as the abort is not caused by an access to stage-2 translation table memory.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this. KVM fills kvm_run.arm_sea with as much information about the SEA as possible, enabling VMM to emulate the SEA to the guest by itself.
- Sanitized ESR_EL2. The general rule is to keep only the bits useful for userspace and relevant to guest memory. See code comments for why bits are hidden/reported.
- If faulting guest virtual and physical addresses are available.
- Faulting guest virtual address if available.
- Faulting guest physical address if available.
Signed-off-by: Jiaqi Yan jiaqiyan@google.com
I was reviewing this locally and wound up making enough changes that it just made more sense to share the diff. General comments:
Thanks for the diff, Oliver! I will work on a v3 based on it.
Avoid adding helpers to headers when they're used in a single callsite / compilation unit
Add some detail about FEAT_RAS where we may still exit to userspace for host-controlled memory, as we cannot differentiate between a stage-1 or stage-2 TTW SEA when taken on the descriptor PA
Ah, IIUC, you are saying even if the FSC code tells us the fault is on a TTW (esr_fsc_is_secc_ttw or esr_fsc_is_sea_ttw), the descriptor PA can belong to either the guest's stage-1 or the stage-2 tables, and we cannot tell which is which.
However, if ESR_ELx_S1PTW is set, we can tell this is the sub-case of a stage-2 descriptor PA: the descriptors are used for the stage-1 PTW, but they are stage-2 memory.
Is my current understanding right?
- Explicitly handle SEAs due to VNCR (I have a separate prereq patch)
From aac0bb8f90c43b5b17c3b4e50379cb8ca828812c Mon Sep 17 00:00:00 2001
From: Jiaqi Yan <jiaqiyan@google.com>
Date: Wed, 4 Jun 2025 05:08:56 +0000
Subject: [PATCH] KVM: arm64: VM exit to userspace to handle SEA
When APEI fails to handle a stage-2 synchronous external abort (SEA), today KVM directly injects an async SError to the VCPU then resumes it, which usually results in unpleasant guest kernel panic.
One major situation of guest SEA is when a vCPU consumes a recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to VMM to
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from guest kernel, blast radius can be limited to the triggering thread in guest userspace, so VM can keep running.
- VMM can protect against future memory poison consumption by unmapping the page from stage-2, or by notifying the guest of the poisoned page so the guest kernel can unmap it from stage-1.
- VMM can also track SEA events that VM customers care about, restart the VM when a certain number of distinct poison events have happened, and provide observability to customers in a log management UI.
Introduce a userspace-visible feature to enable VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior when host APEI fails to claim a SEA, userspace can opt in to this new capability to let KVM exit to userspace during a SEA, as long as the abort is not caused by an access to stage-2 translation table memory.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this. KVM fills kvm_run.arm_sea with as much information about the SEA as possible, enabling VMM to emulate the SEA to the guest by itself.
- Sanitized ESR_EL2. The general rule is to keep only the bits useful for userspace and relevant to guest memory. See code comments for why bits are hidden/reported.
- If faulting guest virtual and physical addresses are available.
- Faulting guest virtual address if available.
- Faulting guest physical address if available.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Link: https://lore.kernel.org/r/20250604050902.3944054-2-jiaqiyan@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_host.h |  2 +
 arch/arm64/kvm/arm.c              |  5 +++
 arch/arm64/kvm/mmu.c              | 67 ++++++++++++++++++++++++++++++-
 include/uapi/linux/kvm.h          | 10 +++++
 4 files changed, 83 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e54d29feb469..98ce2d58ac8d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -349,6 +349,8 @@ struct kvm_arch {
 #define KVM_ARCH_FLAG_GUEST_HAS_SVE		9
 	/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
 #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS	10
+	/* Unhandled SEAs are taken to userspace */
+#define KVM_ARCH_FLAG_EXIT_SEA			11
 	unsigned long flags;
/* VM-wide vCPU feature set */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7a1a8210ff91..aec6034db1e7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -133,6 +133,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_ARM_SEA_TO_USER:
+		r = 0;
+		set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
+		break;
 	default:
 		break;
 	}
@@ -322,6 +326,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
 	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+	case KVM_CAP_ARM_SEA_TO_USER:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a34924d75069..26b2e71994be 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1813,8 +1813,48 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 	read_unlock(&vcpu->kvm->mmu_lock);
 }
 
+/*
+ * Returns true if the SEA should be handled locally within KVM, i.e. the
+ * abort was caused by an access to a kernel memory allocation (e.g. stage-2
+ * table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
/*
* Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
* taken from a guest EL to EL2 is due to a host-imposed access (e.g.
* stage-2 PTW).
*/
if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
return true;
/* KVM owns the VNCR when the vCPU isn't in a nested context. */
if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
return true;
/*
* Determining if an external abort during a table walk happened at
* stage-2 is only possible when S1PTW is set. Otherwise, since KVM
* sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the PA
* of the stage-1 descriptor) can reach here and are reported with a
* TTW ESR value.
*/
return esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW);
Should we include esr_fsc_is_secc_ttw? like (esr_fsc_is_sea_ttw(esr) || esr_fsc_is_secc_ttw(esr)) && (esr & ESR_ELx_S1PTW)
+}
int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
{
u64 esr = kvm_vcpu_get_esr(vcpu);
struct kvm_run *run = vcpu->run;
struct kvm *kvm = vcpu->kvm;
u64 esr_mask = ESR_ELx_EC_MASK |
ESR_ELx_FnV |
ESR_ELx_EA |
ESR_ELx_CM |
ESR_ELx_WNR |
ESR_ELx_FSC;
Do you exclude ESR_ELx_IL on purpose, and if so, why?
BTW, if my previous statement about TTW SEA is correct, then I also understand why we need to explicitly exclude ESR_ELx_S1PTW.
u64 ipa;
	/*
	 * Give APEI the opportunity to claim the abort before handling it
	 * within KVM. apei_claim_sea() expects to be called with IRQs
@@ -1824,7 +1864,32 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
 	if (apei_claim_sea(NULL) == 0)
I assume kvm should still lockdep_assert_irqs_enabled(), right? That is, a WARN_ON_ONCE is still useful in case?
return 1;
return kvm_inject_serror(vcpu);
if (host_owns_sea(vcpu, esr) || !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags))
return kvm_inject_serror(vcpu);
/* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
if (kvm_has_ras(kvm))
esr_mask |= ESR_ELx_SET_MASK;
/*
* Exit to userspace, and provide faulting guest virtual and physical
* addresses in case userspace wants to emulate SEA to guest by
* writing to FAR_EL1 and HPFAR_EL1 registers.
*/
memset(&run->arm_sea, 0, sizeof(run->arm_sea));
run->exit_reason = KVM_EXIT_ARM_SEA;
run->arm_sea.esr = esr & esr_mask;
if (!(esr & ESR_ELx_FnV))
run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
ipa = kvm_vcpu_get_fault_ipa(vcpu);
if (ipa != INVALID_GPA) {
run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
run->arm_sea.gpa = ipa;
}
return 0;
}
 /**
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e4e566ff348b..b2cc3d74d769 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -179,6 +179,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
 #define KVM_EXIT_TDX              40
+#define KVM_EXIT_ARM_SEA          41
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -469,6 +470,14 @@ struct kvm_run {
 			} get_tdvmcall_info;
 		};
 	} tdx;
/* KVM_EXIT_ARM_SEA */
struct {
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
__u64 flags;
__u64 esr;
__u64 gva;
__u64 gpa;
 		} arm_sea;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -957,6 +966,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
+#define KVM_CAP_ARM_SEA_TO_USER 244
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
--
2.39.5
On Fri, Jul 11, 2025 at 04:59:11PM -0700, Jiaqi Yan wrote:
- Add some detail about FEAT_RAS where we may still exit to userspace for host-controlled memory, as we cannot differentiate between a stage-1 or stage-2 TTW SEA when taken on the descriptor PA
Ah, IIUC, you are saying even if the FSC code tells us the fault is on a TTW (esr_fsc_is_secc_ttw or esr_fsc_is_sea_ttw), the descriptor PA can belong to either the guest's stage-1 or the stage-2 tables, and we cannot tell which is which.
However, if ESR_ELx_S1PTW is set, we can tell this is the sub-case of a stage-2 descriptor PA: the descriptors are used for the stage-1 PTW, but they are stage-2 memory.
Is my current understanding right?
Yep, that's exactly what I'm getting at. As you note, stage-2 aborts during a stage-1 walk are sufficiently described, but not much else.
+/*
+ * Returns true if the SEA should be handled locally within KVM, i.e. the
+ * abort was caused by an access to a kernel memory allocation (e.g. stage-2
+ * table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
/*
* Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
* taken from a guest EL to EL2 is due to a host-imposed access (e.g.
* stage-2 PTW).
*/
if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
return true;
/* KVM owns the VNCR when the vCPU isn't in a nested context. */
if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
return true;
/*
* Determining if an external abort during a table walk happened at
* stage-2 is only possible when S1PTW is set. Otherwise, since KVM
* sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the PA
* of the stage-1 descriptor) can reach here and are reported with a
* TTW ESR value.
*/
return esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW);
Should we include esr_fsc_is_secc_ttw? like (esr_fsc_is_sea_ttw(esr) || esr_fsc_is_secc_ttw(esr)) && (esr & ESR_ELx_S1PTW)
Parity / ECC errors are not permitted if FEAT_RAS is implemented (which is tested for up front).
+}
int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
{
u64 esr = kvm_vcpu_get_esr(vcpu);
struct kvm_run *run = vcpu->run;
struct kvm *kvm = vcpu->kvm;
u64 esr_mask = ESR_ELx_EC_MASK |
ESR_ELx_FnV |
ESR_ELx_EA |
ESR_ELx_CM |
ESR_ELx_WNR |
ESR_ELx_FSC;
Do you exclude ESR_ELx_IL on purpose, and if so, why?
Unintended :)
BTW, if my previous statement about TTW SEA is correct, then I also understand why we need to explicitly exclude ESR_ELx_S1PTW.
Right, we shouldn't be exposing genuine stage-2 external aborts to userspace.
u64 ipa;
	/*
	 * Give APEI the opportunity to claim the abort before handling it
	 * within KVM. apei_claim_sea() expects to be called with IRQs
@@ -1824,7 +1864,32 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
 	if (apei_claim_sea(NULL) == 0)
I assume kvm should still lockdep_assert_irqs_enabled(), right? That is, a WARN_ON_ONCE is still useful in case?
Ah, this is diffed against my VNCR prefix which has this context. Yes, I want to preserve the lockdep assertion.
From eb63dbf07b3d1f42b059f5c94abd147d195299c8 Mon Sep 17 00:00:00 2001
From: Oliver Upton <oliver.upton@linux.dev>
Date: Thu, 10 Jul 2025 17:14:51 -0700
Subject: [PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_mmu.h |  1 +
 arch/arm64/include/asm/kvm_ras.h | 25 -------------------------
 arch/arm64/kvm/mmu.c             | 30 ++++++++++++++++++------------
 arch/arm64/kvm/nested.c          |  3 +++
 4 files changed, 22 insertions(+), 37 deletions(-)
 delete mode 100644 arch/arm64/include/asm/kvm_ras.h
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index ae563ebd6aee..e4069f2ce642 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -180,6 +180,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 			  phys_addr_t pa, unsigned long size, bool writable);
 
+int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 
 phys_addr_t kvm_mmu_get_httbr(void);
diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
deleted file mode 100644
index 9398ade632aa..000000000000
--- a/arch/arm64/include/asm/kvm_ras.h
+++ /dev/null
@@ -1,25 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* Copyright (C) 2018 - Arm Ltd */
-
-#ifndef __ARM64_KVM_RAS_H__
-#define __ARM64_KVM_RAS_H__
-
-#include <linux/acpi.h>
-#include <linux/errno.h>
-#include <linux/types.h>
-
-#include <asm/acpi.h>
-
-/*
- * Was this synchronous external abort a RAS notification?
- * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
- */
-static inline int kvm_handle_guest_sea(void)
-{
-	/* apei_claim_sea(NULL) expects to mask interrupts itself */
-	lockdep_assert_irqs_enabled();
-
-	return apei_claim_sea(NULL);
-}
-
-#endif /* __ARM64_KVM_RAS_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1c78864767c5..6934f4acdc45 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -4,19 +4,20 @@
  * Author: Christoffer Dall <c.dall@virtualopensystems.com>
  */
 
+#include <linux/acpi.h>
 #include <linux/mman.h>
 #include <linux/kvm_host.h>
 #include <linux/io.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/signal.h>
 #include <trace/events/kvm.h>
+#include <asm/acpi.h>
 #include <asm/pgalloc.h>
 #include <asm/cacheflush.h>
 #include <asm/kvm_arm.h>
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_pgtable.h>
 #include <asm/kvm_pkvm.h>
-#include <asm/kvm_ras.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/virt.h>
@@ -1811,6 +1812,20 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 	read_unlock(&vcpu->kvm->mmu_lock);
 }
 
+int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Give APEI the opportunity to claim the abort before handling it
+	 * within KVM. apei_claim_sea() expects to be called with IRQs
+	 * enabled.
+	 */
+	lockdep_assert_irqs_enabled();
+	if (apei_claim_sea(NULL) == 0)
+		return 1;
+
+	return kvm_inject_serror(vcpu);
+}
+
 /**
  * kvm_handle_guest_abort - handles all 2nd stage aborts
  * @vcpu:	the VCPU pointer
@@ -1834,17 +1849,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	gfn_t gfn;
 	int ret, idx;
 
-	/* Synchronous External Abort? */
-	if (kvm_vcpu_abt_issea(vcpu)) {
-		/*
-		 * For RAS the host kernel may handle this abort.
-		 * There is no need to pass the error into the guest.
-		 */
-		if (kvm_handle_guest_sea())
-			return kvm_inject_serror(vcpu);
-
-		return 1;
-	}
+	if (kvm_vcpu_abt_issea(vcpu))
+		return kvm_handle_guest_sea(vcpu);
 
 	esr = kvm_vcpu_get_esr(vcpu);
 
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 096747a61bf6..38b0e3a9a6db 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1289,6 +1289,9 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
 
 	BUG_ON(!(esr & ESR_ELx_VNCR_SHIFT));
 
+	if (kvm_vcpu_abt_issea(vcpu))
+		return kvm_handle_guest_sea(vcpu);
+
 	if (esr_fsc_is_permission_fault(esr)) {
 		inject_vncr_perm(vcpu);
 	} else if (esr_fsc_is_translation_fault(esr)) {
On Sat, Jul 12, 2025 at 12:57 PM Oliver Upton oliver.upton@linux.dev wrote:
On Fri, Jul 11, 2025 at 04:59:11PM -0700, Jiaqi Yan wrote:
- Add some detail about FEAT_RAS where we may still exit to userspace for host-controlled memory, as we cannot differentiate between a stage-1 or stage-2 TTW SEA when taken on the descriptor PA
Ah, IIUC, you are saying even if the FSC code tells fault is on TTW (esr_fsc_is_secc_ttw or esr_fsc_is_sea_ttw), it can either be guest stage-1's or stage-2's descriptor PA, and we can tell which from which.
However, if ESR_ELx_S1PTW is set, we can tell this is a sub-case of stage-2 descriptor PA, their usage is for stage-1 PTW but they are stage-2 memory.
Is my current understanding right?
Yep, that's exactly what I'm getting at. As you note, stage-2 aborts during a stage-1 walk are sufficiently described, but not much else.
Got it, thanks!
+/*
+ * Returns true if the SEA should be handled locally within KVM, i.e. the
+ * abort was caused by an access to a kernel memory allocation (e.g. stage-2
+ * table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
/*
* Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
* taken from a guest EL to EL2 is due to a host-imposed access (e.g.
* stage-2 PTW).
*/
if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
return true;
/* KVM owns the VNCR when the vCPU isn't in a nested context. */
if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
return true;
/*
* Determining if an external abort during a table walk happened at
* stage-2 is only possible when S1PTW is set. Otherwise, since KVM
* sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the PA
* of the stage-1 descriptor) can reach here and are reported with a
* TTW ESR value.
*/
return esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW);
Should we include esr_fsc_is_secc_ttw? like (esr_fsc_is_sea_ttw(esr) || esr_fsc_is_secc_ttw(esr)) && (esr & ESR_ELx_S1PTW)
Parity / ECC errors are not permitted if FEAT_RAS is implemented (which is tested for up front).
Ah, thanks for pointing this out.
+}
int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
{
u64 esr = kvm_vcpu_get_esr(vcpu);
struct kvm_run *run = vcpu->run;
struct kvm *kvm = vcpu->kvm;
u64 esr_mask = ESR_ELx_EC_MASK |
ESR_ELx_FnV |
ESR_ELx_EA |
ESR_ELx_CM |
ESR_ELx_WNR |
ESR_ELx_FSC;
Do you exclude ESR_ELx_IL on purpose, and if so, why?
Unintended :)
Will add into my patch.
BTW, if my previous statement about TTW SEA is correct, then I also understand why we need to explicitly exclude ESR_ELx_S1PTW.
Right, we shouldn't be exposing genuine stage-2 external aborts to userspace.
u64 ipa;
	/*
	 * Give APEI the opportunity to claim the abort before handling it
	 * within KVM. apei_claim_sea() expects to be called with IRQs
@@ -1824,7 +1864,32 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
 	if (apei_claim_sea(NULL) == 0)
I assume kvm should still lockdep_assert_irqs_enabled(), right? That is, a WARN_ON_ONCE is still useful in case?
Ah, this is diffed against my VNCR prefix which has this context. Yes, I want to preserve the lockdep assertion.
Thanks for sharing the patch! Should I wait for you to send and queue to kvmarm/next and rebase my v3 to it? Or should I insert it into my v3 patch series with you as the commit author, and Signed-off-by you?
BTW, while I am working on v3, I think it is probably better to split the current patchset into two: the first for KVM_EXIT_ARM_SEA, and the second for injecting a (D|I)ABT with a user-supplied ESR. This may help KVM_EXIT_ARM_SEA, the more important feature, get reviewed and accepted sooner. I will send out a separate patchset for enhancing guest SEA injection.
Certain microarchitectures (e.g. Neoverse V2) do not keep track of the faulting address for a memory load that consumes poisoned data and results in a synchronous external abort (SEA). IOW, both the FAR_EL2 register and kvm_vcpu_get_hfar() hold a garbage value.
In case VMM later relies entirely on KVM to synchronously inject a SEA into the guest, KVM should set the FnV bit in the VCPU's
- ESR_EL1, to let the guest kernel know FAR_EL1 is invalid
- ESR_EL2, to let nested virtualization know FAR_EL2 is invalid
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/kvm/inject_fault.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index a640e839848e6..b4f9a09952ead 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -81,6 +81,9 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
 	if (!is_iabt)
 		esr |= ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT;
 
+	if (!kvm_vcpu_sea_far_valid(vcpu))
+		esr |= ESR_ELx_FnV;
+
 	esr |= ESR_ELx_FSC_EXTABT;
if (match_target_el(vcpu, unpack_vcpu_flag(EXCEPT_AA64_EL1_SYNC))) {
From: Raghavendra Rao Ananta rananta@google.com
When KVM returns to userspace for KVM_EXIT_ARM_SEA, userspace is encouraged to inject the abort into the guest via KVM_SET_VCPU_EVENTS.
KVM_SET_VCPU_EVENTS currently only allows injecting external data aborts. However, the synchronous external abort that caused KVM_EXIT_ARM_SEA may also be an instruction abort. Userspace can already tell whether an abort is due to a data or instruction access via kvm_run.arm_sea.esr, by checking its Exception Class value.
Extend the KVM_SET_VCPU_EVENTS ioctl to allow injecting an instruction abort into the guest.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/include/uapi/asm/kvm.h |  3 ++-
 arch/arm64/kvm/arm.c              |  1 +
 arch/arm64/kvm/guest.c            | 13 ++++++++++---
 include/uapi/linux/kvm.h          |  1 +
 4 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index ed5f3892674c7..643e8c4825451 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,8 +184,9 @@ struct kvm_vcpu_events {
 		__u8 serror_pending;
 		__u8 serror_has_esr;
 		__u8 ext_dabt_pending;
+		__u8 ext_iabt_pending;
 		/* Align it to 8 bytes */
-		__u8 pad[5];
+		__u8 pad[4];
 		__u64 serror_esr;
 	} exception;
 	__u32 reserved[12];
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 99e0c6c16e437..78e8a82c38cfc 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -319,6 +319,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_IRQ_LINE_LAYOUT_2:
 	case KVM_CAP_ARM_NISV_TO_USER:
 	case KVM_CAP_ARM_INJECT_EXT_DABT:
+	case KVM_CAP_ARM_INJECT_EXT_IABT:
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 2196979a24a32..4917361ecf5cb 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -825,9 +825,9 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 		events->exception.serror_esr = vcpu_get_vsesr(vcpu);
 
 	/*
-	 * We never return a pending ext_dabt here because we deliver it to
-	 * the virtual CPU directly when setting the event and it's no longer
-	 * 'pending' at this point.
+	 * We never return a pending ext_dabt or ext_iabt here because we
+	 * deliver it to the virtual CPU directly when setting the event
+	 * and it's no longer 'pending' at this point.
 	 */
 
 	return 0;
@@ -839,6 +839,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 	bool serror_pending = events->exception.serror_pending;
 	bool has_esr = events->exception.serror_has_esr;
 	bool ext_dabt_pending = events->exception.ext_dabt_pending;
+	bool ext_iabt_pending = events->exception.ext_iabt_pending;
 
 	if (serror_pending && has_esr) {
 		if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
@@ -852,8 +853,14 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 		kvm_inject_vabt(vcpu);
 	}
 
+	/* DABT and IABT cannot happen at the same time. */
+	if (ext_dabt_pending && ext_iabt_pending)
+		return -EINVAL;
+
 	if (ext_dabt_pending)
 		kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
+	else if (ext_iabt_pending)
+		kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
 
 	return 0;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fed3fdfb13d6..2fc3775ac1183 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -943,6 +943,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_ARM_SEA_TO_USER 242
+#define KVM_CAP_ARM_INJECT_EXT_IABT 243
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
On Wed, Jun 04, 2025 at 05:08:58AM +0000, Jiaqi Yan wrote:
From: Raghavendra Rao Ananta rananta@google.com
When KVM returns to userspace for KVM_EXIT_ARM_SEA, the userspace is encouraged to inject the abort into the guest via KVM_SET_VCPU_EVENTS.
KVM_SET_VCPU_EVENTS currently only allows injecting external data aborts. However, the synchronous external abort that caused KVM_EXIT_ARM_SEA is possible to be an instruction abort. Userspace is already able to tell if an abort is due to data or instruction via kvm_run.arm_sea.esr, by checking its Exception Class value.
Extend the KVM_SET_VCPU_EVENTS ioctl to allow injecting instruction abort into the guest.
Signed-off-by: Raghavendra Rao Ananta rananta@google.com Signed-off-by: Jiaqi Yan jiaqiyan@google.com
Hmm. Since we expose an ESR value to userspace I get the feeling that we should allow the user to supply an ISS for the external abort, similar to what we already do for SErrors.
Thanks, Oliver
On Fri, Jul 11, 2025 at 12:42 PM Oliver Upton oliver.upton@linux.dev wrote:
On Wed, Jun 04, 2025 at 05:08:58AM +0000, Jiaqi Yan wrote:
From: Raghavendra Rao Ananta rananta@google.com
When KVM returns to userspace for KVM_EXIT_ARM_SEA, the userspace is encouraged to inject the abort into the guest via KVM_SET_VCPU_EVENTS.
KVM_SET_VCPU_EVENTS currently only allows injecting external data aborts. However, the synchronous external abort that caused KVM_EXIT_ARM_SEA is possible to be an instruction abort. Userspace is already able to tell if an abort is due to data or instruction via kvm_run.arm_sea.esr, by checking its Exception Class value.
Extend the KVM_SET_VCPU_EVENTS ioctl to allow injecting instruction abort into the guest.
Signed-off-by: Raghavendra Rao Ananta rananta@google.com Signed-off-by: Jiaqi Yan jiaqiyan@google.com
Hmm. Since we expose an ESR value to userspace I get the feeling that we should allow the user to supply an ISS for the external abort, similar to what we already do for SErrors.
Oh, I will create something in v3, by extending kvm_vcpu_events to something like:
struct {
	__u8 serror_pending;
	__u8 serror_has_esr;
	__u8 ext_dabt_pending;
	__u8 ext_iabt_pending;
	__u8 ext_abt_has_esr;	// <= new
	/* Align it to 8 bytes */
	__u8 pad[3];
	union {
		__u64 serror_esr;
		__u64 ext_abt_esr;	// <= new
	};
} exception;
One question about the naming, since we cannot change it once committed. Taking the existing SError injection as an example: although the name in kvm_vcpu_events is serror_has_esr, it is essentially just the ISS field of the ESR (as also documented in virt/kvm/api.rst). Why is it named after "esr" instead of "iss"? The only reason I can think of is that KVM wants to leave room to accept more fields than the ISS from userspace. Does this reason apply to external aborts? Asking because, if "iss" is the better name in kvm_vcpu_events, then for external aborts maybe we should use ext_abt_has_iss?
Thanks, Oliver
On Fri, Jul 11, 2025 at 04:58:57PM -0700, Jiaqi Yan wrote:
On Fri, Jul 11, 2025 at 12:42 PM Oliver Upton oliver.upton@linux.dev wrote:
On Wed, Jun 04, 2025 at 05:08:58AM +0000, Jiaqi Yan wrote:
From: Raghavendra Rao Ananta rananta@google.com
When KVM returns to userspace for KVM_EXIT_ARM_SEA, the userspace is encouraged to inject the abort into the guest via KVM_SET_VCPU_EVENTS.
KVM_SET_VCPU_EVENTS currently only allows injecting external data aborts. However, the synchronous external abort that caused KVM_EXIT_ARM_SEA is possible to be an instruction abort. Userspace is already able to tell if an abort is due to data or instruction via kvm_run.arm_sea.esr, by checking its Exception Class value.
Extend the KVM_SET_VCPU_EVENTS ioctl to allow injecting instruction abort into the guest.
Signed-off-by: Raghavendra Rao Ananta rananta@google.com Signed-off-by: Jiaqi Yan jiaqiyan@google.com
Hmm. Since we expose an ESR value to userspace I get the feeling that we should allow the user to supply an ISS for the external abort, similar to what we already do for SErrors.
Oh, I will create something in v3, by extending kvm_vcpu_events to something like:
struct { __u8 serror_pending; __u8 serror_has_esr; __u8 ext_dabt_pending; __u8 ext_iabt_pending; __u8 ext_abt_has_esr; // <= new /* Align it to 8 bytes */ __u8 pad[3]; union { __u64 serror_esr; __u64 ext_abt_esr; // <= new
This doesn't work. The ABI allows userspace to pend both an SError and SEA, so we can't use the same storage for the ESR.
}; } exception;
One question about the naming since we cannot change it once committed. Taking the existing SError injection as example, although the name in kvm_vcpu_events is serror_has_esr, it is essentially just the ISS fields of the ESR (which is also written in virt/kvm/api.rst). Why named after "esr" instead of "iss"? The only reason I can think of is, KVM wants to leave the room to accept more fields than ISS from userspace. Does this reason apply to external aborts? Asking in case if "iss" is a better name in kvm_vcpu_events, maybe for external aborts, we should use ext_abt_has_iss?
We will probably need to include more ESR fields in the future, like ESR_ELx.ISS2. So let's just keep the existing naming if that's OK with you.
Thanks, Oliver
On Sat, Jul 12, 2025 at 12:47 PM Oliver Upton oliver.upton@linux.dev wrote:
On Fri, Jul 11, 2025 at 04:58:57PM -0700, Jiaqi Yan wrote:
On Fri, Jul 11, 2025 at 12:42 PM Oliver Upton oliver.upton@linux.dev wrote:
On Wed, Jun 04, 2025 at 05:08:58AM +0000, Jiaqi Yan wrote:
From: Raghavendra Rao Ananta rananta@google.com
When KVM returns to userspace for KVM_EXIT_ARM_SEA, the userspace is encouraged to inject the abort into the guest via KVM_SET_VCPU_EVENTS.
KVM_SET_VCPU_EVENTS currently only allows injecting external data aborts. However, the synchronous external abort that caused KVM_EXIT_ARM_SEA is possible to be an instruction abort. Userspace is already able to tell if an abort is due to data or instruction via kvm_run.arm_sea.esr, by checking its Exception Class value.
Extend the KVM_SET_VCPU_EVENTS ioctl to allow injecting instruction abort into the guest.
Signed-off-by: Raghavendra Rao Ananta rananta@google.com Signed-off-by: Jiaqi Yan jiaqiyan@google.com
Hmm. Since we expose an ESR value to userspace I get the feeling that we should allow the user to supply an ISS for the external abort, similar to what we already do for SErrors.
Oh, I will create something in v3, by extending kvm_vcpu_events to something like:
struct { __u8 serror_pending; __u8 serror_has_esr; __u8 ext_dabt_pending; __u8 ext_iabt_pending; __u8 ext_abt_has_esr; // <= new /* Align it to 8 bytes */ __u8 pad[3]; union { __u64 serror_esr; __u64 ext_abt_esr; // <= new
This doesn't work. The ABI allows userspace to pend both an SError and SEA, so we can't use the same storage for the ESR.
You are right, the implementation (__kvm_arm_vcpu_set_events) indeed continues to inject SError after injecting SEA.
Then we may have to extend the size of exception and shrink reserved accordingly, because I believe we want to place ext_abt_esr inside kvm_vcpu_events.exception. Something like:

struct kvm_vcpu_events {
	struct {
		__u8 serror_pending;
		__u8 serror_has_esr;
		__u8 ext_dabt_pending;
		__u8 ext_iabt_pending;
		__u8 ext_abt_has_esr;
		__u8 pad[3];
		__u64 serror_esr;
		__u64 ext_abt_esr;	// <= +64 bits
	} exception;
	__u32 reserved[10];		// <= -64 bits
};
The offset of kvm_vcpu_events.reserved changes; I don't think userspace will read/write reserved (so its offset is probably not very important?), but theoretically this is an ABI break.
Another safer but less readable option is to add it at the end:

struct kvm_vcpu_events {
	struct {
		__u8 serror_pending;
		__u8 serror_has_esr;
		__u8 ext_dabt_pending;
		__u8 ext_iabt_pending;
		__u8 ext_abt_has_esr;
		__u8 pad[3];
		__u64 serror_esr;
	} exception;
	__u32 reserved[10];		// <= -64 bits
	__u64 ext_abt_esr;		// <= +64 bits
};
Any better suggestions?
}; } exception;
One question about the naming since we cannot change it once committed. Taking the existing SError injection as example, although the name in kvm_vcpu_events is serror_has_esr, it is essentially just the ISS fields of the ESR (which is also written in virt/kvm/api.rst). Why named after "esr" instead of "iss"? The only reason I can think of is, KVM wants to leave the room to accept more fields than ISS from userspace. Does this reason apply to external aborts? Asking in case if "iss" is a better name in kvm_vcpu_events, maybe for external aborts, we should use ext_abt_has_iss?
We will probably need to include more ESR fields in the future, like ESR_ELx.ISS2. So let's just keep the existing naming if that's OK with you.
Ack to "esr", thanks Oliver!
Thanks, Oliver
Test how KVM handles guest stage-2 SEA when APEI is unable to claim it. The behavior is triggered by consuming a recoverable uncorrected memory error (UER) injected via EINJ. The test asserts two major things: 1. KVM returns to userspace with the KVM_EXIT_ARM_SEA exit reason and provides the expected fault information, e.g. esr, flags, gva, gpa. 2. Userspace is able to handle KVM_EXIT_ARM_SEA by injecting an SEA into the guest, and KVM injects the expected SEA into the VCPU.
Tested on a data center server running Siryn AmpereOne processor.
Several things to note before attempting to run this selftest: - The test relies on EINJ support in both firmware and kernel to inject the UER; otherwise the test will be skipped. - The APEI on the platform under test must be unable to claim the SEA; otherwise the test will also be skipped. - Some platforms don't support notrigger in EINJ, which may cause APEI and GHES to offline the memory before the guest can consume the injected UER, making the test unable to trigger an SEA.
Signed-off-by: Jiaqi Yan jiaqiyan@google.com --- tools/arch/arm64/include/asm/esr.h | 2 + tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++ tools/testing/selftests/kvm/lib/kvm_util.c | 1 + 4 files changed, 344 insertions(+) create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
diff --git a/tools/arch/arm64/include/asm/esr.h b/tools/arch/arm64/include/asm/esr.h index bd592ca815711..0fa17b3af1f78 100644 --- a/tools/arch/arm64/include/asm/esr.h +++ b/tools/arch/arm64/include/asm/esr.h @@ -141,6 +141,8 @@ #define ESR_ELx_SF (UL(1) << ESR_ELx_SF_SHIFT) #define ESR_ELx_AR_SHIFT (14) #define ESR_ELx_AR (UL(1) << ESR_ELx_AR_SHIFT) +#define ESR_ELx_VNCR_SHIFT (13) +#define ESR_ELx_VNCR (UL(1) << ESR_ELx_VNCR_SHIFT) #define ESR_ELx_CM_SHIFT (8) #define ESR_ELx_CM (UL(1) << ESR_ELx_CM_SHIFT)
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index d37072054a3d0..9eecce6b8274f 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -152,6 +152,7 @@ TEST_GEN_PROGS_arm64 += arm64/hypercalls TEST_GEN_PROGS_arm64 += arm64/mmio_abort TEST_GEN_PROGS_arm64 += arm64/page_fault_test TEST_GEN_PROGS_arm64 += arm64/psci_test +TEST_GEN_PROGS_arm64 += arm64/sea_to_user TEST_GEN_PROGS_arm64 += arm64/set_id_regs TEST_GEN_PROGS_arm64 += arm64/smccc_filter TEST_GEN_PROGS_arm64 += arm64/vcpu_width_config diff --git a/tools/testing/selftests/kvm/arm64/sea_to_user.c b/tools/testing/selftests/kvm/arm64/sea_to_user.c new file mode 100644 index 0000000000000..381d8597ab406 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/sea_to_user.c @@ -0,0 +1,340 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Test KVM returns to userspace with KVM_EXIT_ARM_SEA if host APEI fails + * to handle SEA and userspace has opt-ed in KVM_CAP_ARM_SEA_TO_USER. + * + * After reaching userspace with expected arm_sea info, also test userspace + * injecting a synchronous external data abort into the guest. + * + * This test utilizes EINJ to generate a REAL synchronous external data + * abort by consuming a recoverable uncorrectable memory error. Therefore + * the device under test must support EINJ in both firmware and host kernel, + * including the notrigger feature. Otherwise the test will be skipped. + * The under-test platform's APEI should be unable to claim SEA. Otherwise + * the test will also be skipped. + */ + +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "guest_modes.h" + +#define PAGE_PRESENT (1ULL << 63) +#define PAGE_PHYSICAL 0x007fffffffffffffULL +#define PAGE_ADDR_MASK (~(0xfffULL)) + +/* Value for "Recoverable state (UER)". 
*/ +#define ESR_ELx_SET_UER 0U + +/* Group ISV and ISS[23:14]. */ +#define ESR_ELx_INST_SYNDROME ((ESR_ELx_ISV) | (ESR_ELx_SAS) | \ + (ESR_ELx_SSE) | (ESR_ELx_SRT_MASK) | \ + (ESR_ELx_SF) | (ESR_ELx_AR)) + +#define EINJ_ETYPE "/sys/kernel/debug/apei/einj/error_type" +#define EINJ_ADDR "/sys/kernel/debug/apei/einj/param1" +#define EINJ_MASK "/sys/kernel/debug/apei/einj/param2" +#define EINJ_FLAGS "/sys/kernel/debug/apei/einj/flags" +#define EINJ_NOTRIGGER "/sys/kernel/debug/apei/einj/notrigger" +#define EINJ_DOIT "/sys/kernel/debug/apei/einj/error_inject" +/* Memory Uncorrectable non-fatal. */ +#define ERROR_TYPE_MEMORY_UER 0x10 +/* Memory address and mask valid (param1 and param2). */ +#define MASK_MEMORY_UER 0b10 + +/* Guest virtual address region = [2G, 3G). */ +#define START_GVA 0x80000000UL +#define VM_MEM_SIZE 0x40000000UL +/* Note: EINJ_OFFSET must < VM_MEM_SIZE. */ +#define EINJ_OFFSET 0x01234badUL +#define EINJ_GVA ((START_GVA) + (EINJ_OFFSET)) + +static vm_paddr_t einj_gpa; +static void *einj_hva; +static uint64_t einj_hpa; +static bool far_invalid; + +static uint64_t translate_to_host_paddr(unsigned long vaddr) +{ + uint64_t pinfo; + int64_t offset = vaddr / getpagesize() * sizeof(pinfo); + int fd; + uint64_t page_addr; + uint64_t paddr; + + fd = open("/proc/self/pagemap", O_RDONLY); + if (fd < 0) + ksft_exit_fail_perror("Failed to open /proc/self/pagemap"); + if (pread(fd, &pinfo, sizeof(pinfo), offset) != sizeof(pinfo)) { + close(fd); + ksft_exit_fail_perror("Failed to read /proc/self/pagemap"); + } + + close(fd); + + if ((pinfo & PAGE_PRESENT) == 0) + ksft_exit_fail_perror("Page not present"); + + page_addr = (pinfo & PAGE_PHYSICAL) << MIN_PAGE_SHIFT; + paddr = page_addr + (vaddr & (getpagesize() - 1)); + return paddr; +} + +static void write_einj_entry(const char *einj_path, uint64_t val) +{ + char cmd[256] = {0}; + FILE *cmdfile = NULL; + + sprintf(cmd, "echo %#lx > %s", val, einj_path); + cmdfile = popen(cmd, "r"); + + if (pclose(cmdfile) == 0) + 
ksft_print_msg("echo %#lx > %s - done\n", val, einj_path); + else + ksft_exit_fail_perror("Failed to write EINJ entry"); +} + +static void inject_uer(uint64_t paddr) +{ + if (access("/sys/firmware/acpi/tables/EINJ", R_OK) == -1) + ksft_test_result_skip("EINJ table no available in firmware"); + + if (access(EINJ_ETYPE, R_OK | W_OK) == -1) + ksft_test_result_skip("EINJ module probably not loaded?"); + + write_einj_entry(EINJ_ETYPE, ERROR_TYPE_MEMORY_UER); + write_einj_entry(EINJ_FLAGS, MASK_MEMORY_UER); + write_einj_entry(EINJ_ADDR, paddr); + write_einj_entry(EINJ_MASK, ~0x0UL); + write_einj_entry(EINJ_NOTRIGGER, 1); + write_einj_entry(EINJ_DOIT, 1); +} + +/* + * When host APEI successfully claims the SEA caused by guest_code, kernel + * will send SIGBUS signal with BUS_MCEERR_AR to test thread. + * + * We set up this SIGBUS handler to skip the test for that case. + */ +static void sigbus_signal_handler(int sig, siginfo_t *si, void *v) +{ + ksft_print_msg("SIGBUS (%d) received, dumping siginfo...\n", sig); + ksft_print_msg("si_signo=%d, si_errno=%d, si_code=%d, si_addr=%p\n", + si->si_signo, si->si_errno, si->si_code, si->si_addr); + if (si->si_code == BUS_MCEERR_AR) + ksft_test_result_skip("SEA is claimed by host APEI\n"); + else + ksft_test_result_fail("Exit with signal unhandled\n"); + + exit(0); +} + +static void setup_sigbus_handler(void) +{ + struct sigaction act; + + memset(&act, 0, sizeof(act)); + sigemptyset(&act.sa_mask); + act.sa_sigaction = sigbus_signal_handler; + act.sa_flags = SA_SIGINFO; + TEST_ASSERT(sigaction(SIGBUS, &act, NULL) == 0, + "Failed to setup SIGBUS handler"); +} + +static void guest_code(void) +{ + uint64_t guest_data; + + /* Consumes error will cause a SEA. 
*/ + guest_data = *(uint64_t *)EINJ_GVA; + + GUEST_FAIL("Data corruption not prevented by SEA: gva=%#lx, data=%#lx", + EINJ_GVA, guest_data); +} + +static void expect_sea_handler(struct ex_regs *regs) +{ + u64 esr = read_sysreg(esr_el1); + u64 far = read_sysreg(far_el1); + bool expect_far_invalid = far_invalid; + + GUEST_PRINTF("Handling Guest SEA\n"); + GUEST_PRINTF(" ESR_EL1=%#lx, FAR_EL1=%#lx\n", esr, far); + GUEST_PRINTF(" Entire ISS2=%#llx\n", ESR_ELx_ISS2(esr)); + GUEST_PRINTF(" ISV + ISS[23:14]=%#lx\n", esr & ESR_ELx_INST_SYNDROME); + GUEST_PRINTF(" VNCR=%#lx\n", esr & ESR_ELx_VNCR); + GUEST_PRINTF(" SET=%#lx\n", esr & ESR_ELx_SET_MASK); + + GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_CUR); + GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); + + /* Asserts bits hidden by KVM. */ + GUEST_ASSERT_EQ(ESR_ELx_ISS2(esr), 0); + GUEST_ASSERT_EQ((esr & ESR_ELx_INST_SYNDROME), 0); + GUEST_ASSERT_EQ(esr & ESR_ELx_VNCR, 0); + GUEST_ASSERT_EQ(esr & ESR_ELx_SET_MASK, ESR_ELx_SET_UER); + + if (expect_far_invalid) { + GUEST_ASSERT_EQ(esr & ESR_ELx_FnV, ESR_ELx_FnV); + GUEST_PRINTF("Guest observed garbage value in FAR\n"); + } else { + GUEST_ASSERT_EQ(esr & ESR_ELx_FnV, 0); + GUEST_ASSERT_EQ(far, EINJ_GVA); + } + + GUEST_DONE(); +} + +static void vcpu_inject_sea(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_events events = {}; + + events.exception.ext_dabt_pending = true; + vcpu_events_set(vcpu, &events); +} + +static void run_vm(struct kvm_vm *vm, struct kvm_vcpu *vcpu) +{ + struct ucall uc; + bool guest_done = false; + struct kvm_run *run = vcpu->run; + + /* Resume the vCPU after error injection to consume the error. 
*/ + vcpu_run(vcpu); + + ksft_print_msg("Dump kvm_run info about KVM_EXIT_%s\n", + exit_reason_str(run->exit_reason)); + ksft_print_msg("kvm_run.arm_sea: esr=%#llx, flags=%#llx\n", + run->arm_sea.esr, run->arm_sea.flags); + ksft_print_msg("kvm_run.arm_sea: gva=%#llx, gpa=%#llx\n", + run->arm_sea.gva, run->arm_sea.gpa); + + /* Validate the KVM_EXIT. */ + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_ARM_SEA); + TEST_ASSERT_EQ(ESR_ELx_EC(run->arm_sea.esr), ESR_ELx_EC_DABT_LOW); + TEST_ASSERT_EQ(run->arm_sea.esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); + TEST_ASSERT_EQ(run->arm_sea.esr & ESR_ELx_SET_MASK, ESR_ELx_SET_UER); + + if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GVA_VALID) + TEST_ASSERT_EQ(run->arm_sea.gva, EINJ_GVA); + + if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID) + TEST_ASSERT_EQ(run->arm_sea.gpa, einj_gpa & PAGE_ADDR_MASK); + + far_invalid = run->arm_sea.esr & ESR_ELx_FnV; + + /* Inject a SEA into guest and expect handled in SEA handler. */ + vcpu_inject_sea(vcpu); + + /* Expect the guest to reach GUEST_DONE gracefully. 
*/ + do { + vcpu_run(vcpu); + switch (get_ucall(vcpu, &uc)) { + case UCALL_PRINTF: + ksft_print_msg("From guest: %s", uc.buffer); + break; + case UCALL_DONE: + ksft_print_msg("Guest done gracefully!\n"); + guest_done = 1; + break; + case UCALL_ABORT: + ksft_print_msg("Guest aborted!\n"); + guest_done = 1; + REPORT_GUEST_ASSERT(uc); + break; + default: + TEST_FAIL("Unexpected ucall: %lu\n", uc.cmd); + } + } while (!guest_done); +} + +static struct kvm_vm *vm_create_with_sea_handler(struct kvm_vcpu **vcpu) +{ + size_t backing_page_size; + size_t guest_page_size; + size_t alignment; + uint64_t num_guest_pages; + vm_paddr_t start_gpa; + enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB; + struct kvm_vm *vm; + + backing_page_size = get_backing_src_pagesz(src_type); + guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size; + alignment = max(backing_page_size, guest_page_size); + num_guest_pages = VM_MEM_SIZE / guest_page_size; + + vm = __vm_create_with_one_vcpu(vcpu, num_guest_pages, guest_code); + vm_init_descriptor_tables(vm); + vcpu_init_descriptor_tables(*vcpu); + + vm_install_sync_handler(vm, + /*vector=*/VECTOR_SYNC_CURRENT, + /*ec=*/ESR_ELx_EC_DABT_CUR, + /*handler=*/expect_sea_handler); + + start_gpa = (vm->max_gfn - num_guest_pages) * guest_page_size; + start_gpa = align_down(start_gpa, alignment); + + vm_userspace_mem_region_add( + /*vm=*/vm, + /*src_type=*/src_type, + /*guest_paddr=*/start_gpa, + /*slot=*/1, + /*npages=*/num_guest_pages, + /*flags=*/0); + + virt_map(vm, START_GVA, start_gpa, num_guest_pages); + + ksft_print_msg("Mapped %#lx pages: gva=%#lx to gpa=%#lx\n", + num_guest_pages, START_GVA, start_gpa); + return vm; +} + +static void vm_inject_memory_uer(struct kvm_vm *vm) +{ + uint64_t guest_data; + + einj_gpa = addr_gva2gpa(vm, EINJ_GVA); + einj_hva = addr_gva2hva(vm, EINJ_GVA); + + /* Populate certain data before injecting UER. 
*/ + *(uint64_t *)einj_hva = 0xBAADCAFE; + guest_data = *(uint64_t *)einj_hva; + ksft_print_msg("Before EINJect: data=%#lx\n", + guest_data); + + einj_hpa = translate_to_host_paddr((unsigned long)einj_hva); + + ksft_print_msg("EINJ_GVA=%#lx, einj_gpa=%#lx, einj_hva=%p, einj_hpa=%#lx\n", + EINJ_GVA, einj_gpa, einj_hva, einj_hpa); + + inject_uer(einj_hpa); + ksft_print_msg("Memory UER EINJected\n"); +} + +int main(int argc, char *argv[]) +{ + struct kvm_vm *vm; + struct kvm_vcpu *vcpu; + + TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_SEA_TO_USER)); + + setup_sigbus_handler(); + + vm = vm_create_with_sea_handler(&vcpu); + + vm_enable_cap(vm, KVM_CAP_ARM_SEA_TO_USER, 0); + + vm_inject_memory_uer(vm); + + run_vm(vm, vcpu); + + kvm_vm_free(vm); + + return 0; +} diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index 815bc45dd8dc6..bc9fcf6c3295a 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -2021,6 +2021,7 @@ static struct exit_reason { KVM_EXIT_STRING(NOTIFY), KVM_EXIT_STRING(LOONGARCH_IOCSR), KVM_EXIT_STRING(MEMORY_FAULT), + KVM_EXIT_STRING(ARM_SEA), };
/*
Test that userspace can use KVM_SET_VCPU_EVENTS to inject an external instruction abort into the guest. The test injects the instruction abort at an arbitrary time, without a real SEA happening in the guest VCPU, so only certain ESR_EL1 bits are expected and asserted.
Signed-off-by: Jiaqi Yan jiaqiyan@google.com --- tools/arch/arm64/include/uapi/asm/kvm.h | 3 +- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++++++++++++++++ 3 files changed, 101 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h index af9d9acaf9975..d3a4530846311 100644 --- a/tools/arch/arm64/include/uapi/asm/kvm.h +++ b/tools/arch/arm64/include/uapi/asm/kvm.h @@ -184,8 +184,9 @@ struct kvm_vcpu_events { __u8 serror_pending; __u8 serror_has_esr; __u8 ext_dabt_pending; + __u8 ext_iabt_pending; /* Align it to 8 bytes */ - __u8 pad[5]; + __u8 pad[4]; __u64 serror_esr; } exception; __u32 reserved[12]; diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 9eecce6b8274f..e6b504ded9c1c 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -149,6 +149,7 @@ TEST_GEN_PROGS_arm64 += arm64/arch_timer_edge_cases TEST_GEN_PROGS_arm64 += arm64/debug-exceptions TEST_GEN_PROGS_arm64 += arm64/host_sve TEST_GEN_PROGS_arm64 += arm64/hypercalls +TEST_GEN_PROGS_arm64 += arm64/inject_iabt TEST_GEN_PROGS_arm64 += arm64/mmio_abort TEST_GEN_PROGS_arm64 += arm64/page_fault_test TEST_GEN_PROGS_arm64 += arm64/psci_test diff --git a/tools/testing/selftests/kvm/arm64/inject_iabt.c b/tools/testing/selftests/kvm/arm64/inject_iabt.c new file mode 100644 index 0000000000000..0c7999e5ba5b3 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/inject_iabt.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * inject_iabt.c - Tests for injecting instruction aborts into guest. 
+ */ + +#include "processor.h" +#include "test_util.h" + +static void expect_iabt_handler(struct ex_regs *regs) +{ + u64 esr = read_sysreg(esr_el1); + + GUEST_PRINTF("Handling Guest SEA\n"); + GUEST_PRINTF(" ESR_EL1=%#lx\n", esr); + + GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_IABT_CUR); + GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); + + GUEST_DONE(); +} + +static void guest_code(void) +{ + GUEST_FAIL("Guest should only run SEA handler"); +} + +static void vcpu_run_expect_done(struct kvm_vcpu *vcpu) +{ + struct ucall uc; + bool guest_done = false; + + do { + vcpu_run(vcpu); + switch (get_ucall(vcpu, &uc)) { + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + break; + case UCALL_PRINTF: + ksft_print_msg("From guest: %s", uc.buffer); + break; + case UCALL_DONE: + ksft_print_msg("Guest done gracefully!\n"); + guest_done = true; + break; + default: + TEST_FAIL("Unexpected ucall: %lu", uc.cmd); + } + } while (!guest_done); +} + +static void vcpu_inject_ext_iabt(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_events events = {}; + + events.exception.ext_iabt_pending = true; + vcpu_events_set(vcpu, &events); +} + +static void vcpu_inject_invalid_abt(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_events events = {}; + int r; + + events.exception.ext_iabt_pending = true; + events.exception.ext_dabt_pending = true; + + ksft_print_msg("Injecting invalid external abort events\n"); + r = __vcpu_ioctl(vcpu, KVM_SET_VCPU_EVENTS, &events); + TEST_ASSERT(r && errno == EINVAL, + KVM_IOCTL_ERROR(KVM_SET_VCPU_EVENTS, r)); +} + +static void test_inject_iabt(void) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + + vm = vm_create_with_one_vcpu(&vcpu, guest_code); + + vm_init_descriptor_tables(vm); + vcpu_init_descriptor_tables(vcpu); + + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, + ESR_ELx_EC_IABT_CUR, expect_iabt_handler); + + vcpu_inject_invalid_abt(vcpu); + + vcpu_inject_ext_iabt(vcpu); + vcpu_run_expect_done(vcpu); + + kvm_vm_free(vm); +} + +int main(void) +{ + 
test_inject_iabt(); + return 0; +}
On Wed, Jun 04, 2025 at 05:09:00AM +0000, Jiaqi Yan wrote:
Test userspace can use KVM_SET_VCPU_EVENTS to inject an external instruction abort into guest. The test injects instruction abort at an arbitrary time without real SEA happening in the guest VCPU, so only certain ESR_EL1 bits are expected and asserted.
Signed-off-by: Jiaqi Yan jiaqiyan@google.com
I reworked mmio_abort to be a general external abort test, can you add your test cases there in the next spin (arm64/external_aborts.c)?
Thanks, Oliver
tools/arch/arm64/include/uapi/asm/kvm.h | 3 +- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++++++++++++++++ 3 files changed, 101 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h index af9d9acaf9975..d3a4530846311 100644 --- a/tools/arch/arm64/include/uapi/asm/kvm.h +++ b/tools/arch/arm64/include/uapi/asm/kvm.h @@ -184,8 +184,9 @@ struct kvm_vcpu_events { __u8 serror_pending; __u8 serror_has_esr; __u8 ext_dabt_pending;
+	__u8 ext_iabt_pending;
 	/* Align it to 8 bytes */
-	__u8 pad[5];
+	__u8 pad[4];
 	__u64 serror_esr;
 } exception;
 __u32 reserved[12];
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 9eecce6b8274f..e6b504ded9c1c 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -149,6 +149,7 @@ TEST_GEN_PROGS_arm64 += arm64/arch_timer_edge_cases TEST_GEN_PROGS_arm64 += arm64/debug-exceptions TEST_GEN_PROGS_arm64 += arm64/host_sve TEST_GEN_PROGS_arm64 += arm64/hypercalls +TEST_GEN_PROGS_arm64 += arm64/inject_iabt TEST_GEN_PROGS_arm64 += arm64/mmio_abort TEST_GEN_PROGS_arm64 += arm64/page_fault_test TEST_GEN_PROGS_arm64 += arm64/psci_test diff --git a/tools/testing/selftests/kvm/arm64/inject_iabt.c b/tools/testing/selftests/kvm/arm64/inject_iabt.c new file mode 100644 index 0000000000000..0c7999e5ba5b3 --- /dev/null +++ b/tools/testing/selftests/kvm/arm64/inject_iabt.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only +/*
+ * inject_iabt.c - Tests for injecting instruction aborts into guest.
+ */
+
+#include "processor.h"
+#include "test_util.h"
+
+static void expect_iabt_handler(struct ex_regs *regs)
+{
+	u64 esr = read_sysreg(esr_el1);
+
+	GUEST_PRINTF("Handling Guest SEA\n");
+	GUEST_PRINTF("  ESR_EL1=%#lx\n", esr);
+
+	GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_IABT_CUR);
+	GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT);
+
+	GUEST_DONE();
+}
+
+static void guest_code(void)
+{
+	GUEST_FAIL("Guest should only run SEA handler");
+}
+
+static void vcpu_run_expect_done(struct kvm_vcpu *vcpu)
+{
+	struct ucall uc;
+	bool guest_done = false;
+
+	do {
+		vcpu_run(vcpu);
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		case UCALL_PRINTF:
+			ksft_print_msg("From guest: %s", uc.buffer);
+			break;
+		case UCALL_DONE:
+			ksft_print_msg("Guest done gracefully!\n");
+			guest_done = true;
+			break;
+		default:
+			TEST_FAIL("Unexpected ucall: %lu", uc.cmd);
+		}
+	} while (!guest_done);
+}
+
+static void vcpu_inject_ext_iabt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_events events = {};
+
+	events.exception.ext_iabt_pending = true;
+	vcpu_events_set(vcpu, &events);
+}
+
+static void vcpu_inject_invalid_abt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_events events = {};
+	int r;
+
+	events.exception.ext_iabt_pending = true;
+	events.exception.ext_dabt_pending = true;
+
+	ksft_print_msg("Injecting invalid external abort events\n");
+	r = __vcpu_ioctl(vcpu, KVM_SET_VCPU_EVENTS, &events);
+	TEST_ASSERT(r && errno == EINVAL,
+		    KVM_IOCTL_ERROR(KVM_SET_VCPU_EVENTS, r));
+}
+
+static void test_inject_iabt(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	vm_init_descriptor_tables(vm);
+	vcpu_init_descriptor_tables(vcpu);
+
+	vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
+				ESR_ELx_EC_IABT_CUR, expect_iabt_handler);
+
+	vcpu_inject_invalid_abt(vcpu);
+
+	vcpu_inject_ext_iabt(vcpu);
+	vcpu_run_expect_done(vcpu);
+
+	kvm_vm_free(vm);
+}
+
+int main(void)
+{
+	test_inject_iabt();
+	return 0;
+}
2.49.0.1266.g31b7d2e469-goog
On Fri, Jul 11, 2025 at 12:45 PM Oliver Upton oliver.upton@linux.dev wrote:
On Wed, Jun 04, 2025 at 05:09:00AM +0000, Jiaqi Yan wrote:
Test userspace can use KVM_SET_VCPU_EVENTS to inject an external instruction abort into guest. The test injects instruction abort at an arbitrary time without real SEA happening in the guest VCPU, so only certain ESR_EL1 bits are expected and asserted.
Signed-off-by: Jiaqi Yan jiaqiyan@google.com
I reworked mmio_abort to be a general external abort test, can you add your test cases there in the next spin (arm64/external_aborts.c)?
For sure!
Thanks, Oliver
tools/arch/arm64/include/uapi/asm/kvm.h | 3 +- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++++++++++++++++ 3 files changed, 101 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
diff --git a/tools/arch/arm64/include/uapi/asm/kvm.h b/tools/arch/arm64/include/uapi/asm/kvm.h
index af9d9acaf9975..d3a4530846311 100644
--- a/tools/arch/arm64/include/uapi/asm/kvm.h
+++ b/tools/arch/arm64/include/uapi/asm/kvm.h
@@ -184,8 +184,9 @@ struct kvm_vcpu_events {
 			__u8 serror_pending;
 			__u8 serror_has_esr;
 			__u8 ext_dabt_pending;
+			__u8 ext_iabt_pending;
 			/* Align it to 8 bytes */
-			__u8 pad[5];
+			__u8 pad[4];
 			__u64 serror_esr;
 		} exception;
 		__u32 reserved[12];
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 9eecce6b8274f..e6b504ded9c1c 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -149,6 +149,7 @@ TEST_GEN_PROGS_arm64 += arm64/arch_timer_edge_cases
 TEST_GEN_PROGS_arm64 += arm64/debug-exceptions
 TEST_GEN_PROGS_arm64 += arm64/host_sve
 TEST_GEN_PROGS_arm64 += arm64/hypercalls
+TEST_GEN_PROGS_arm64 += arm64/inject_iabt
 TEST_GEN_PROGS_arm64 += arm64/mmio_abort
 TEST_GEN_PROGS_arm64 += arm64/page_fault_test
 TEST_GEN_PROGS_arm64 += arm64/psci_test
diff --git a/tools/testing/selftests/kvm/arm64/inject_iabt.c b/tools/testing/selftests/kvm/arm64/inject_iabt.c
new file mode 100644
index 0000000000000..0c7999e5ba5b3
--- /dev/null
+++ b/tools/testing/selftests/kvm/arm64/inject_iabt.c
@@ -0,0 +1,98 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
 * inject_iabt.c - Tests for injecting instruction aborts into guest.
 */

#include "processor.h"
#include "test_util.h"

static void expect_iabt_handler(struct ex_regs *regs)
{
u64 esr = read_sysreg(esr_el1);
GUEST_PRINTF("Handling Guest SEA\n");
GUEST_PRINTF(" ESR_EL1=%#lx\n", esr);
GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_IABT_CUR);
GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT);
GUEST_DONE();
}

static void guest_code(void)
{
GUEST_FAIL("Guest should only run SEA handler");
}

static void vcpu_run_expect_done(struct kvm_vcpu *vcpu)
{
struct ucall uc;
bool guest_done = false;
do {
vcpu_run(vcpu);
switch (get_ucall(vcpu, &uc)) {
case UCALL_ABORT:
REPORT_GUEST_ASSERT(uc);
break;
case UCALL_PRINTF:
ksft_print_msg("From guest: %s", uc.buffer);
break;
case UCALL_DONE:
ksft_print_msg("Guest done gracefully!\n");
guest_done = true;
break;
default:
TEST_FAIL("Unexpected ucall: %lu", uc.cmd);
}
} while (!guest_done);
}

static void vcpu_inject_ext_iabt(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_events events = {};
events.exception.ext_iabt_pending = true;
vcpu_events_set(vcpu, &events);
}

static void vcpu_inject_invalid_abt(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_events events = {};
int r;
events.exception.ext_iabt_pending = true;
events.exception.ext_dabt_pending = true;
ksft_print_msg("Injecting invalid external abort events\n");
r = __vcpu_ioctl(vcpu, KVM_SET_VCPU_EVENTS, &events);
TEST_ASSERT(r && errno == EINVAL,
KVM_IOCTL_ERROR(KVM_SET_VCPU_EVENTS, r));
}

static void test_inject_iabt(void)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
vm = vm_create_with_one_vcpu(&vcpu, guest_code);
vm_init_descriptor_tables(vm);
vcpu_init_descriptor_tables(vcpu);
vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
ESR_ELx_EC_IABT_CUR, expect_iabt_handler);
vcpu_inject_invalid_abt(vcpu);
vcpu_inject_ext_iabt(vcpu);
vcpu_run_expect_done(vcpu);
kvm_vm_free(vm);
}

int main(void)
{
test_inject_iabt();
return 0;
}
2.49.0.1266.g31b7d2e469-goog
Document the new userspace-visible features and APIs for handling
synchronous external abort (SEA):

- KVM_CAP_ARM_SEA_TO_USER: How userspace enables the new feature.

- KVM_EXIT_ARM_SEA: When userspace needs to handle SEA and what userspace
  gets while taking the SEA.

- KVM_CAP_ARM_INJECT_EXT_(D|I)ABT: How userspace injects SEA to guest
  while taking the SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 Documentation/virt/kvm/api.rst | 128 +++++++++++++++++++++++++++++----
 1 file changed, 115 insertions(+), 13 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index fe3d6b5d2acca..c58ecb72a4b4d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1236,8 +1236,9 @@ directly to the virtual CPU).
 			__u8 serror_pending;
 			__u8 serror_has_esr;
 			__u8 ext_dabt_pending;
+			__u8 ext_iabt_pending;
 			/* Align it to 8 bytes */
-			__u8 pad[5];
+			__u8 pad[4];
 			__u64 serror_esr;
 		} exception;
 		__u32 reserved[12];
@@ -1292,20 +1293,57 @@ ARM64:
User space may need to inject several types of events to the guest.
+Inject SError
+~~~~~~~~~~~~~
+
 Set the pending SError exception state for this VCPU. It is not possible
 to 'cancel' an Serror that has been made pending.
-If the guest performed an access to I/O memory which could not be handled by
-userspace, for example because of missing instruction syndrome decode
-information or because there is no device mapped at the accessed IPA, then
-userspace can ask the kernel to inject an external abort using the address
-from the exiting fault on the VCPU. It is a programming error to set
-ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or
-KVM_EXIT_ARM_NISV. This feature is only available if the system supports
-KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in
-how userspace reports accesses for the above cases to guests, across different
-userspace implementations. Nevertheless, userspace can still emulate all Arm
-exceptions by manipulating individual registers using the KVM_SET_ONE_REG API.
+Inject SEA (synchronous external abort)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- If the guest performed an access to I/O memory which could not be handled
+  by userspace, for example because of missing instruction syndrome decode
+  information or because there is no device mapped at the accessed IPA.
+
+- If the guest consumed an uncorrected memory error, and the RAS extension
+  in the Trusted Firmware chooses to notify the PE with an SEA, KVM has to
+  handle it when host APEI is unable to claim the SEA. For the following
+  types of faults, if userspace has enabled KVM_CAP_ARM_SEA_TO_USER, KVM
+  returns to userspace with KVM_EXIT_ARM_SEA:
+
+  - Synchronous external abort, not on translation table walk or hardware
+    update of translation table.
+
+  - Synchronous external abort on stage-1 translation table walk or
+    hardware update of stage-1 translation table, including all levels.
+
+  - Synchronous parity or ECC error on memory access, not on translation
+    table walk.
+
+  - Synchronous parity or ECC error on memory access on stage-1 translation
+    table walk or hardware update of stage-1 translation table, including
+    all levels.
+
+Note that an external abort or ECC error on memory access on stage-2
+translation table walk or hardware update of stage-2 translation table does
+not result in KVM_EXIT_ARM_SEA, even if KVM_CAP_ARM_SEA_TO_USER is enabled.
+
+For the cases above, userspace can ask the kernel to replay either an
+external data abort (by setting ext_dabt_pending) or an external instruction
+abort (by setting ext_iabt_pending) into the faulting VCPU. KVM will use the
+address from the exiting fault on the VCPU. Setting both ext_dabt_pending
+and ext_iabt_pending at the same time will return -EINVAL.
+
+It is a programming error to set ext_dabt_pending or ext_iabt_pending after
+an exit which was not KVM_EXIT_MMIO, KVM_EXIT_ARM_NISV or KVM_EXIT_ARM_SEA.
+Injecting SEA for data and instruction abort is only available if KVM
+supports KVM_CAP_ARM_INJECT_EXT_DABT and KVM_CAP_ARM_INJECT_EXT_IABT
+respectively.
+
+This is a helper which provides commonality in how userspace reports
+accesses for the above cases to guests, across different userspace
+implementations. Nevertheless, userspace can still emulate all Arm
+exceptions by manipulating individual registers using the KVM_SET_ONE_REG
+API.
See KVM_GET_VCPU_EVENTS for the data structure.
@@ -7163,6 +7201,58 @@ The valid value for 'flags' is:
   - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
     in VMCS. It would run into unknown result if resume the target VM.
+::
+
+		/* KVM_EXIT_ARM_SEA */
+		struct {
+			__u64 esr;
+#define KVM_EXIT_ARM_SEA_FLAG_GVA_VALID		(1ULL << 0)
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID		(1ULL << 1)
+			__u64 flags;
+			__u64 gva;
+			__u64 gpa;
+		} arm_sea;
+
+Used on arm64 systems. When the VM capability KVM_CAP_ARM_SEA_TO_USER is
+enabled, a VM exit is generated if the guest causes a synchronous external
+abort (SEA) and the host APEI fails to handle the SEA.
+
+Historically, KVM handles SEA by first delegating it to the host APEI, as
+there is a high chance that the SEA was caused by consuming an uncorrected
+memory error. However, not all platforms support SEA handling in APEI, and
+KVM's fallback handling is to inject an async SError into the guest, which
+usually panics the guest kernel unpleasantly. As an alternative, userspace
+can participate in the SEA handling by enabling KVM_CAP_ARM_SEA_TO_USER at
+VM creation, after querying the capability. Once enabled, when KVM has to
+handle a guest-caused SEA, it returns to userspace with KVM_EXIT_ARM_SEA,
+with details about the SEA available in 'arm_sea'.
+
+The 'esr' field holds the value of the exception syndrome register (ESR) at
+the time KVM takes the SEA, which tells userspace the character of the
+current SEA, such as its Exception Class, Synchronous Error Type, Fault
+Specific Code and so on. For more details on the ESR, check the Arm
+Architecture Registers documentation.
+
+The 'flags' field indicates whether the faulting addresses were valid while
+taking the SEA:
+
+  - KVM_EXIT_ARM_SEA_FLAG_GVA_VALID -- the faulting guest virtual address
+    is valid and userspace can get its value in the 'gva' field.
+  - KVM_EXIT_ARM_SEA_FLAG_GPA_VALID -- the faulting guest physical address
+    is valid and userspace can get its value in the 'gpa' field.
+
+Userspace needs to handle the guest SEA synchronously, namely in the same
+thread that runs KVM_RUN and receives KVM_EXIT_ARM_SEA.
+One of the encouraged approaches is to use KVM_SET_VCPU_EVENTS to inject
+the SEA into the faulting VCPU. This way, the guest has the opportunity to
+keep running, and the blast radius of the SEA is limited to the particular
+guest application that caused it. If the Exception Class indicated by the
+'esr' field in 'arm_sea' is data abort, userspace should inject a data
+abort. If the Exception Class is instruction abort, userspace should inject
+an instruction abort. Userspace may also emulate the SEA for the VM by
+itself using the KVM_SET_ONE_REG API. In this case, it can use the valid
+values in the 'gva' and 'gpa' fields to manipulate the VCPU's registers
+(e.g. FAR_EL1, HPFAR_EL1).
+
 ::
 	/* Fix the size of the union. */
@@ -8490,7 +8580,7 @@ ENOSYS for the others.
 When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of type
 KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.

-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
 -------------------------------------
 :Architectures: arm64
@@ -8508,6 +8598,18 @@ aforementioned registers before the first KVM_RUN. These registers are VM
 scoped, meaning that the same set of values are presented on all vCPUs in a
 given VM.

+7.43 KVM_CAP_ARM_SEA_TO_USER
+----------------------------
+
+:Architecture: arm64
+:Target: VM
+:Parameters: none
+:Returns: 0 on success, -EINVAL if unsupported.
+
+This capability, if KVM_CHECK_EXTENSION indicates that it is available,
+means that KVM has an implementation that allows userspace to participate
+in handling synchronous external aborts caused by the VM, via
+KVM_EXIT_ARM_SEA exits.
+
 8. Other capabilities.
 ======================
linux-kselftest-mirror@lists.linaro.org