On Thu, Nov 13, 2025 at 02:54:33PM +0100, Mauro Carvalho Chehab wrote:
Hi,
On Mon, Nov 10, 2025 at 09:41:33AM -0800, Jiaqi Yan wrote:
On Mon, Oct 20, 2025 at 7:46 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
On Mon, Oct 13, 2025 at 06:59:00PM +0000, Jiaqi Yan wrote:
Problem
When host APEI is unable to claim a synchronous external abort (SEA) during a guest abort, KVM today directly injects an asynchronous SError into the VCPU and then resumes it. The injected SError usually results in an unpleasant guest kernel panic.
One of the major sources of guest SEAs is a VCPU consuming a recoverable uncorrected memory error (UER), which is not uncommon at all in modern datacenter servers with large amounts of physical memory. Although an SError and guest panic are sufficient to stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.
Proposed Solution
The idea is that we can replay the SEA to the faulting VCPU. If the memory error consumption, or the fault that causes the SEA, does not come from the guest kernel, the blast radius can be limited to the poison-consuming guest process, while the VM keeps running.
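The routing described above can be sketched as a small decision function. This is only an illustrative model of the flow in this cover letter, not kernel or KVM code: the type and function names (sea_event, route_guest_sea, guest_impact) and the "vmm_opted_in" switch are assumptions made up for this sketch.

```c
#include <stdbool.h>

/* Hypothetical outcomes; illustrative names, not kernel or KVM API. */
enum sea_outcome {
	SEA_HANDLED_BY_HOST,	/* host APEI claimed the abort */
	SEA_SERROR_INJECTED,	/* today's behaviour: async SError into the VCPU */
	SEA_REPLAYED_TO_GUEST,	/* proposed: SEA is replayed to the faulting VCPU */
};

enum guest_impact {
	IMPACT_NONE_IN_GUEST,	/* host dealt with the error */
	IMPACT_GUEST_PANIC,	/* guest kernel (usually) panics */
	IMPACT_PROCESS_ONLY,	/* only the poison-consuming process is affected */
};

struct sea_event {
	bool host_apei_claimed;		/* did host APEI claim the SEA? */
	bool vmm_opted_in;		/* assumption: userspace asked for replay */
	bool fault_in_guest_kernel;	/* was the guest kernel the consumer? */
};

/* Decide who ends up dealing with a guest SEA. */
static enum sea_outcome route_guest_sea(const struct sea_event *ev)
{
	if (ev->host_apei_claimed)
		return SEA_HANDLED_BY_HOST;
	if (!ev->vmm_opted_in)
		return SEA_SERROR_INJECTED;
	return SEA_REPLAYED_TO_GUEST;
}

/* Estimate the blast radius inside the guest for a given event. */
static enum guest_impact guest_impact(const struct sea_event *ev)
{
	switch (route_guest_sea(ev)) {
	case SEA_HANDLED_BY_HOST:
		return IMPACT_NONE_IN_GUEST;
	case SEA_SERROR_INJECTED:
		return IMPACT_GUEST_PANIC;
	case SEA_REPLAYED_TO_GUEST:
	default:
		/* A replayed SEA consumed by the guest kernel is still fatal. */
		return ev->fault_in_guest_kernel ? IMPACT_GUEST_PANIC
						 : IMPACT_PROCESS_ONLY;
	}
}
```

The point of the sketch is only that the replay path changes the outcome in one case: an unclaimed SEA whose consumer is a guest userspace process.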
I like the idea of having a "guest-first"/"host-first" approach for APEI, letting userspace (likely rasdaemon) decide whether to handle hardware errors at the guest or at the host. Yet, it sounds wrong to have a flag called KVM_EXIT_ARM_SEA, as:

1. This is not exclusive to ARM;
2. There are other notification mechanisms that can raise APEI errors. For instance, QEMU code defines:

	ACPI_GHES_NOTIFY_POLLED = 0,
	ACPI_GHES_NOTIFY_EXTERNAL = 1,
	ACPI_GHES_NOTIFY_LOCAL = 2,
	ACPI_GHES_NOTIFY_SCI = 3,
	ACPI_GHES_NOTIFY_NMI = 4,
	ACPI_GHES_NOTIFY_CMCI = 5,
	ACPI_GHES_NOTIFY_MCE = 6,
	ACPI_GHES_NOTIFY_GPIO = 7,
	ACPI_GHES_NOTIFY_SEA = 8,
	ACPI_GHES_NOTIFY_SEI = 9,
	ACPI_GHES_NOTIFY_GSIV = 10,
	ACPI_GHES_NOTIFY_SDEI = 11,
	ACPI_GHES_NOTIFY_RESERVED = 12

   - even on arm, QEMU currently implements two mechanisms (SEA and GPIO);
   - once we implement the same feature on Intel, it will likely use NMI, MCE and/or SCI.
So, IMO, the best would be to use a more generic name like KVM_EXIT_APEI or KVM_EXIT_GHES - or maybe even name it the way it really is meant: KVM_EXIT_ACPI_GUEST_FIRST.
This is not the sort of thing that I'd like to see dressed up as an arch-generic interface.
What Jiaqi is dealing with is the very sorry state of RAS on arm64, giving userspace the opportunity to decide how an SEA is handled when a platform's firmware couldn't be bothered to do so. The SEA is an architecture-specific event so we provide the hardware context to the VMM to sort things out.
If the APEI driver actually registers to handle the SEA then it will continue to handle the SEA before ever involving the VMM. I'm not aware of any system that does this. If you're lucky you'll take an *asynchronous* vector afterwards to process a CPER, and still have to deal with a 'bare' SEA.
And of course, none of this even matters for the several billion DT-based hosts out in the wild.
Thanks, Oliver