At the moment, direct injection of vLPIs can only be enabled on an all-or-nothing, per-VM basis, causing unnecessary I/O performance loss when a VM's vCPU count exceeds the available vPEs. This RFC introduces per-vCPU control over vLPI injection to recover that I/O performance in such situations.
Background
----------
The value of dynamically enabling direct vLPI injection on a per-vCPU basis is the ability to run guest VMs with a mix of hardware-forwarded and software-forwarded message-signaled interrupts.
Currently, hardware-forwarded vLPI direct injection on a KVM guest requires GICv4 and is enabled on a per-VM, all-or-nothing basis. vLPI injection enablement happens in two stages:
1) At vGIC initialization, allocate direct injection structures for each vCPU (doorbell IRQ, vPE table entry, virtual pending table, vPEID).
2) When a PCI device is configured for passthrough, map its MSIs to vLPIs using the structures allocated in step 1.
Step 1 is all-or-nothing; if any vCPU cannot be configured with the vPE structures necessary for direct injection, the vPEs of all vCPUs are torn down and direct injection is disabled VM-wide.
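For context, the all-or-nothing nature of step 1 lives in vgic_v4_init(); condensed and paraphrased (this is an illustration of the existing flow, not verbatim kernel code), it looks roughly like this:

  /* Paraphrase of the existing vgic_v4_init() flow, heavily condensed */
  int vgic_v4_init_condensed(struct kvm *kvm)
  {
          struct vgic_dist *dist = &kvm->arch.vgic;
          int ret;

          /* ... allocate dist->its_vm.vpes[], one slot per vCPU ... */

          /* One vPE + doorbell IRQ per vCPU, allocated as a single batch */
          ret = its_alloc_vcpu_irqs(&dist->its_vm);
          if (ret < 0) {
                  /* Any failure leaves every vCPU on SW-forwarded LPIs */
                  /* ... free the vpes array, report the error ... */
                  return ret;
          }

          /* ... request each vCPU's doorbell IRQ; a failure here likewise
           *     tears the whole thing down ... */
          return 0;
  }

Step 2 is then driven through kvm_vgic_v4_set_forwarding() as each passthrough MSI is configured, mapping it onto one of the vPEs allocated above.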
This VM-wide granularity of direct vLPI injection enablement raises several issues, the most pressing being performance degradation on overcommitted hosts.
VM-wide vLPI enablement creates resource inefficiency when guest VMs have more vCPUs than the host has available vPEIDs. The number of vPEIDs (and consequently, vPEs) a host can allocate is a hardware limit: the width of the vPEID space is advertised through GICD_TYPER2.VID, and the kernel caps the ID space at ITS_MAX_VPEID. Since direct injection requires each vCPU to be assigned a vPEID, at most ITS_MAX_VPEID vCPUs can be configured for direct injection at a time. Because vLPI direct injection is all-or-nothing for a VM, a new guest VM that would exhaust the remaining vPEIDs falls back to hypervisor-forwarded LPIs on all of its vCPUs, causing considerable I/O performance degradation.
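For illustration, on GICv4.1 hardware the limit is derived from GICD_TYPER2 roughly as follows (a condensed sketch of the ITS_MAX_VPEID logic, assuming the usual "VID encodes vPEID width minus one" semantics; not verbatim driver code):

  /* Sketch: vPEID budget as derived from GICD_TYPER2 (GICv4.1/RVPEID).
   * Needs <linux/bitfield.h> and <linux/irqchip/arm-gic-v3.h>. */
  static unsigned int max_vpeids(u32 gicd_typer2)
  {
          unsigned int vpeid_bits = 16;            /* GICv4.0 default */

          if (gicd_typer2 & GICD_TYPER2_VIL)       /* VID field is valid */
                  vpeid_bits = FIELD_GET(GICD_TYPER2_VID, gicd_typer2) + 1;

          return 1U << vpeid_bits;                 /* == ITS_MAX_VPEID */
  }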
This degradation is most pronounced on hosts with CPU overcommitment. Overcommitting an arbitrarily high number of vCPUs makes it easy for a VM's vCPU count to exceed the host's available vPEIDs, and even with marginally more vCPUs than vPEIDs, the current all-or-nothing vLPI paradigm disables direct injection entirely. This creates two problems: first, a single many-vCPU overcommitted VM loses all direct injection despite vPEIDs being available; second, on multi-tenant hosts, VMs booted first consume all vPEIDs, leaving later VMs without direct injection regardless of their I/O intensity. Per-vCPU control would allow userspace to allocate the available vPEIDs across VMs based on I/O workload rather than boot order or per-VM vCPU count, recovering most of the direct injection performance benefit instead of losing it completely.
To allow this per-vCPU granularity, this RFC introduces three new ioctls to the KVM API that let userspace activate and deactivate direct vLPI injection capability and resources for individual vCPUs ad hoc during VM runtime.
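As an illustration of the intended usage (the ioctl names below are placeholders standing in for the ones defined in the patches, and the sketch assumes they are issued on the vCPU fd), a VMM could hand out the vPEID budget like this:

  /* Usage sketch only; KVM_ARM_VCPU_VLPI_{ENABLE,DISABLE} are placeholder
   * names for the ioctls introduced by this series. */
  #include <sys/ioctl.h>

  static int distribute_vpeids(int *vcpu_fds, int nr_vcpus, int vpeid_budget)
  {
          for (int i = 0; i < nr_vcpus; i++) {
                  /* Give direct injection to the I/O-heavy vCPUs first */
                  unsigned long cmd = (i < vpeid_budget) ?
                          KVM_ARM_VCPU_VLPI_ENABLE : KVM_ARM_VCPU_VLPI_DISABLE;

                  if (ioctl(vcpu_fds[i], cmd) < 0)
                          return -1;
          }
          return 0;
  }

The QUERY ioctl lets the VMM check a vCPU's current vLPI injection state before deciding whether to toggle it.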
This RFC proposes userspace control, rather than kernel control, over vPEID allocation for simplicity of implementation, ease of testability, and autonomy over resource usage. In the future, the vLPI enable/disable building blocks from this RFC may be used to implement a full vPE allocation policy in the kernel.
The solution comes in several parts
-----------------------------------
1) [P 1] General declarations (ioctl definitions/stubs, kconfig option)
2) [P 2] Conditionally disable auto vLPI injection init routines
To prevent vCPUs from exceeding vPEID allocation limits upon VM boot, disable automatic vPEID allocation in the GICv4 initialization routine when the per-vCPU kconfig is active. Likewise, disable automatic hardware forwarding for PCI device-backed MSIs upon device registration.
3) [P 3-6] Implement per-vCPU vLPI enablement routine (see the sketch after this list), which:
a) Creates per-vCPU doorbell IRQ on new vCPU-scoped, rather than VM-scoped, interrupt domain hierarchies.
b) Allocates per-vCPU vPE table entries and virtual pending table, linking them to the vCPU's doorbell IRQ.
c) Iterates through interrupt translation table to set hardware forwarding for all PCI device–backed interrupts targeting the specific vCPU.
4) [P 7-8] Implement per-vCPU vLPI disablement routine (also sketched after this list), which:
a) Iterates through interrupt translation table to unset hardware forwarding for all interrupts targeting the specific vCPU.
b) Frees per-vCPU vPE table entries, virtual pending table, and doorbell IRQ, then removes vgic_dist's pointer to the vCPU's freed vPE.
5) [P 9] Couple vSGI enablement with per-vCPU vPE allocation
Since vSGIs cannot be direct-injected without an allocated vPE on the receiving vCPU, couple vSGI enablement with vLPI enablement on GICv4.1.
6) [P 10-13] Write selftests for vLPI direct injection
PCI devices cannot be passed through to selftest guests, so define an ioctl that mocks a hardware source for software-defined MSI interrupts and sets vLPI "hardware" forwarding for the MSIs. Use these vLPIs to selftest per-vCPU vLPI enablement/disablement ioctls.
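To make the shape of parts 3 and 4 concrete, here is a purely illustrative outline of the enable/disable flow; the helper names are invented for this sketch and do not correspond to functions in the patches:

  /* Illustrative outline only; the helpers below are invented names */
  static int vcpu_vlpi_enable_sketch(struct kvm_vcpu *vcpu)
  {
          int ret;

          /* 3a + 3b: doorbell IRQ, vPE table entry, vPT, vPEID */
          ret = alloc_vcpu_doorbell_and_vpe(vcpu);
          if (ret)
                  return ret;     /* e.g. -ENOSPC once vPEIDs run out */

          /* 3c: walk the ITS translation tables and switch this vCPU's
           *     device-backed MSIs to hardware forwarding */
          return forward_vcpu_mappings(vcpu, true);
  }

  static void vcpu_vlpi_disable_sketch(struct kvm_vcpu *vcpu)
  {
          /* 4a: fall back to software forwarding for this vCPU's MSIs */
          forward_vcpu_mappings(vcpu, false);

          /* 4b: free vPE table entry, vPT and doorbell IRQ; on GICv4.1
           *     this also means dropping direct vSGI injection (part 5) */
          free_vcpu_vpe_and_doorbell(vcpu);
  }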
Testing
-------

Testing has been carried out via selftests and QEMU-emulated guests.
Selftests have covered diverse vLPI configurations and race conditions. These include:
1) Stress testing LPI injection across multiple vCPUs while concurrently and repeatedly toggling the vCPUs' vLPI injection capability.
2) Enabling/disabling vLPI direct injection while scheduling or unscheduling a vCPU.
3) Allocating and freeing a single vPEID to multiple vCPUs, ensuring reusability.
4) Attempting to allocate a vPEID when all are already allocated, validating that an error is returned.
5) Calling the enable/disable vLPI ioctls when the GIC is not initialized.
6) Idempotent ioctl calls.
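For flavour, the toggling half of test (1) boils down to hammering the new ioctls from a separate thread while other threads keep injecting LPIs (again using the placeholder ioctl names from the sketch above, not the selftest harness API):

  /* Selftest sketch; placeholder ioctl names, raw fds for brevity */
  static void *vlpi_toggle_thread(void *arg)
  {
          int vcpu_fd = *(int *)arg;

          for (int i = 0; i < 10000; i++) {
                  ioctl(vcpu_fd, KVM_ARM_VCPU_VLPI_DISABLE);
                  ioctl(vcpu_fd, KVM_ARM_VCPU_VLPI_ENABLE);
          }
          return NULL;
  }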
PCI device passthrough and interrupt injection into a QEMU guest demonstrated:
1) Complete hypervisor circumvention when vLPI injection is enabled on a vCPU, and hypervisor forwarding when it is disabled.
2) Interrupts are not lost when received during per-vCPU vLPI state transitions.
Caveats
-------
1) Pending interrupts are flushed when vLPI injection is disabled for a vCPU; hardware pending state is not transferred to software. This may cause pending interrupts to be lost upon vPE disablement.
Unlike vSGIs, vLPIs do not expose their pending state through a GICD_ISPENDR register; the only place to read it is the vCPU's virtual pending table (vPT). Reading the pending status from the vPT would require invalidating any vPT cache associated with the vCPU's vPE, which in turn requires unmapping the vPE and halting the vCPU. That would be incredibly expensive and unnecessary given that MSIs are usually recoverable by the driver.
2) Direct-injected vSGIs (GICv4.1) require vCPUs to have associated vPEs. Since disabling vLPI injection on a vCPU frees its vPE, vSGI direct injection must be disabled simultaneously as well. At the moment, we use the per-vCPU vSGI toggle mechanism introduced in commit bacf2c6 to enable/disable vSGI injection alongside vLPI injection.
Maximilian Dittgen (13):
  KVM: Introduce config option for per-vCPU vLPI enablement
  KVM: arm64: Disable auto vCPU vPE assignment with per-vCPU vLPI config
  KVM: arm64: Refactor out locked section of kvm_vgic_v4_set_forwarding()
  KVM: arm64: Implement vLPI QUERY ioctl for per-vCPU vLPI injection API
  KVM: arm64: Implement vLPI ENABLE ioctl for per-vCPU vLPI injection API
  KVM: arm64: Resolve race between vCPU scheduling and vLPI enablement
  KVM: arm64: Implement vLPI DISABLE ioctl for per-vCPU vLPI Injection API
  KVM: arm64: Make per-vCPU vLPI control ioctls atomic
  KVM: arm64: Couple vSGI enablement with per-vCPU vPE allocation
  KVM: selftests: fix MAPC RDbase target formatting in vgic_lpi_stress
  KVM: Ioctl to set up userspace-injected MSIs as software-bypassing vLPIs
  KVM: arm64: selftests: Add support for stress testing direct-injected vLPIs
  KVM: arm64: selftests: Add test for per-vCPU vLPI control API
 Documentation/virt/kvm/api.rst                 |  56 +++
 arch/arm64/kvm/arm.c                           |  89 +++++
 arch/arm64/kvm/vgic/vgic-its.c                 | 142 ++++++-
 arch/arm64/kvm/vgic/vgic-v3.c                  |  14 +-
 arch/arm64/kvm/vgic/vgic-v4.c                  | 370 +++++++++++++++++-
 arch/arm64/kvm/vgic/vgic.h                     |  10 +
 drivers/irqchip/Kconfig                        |  13 +
 drivers/irqchip/irq-gic-v3-its.c               |  58 ++-
 drivers/irqchip/irq-gic-v4.c                   |  75 +++-
 include/kvm/arm_vgic.h                         |   8 +
 include/linux/irqchip/arm-gic-v3.h             |   5 +
 include/linux/irqchip/arm-gic-v4.h             |  10 +-
 include/linux/kvm_host.h                       |  11 +
 include/uapi/linux/kvm.h                       |  22 ++
 tools/testing/selftests/kvm/Makefile.kvm       |   1 +
 .../selftests/kvm/arm64/per_vcpu_vlpi.c        | 274 +++++++++++++
 .../selftests/kvm/arm64/vgic_lpi_stress.c      | 181 ++++++++-
 .../selftests/kvm/lib/arm64/gic_v3_its.c       |   9 +-
 18 files changed, 1307 insertions(+), 41 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/per_vcpu_vlpi.c