Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the KVM_SET_VCPU_EVENTS handler to set the pending LAPIC INIT event regardless of whether the vCPU is in SMM.

However, latching INIT without checking the CPU state is racy and can lose the INIT event. This is fatal during VM startup because some APs then never switch to non-root mode, as described in commit f4ef19108608 ("KVM: X86: Fix loss of pending INIT due to race"):

    BSP                          AP
                     kvm_vcpu_ioctl_x86_get_vcpu_events
                       events->smi.latched_init = 0

                     kvm_vcpu_block
                       kvm_vcpu_check_block
                         schedule

send INIT to AP
                     kvm_vcpu_ioctl_x86_set_vcpu_events
                     (e.g. `info registers -a` when VM starts/reboots)
                       if (events->smi.latched_init == 0)
                         clear INIT in pending_events

                     kvm_apic_accept_events
                       test_bit(KVM_APIC_INIT, &pe) == false
                         vcpu->arch.mp_state maintains UNINITIALIZED

send SIPI to AP
                     kvm_apic_accept_events
                       test_bit(KVM_APIC_SIPI, &pe) == false
                         vcpu->arch.mp_state will never change to RUNNABLE
                         (defy: UNINITIALIZED => INIT_RECEIVED => RUNNABLE)
                           AP will never switch to non-root operation
When this race is hit, the VM hangs. E.g., the BSP loops in SeaBIOS's SMPLock, the AP is never reset, and the QEMU HMP command "info registers -a" shows:

CPU#0
EAX=00000002 EBX=00000002 ECX=00000000 EDX=00020000
ESI=00000000 EDI=00000000 EBP=00000008 ESP=00006c6c
EIP=000ef570 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
......
CPU#1
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080660
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
......
Fix this by handling latched INITs only in specific CPU states (SMM, VMX non-root mode, SVM with GIF=0) in KVM_SET_VCPU_EVENTS.
Cc: stable@vger.kernel.org
Fixes: ff90afa75573 ("KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM")
Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1c49bc681c46..7001b2af00ed1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5556,7 +5556,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 #endif
 
-	if (lapic_in_kernel(vcpu)) {
+	if (!kvm_apic_init_sipi_allowed(vcpu) && lapic_in_kernel(vcpu)) {
 		if (events->smi.latched_init)
 			set_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
 		else
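
The effect of the one-line change can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct vcpu_model`, `set_vcpu_events_model()`, `init_survives_race()`, `demo()` and the `init_blocked` field are hypothetical names standing in for the vCPU, the ioctl handler, and the negation of `kvm_apic_init_sipi_allowed()`. Only the bit name KVM_APIC_INIT comes from the patch, and its value (bit 0) is assumed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real KVM structures. */
#define KVM_APIC_INIT 0 /* bit number, assumed for this model */

struct vcpu_model {
	unsigned long pending_events; /* models apic->pending_events */
	bool init_blocked;            /* models !kvm_apic_init_sipi_allowed() */
};

/*
 * Models the latched_init handling in KVM_SET_VCPU_EVENTS.
 * patched == false: the INIT bit always follows latched_init.
 * patched == true:  the INIT bit is only touched while INIT is blocked
 *                   (SMM, VMX non-root mode, SVM with GIF=0).
 */
static void set_vcpu_events_model(struct vcpu_model *v, bool latched_init,
				  bool patched)
{
	if (patched && !v->init_blocked)
		return;
	if (latched_init)
		v->pending_events |= 1UL << KVM_APIC_INIT;
	else
		v->pending_events &= ~(1UL << KVM_APIC_INIT);
}

/* Replays the race: stale snapshot, then INIT arrives, then write-back. */
static bool init_survives_race(bool patched)
{
	struct vcpu_model ap = { .pending_events = 0, .init_blocked = false };
	bool stale_latched_init = false;           /* GET: no INIT pending yet */

	ap.pending_events |= 1UL << KVM_APIC_INIT; /* BSP sends INIT */
	set_vcpu_events_model(&ap, stale_latched_init, patched); /* SET */
	return (ap.pending_events & (1UL << KVM_APIC_INIT)) != 0;
}

/* Usage: the unpatched model loses the INIT, the patched model keeps it. */
static void demo(void)
{
	assert(!init_survives_race(false));
	assert(init_survives_race(true));
}
```

Calling `demo()` from a `main()` exercises both variants. The model deliberately ignores locking and the real ioctl flags; it only shows why gating on the "INIT blocked" state keeps a stale `latched_init == 0` from clearing a freshly posted INIT.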
On Wed, Aug 27, 2025, Fei Li wrote:
> Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the KVM_SET_VCPU_EVENTS handler to set the pending LAPIC INIT event regardless of whether the vCPU is in SMM.
>
> However, latching INIT without checking the CPU state is racy and can lose the INIT event. This is fatal during VM startup because some APs then never switch to non-root mode, as described in commit f4ef19108608 ("KVM: X86: Fix loss of pending INIT due to race"):
>
>     BSP                          AP
>                      kvm_vcpu_ioctl_x86_get_vcpu_events
>                        events->smi.latched_init = 0
>
>                      kvm_vcpu_block
>                        kvm_vcpu_check_block
>                          schedule
>
> send INIT to AP
>                      kvm_vcpu_ioctl_x86_set_vcpu_events
>                      (e.g. `info registers -a` when VM starts/reboots)
>                        if (events->smi.latched_init == 0)
>                          clear INIT in pending_events
This is a QEMU bug, no? IIUC, it's invoking kvm_vcpu_ioctl_x86_set_vcpu_events() with stale data. I'm also a bit confused as to how QEMU is even gaining control of the vCPU to emit KVM_SET_VCPU_EVENTS if the vCPU is in kvm_vcpu_block().
On Wed, Aug 27, 2025 at 6:01 PM Sean Christopherson <seanjc@google.com> wrote:
> This is a QEMU bug, no?
I think I agree.
> IIUC, it's invoking kvm_vcpu_ioctl_x86_set_vcpu_events() with stale data.
More precisely, it's not expecting other vCPUs to change the pending events asynchronously.
> I'm also a bit confused as to how QEMU is even gaining control of the vCPU to emit KVM_SET_VCPU_EVENTS if the vCPU is in kvm_vcpu_block().
With a signal. :)
Paolo
On 8/28/25 12:08 AM, Paolo Bonzini wrote:
>> This is a QEMU bug, no?
> I think I agree.
Actually, this bug is triggered by a monitoring tool in our production environment. The monitor issues the 'info registers -a' HMP command at a fixed frequency, even during VM startup, which leaves some APs in KVM_MP_STATE_UNINITIALIZED forever. The race occurs with extremely low probability, though: about 1~2 VM hangs per week.
Considering that other VMMs, such as cloud-hypervisor and firecracker, may have similar races, I think KVM had better do some handling. But anyway, I will check the QEMU code to avoid this race. Thanks to both of you for your comments. 🙂
Have a nice day, thanks
Fei
>> IIUC, it's invoking kvm_vcpu_ioctl_x86_set_vcpu_events() with stale data.
> More precisely, it's not expecting other vCPUs to change the pending events asynchronously.
Yes, I will sort out a more complete call sequence later.
>> I'm also a bit confused as to how QEMU is even gaining control of the vCPU to emit KVM_SET_VCPU_EVENTS if the vCPU is in kvm_vcpu_block().
> With a signal. :)
> Paolo
On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@bytedance.com> wrote:
> Actually, this bug is triggered by a monitoring tool in our production environment. The monitor issues the 'info registers -a' HMP command at a fixed frequency, even during VM startup, which leaves some APs in KVM_MP_STATE_UNINITIALIZED forever. The race occurs with extremely low probability, though: about 1~2 VM hangs per week.
> Considering that other VMMs, such as cloud-hypervisor and firecracker, may have similar races, I think KVM had better do some handling. But anyway, I will check the QEMU code to avoid this race. Thanks to both of you for your comments. 🙂
If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in similar cases, that of course would help understanding the situation better.
In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all vCPUs have halted.
Paolo
On 8/29/25 12:44 AM, Paolo Bonzini wrote:
> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in similar cases, that of course would help understanding the situation better.
> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all vCPUs have halted.
> Paolo
Hi Paolo and Sean,
Sorry for the late response, I have been a little busy with other things recently. The complete call sequence for the bad case is as follows. Three actors are involved:

[1] `info registers -a` HMP command, issued every 2ms
[2] AP (vcpu1) thread
[3] BSP (vcpu0), sending INIT/SIPI

[2] KVM: KVM_RUN, then schedule() in the kvm_vcpu_block() loop

[1] for each cpu: cpu_synchronize_state
      if !qemu_thread_is_self()
        1. insert into cpu->work_list, to be handled asynchronously
        2. then kick the AP (vcpu1) by sending the SIG_IPI/SIGUSR1 signal

[2] KVM: checks signal_pending, breaks the loop and returns -EINTR
    QEMU: breaks the kvm_cpu_exec loop, then runs
      1. qemu_wait_io_event()
           => process_queued_cpu_work
           => cpu->work_list.func(), i.e. the do_kvm_cpu_synchronize_state() callback
           => kvm_arch_get_registers
           => kvm_get_mp_state
              /* KVM: get_mpstate also calls kvm_apic_accept_events() to handle INIT and SIPI */
           => cpu->vcpu_dirty = true;
         // end of qemu_wait_io_event

[3] SeaBIOS: the BSP enters non-root mode and runs reset_vector() in SeaBIOS.
             It sends INIT and then SIPI by writing APIC_ICR during smp_scan
    KVM: BSP (vcpu0) exits, then
           => handle_apic_write
           => kvm_lapic_reg_write
           => kvm_apic_send_ipi to all APs
           => for each AP: __apic_accept_irq, e.g. for AP (vcpu1)
           => case APIC_DM_INIT:
                apic->pending_events = (1UL << KVM_APIC_INIT)
                (not kick the AP yet)
           => case APIC_DM_STARTUP:
                set_bit(KVM_APIC_SIPI, &apic->pending_events)
                (not kick the AP yet)

[2]   2. kvm_cpu_exec()
           => if (cpu->vcpu_dirty):
           => kvm_arch_put_registers
           => kvm_put_vcpu_events
    KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
           => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
              i.e. pending_events changes from 11b to 10b
         // end of kvm_vcpu_ioctl_x86_set_vcpu_events
    QEMU:  => after put_registers, cpu->vcpu_dirty = false;
           => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
    KVM: KVM_RUN
           => schedule() in kvm_vcpu_block() until QEMU's next SIG_IPI/SIGUSR1 signal
         /* AP (vcpu1)'s mp_state never changes from KVM_MP_STATE_UNINITIALIZED to
            KVM_MP_STATE_INIT_RECEIVED, let alone to KVM_MP_STATE_RUNNABLE, because
            INIT is never handled inside kvm_apic_accept_events() and the BSP does
            not send INIT/SIPI again during smp_scan. AP (vcpu1) therefore never
            enters non-root mode. */

[3] SeaBIOS: waits for CountCPUs == expected_cpus_count and loops forever,
             i.e. AP (vcpu1) stays at EIP=0000fff0 && CS =f000 ffff0000,
             and BSP (vcpu0) appears 100% utilized because it is in a while loop.
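
The bit arithmetic and state transitions in the trace above can be replayed with a short user-space model. This is a simplified sketch, not the kernel implementation: `struct ap_model`, `accept_events_model()`, `replay_trace()` and the enum are hypothetical names, the bit numbers (INIT = bit 0, SIPI = bit 1) are assumed so that they match the "11b to 10b" remark, and the model only covers the AP path UNINITIALIZED => INIT_RECEIVED => RUNNABLE.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers are assumptions chosen to match the "11b -> 10b" remark. */
#define KVM_APIC_INIT 0
#define KVM_APIC_SIPI 1

enum mp_state_model { UNINITIALIZED, INIT_RECEIVED, RUNNABLE };

struct ap_model {
	unsigned long pending_events;
	enum mp_state_model mp_state;
};

/* Returns whether the bit was set, and clears it. */
static bool test_and_clear(unsigned long *word, int bit)
{
	bool was_set = (*word & (1UL << bit)) != 0;

	*word &= ~(1UL << bit);
	return was_set;
}

/* Simplified model of how an AP consumes pending INIT and SIPI events. */
static void accept_events_model(struct ap_model *ap)
{
	if (test_and_clear(&ap->pending_events, KVM_APIC_INIT))
		ap->mp_state = INIT_RECEIVED;
	/* A SIPI only takes effect if INIT was received first. */
	if (test_and_clear(&ap->pending_events, KVM_APIC_SIPI) &&
	    ap->mp_state == INIT_RECEIVED)
		ap->mp_state = RUNNABLE;
}

/* Replays steps [3] and [2]; stale_put selects the stale write-back. */
static enum mp_state_model replay_trace(bool stale_put)
{
	struct ap_model ap = { 0, UNINITIALIZED };

	ap.pending_events = 1UL << KVM_APIC_INIT;   /* APIC_DM_INIT */
	ap.pending_events |= 1UL << KVM_APIC_SIPI;  /* APIC_DM_STARTUP: 11b */
	if (stale_put)                              /* stale SET_VCPU_EVENTS */
		ap.pending_events &= ~(1UL << KVM_APIC_INIT); /* 11b -> 10b */
	accept_events_model(&ap);
	return ap.mp_state;
}
```

In this model `replay_trace(true)` leaves the AP in UNINITIALIZED because the SIPI is consumed without a preceding INIT, while `replay_trace(false)` reaches RUNNABLE.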
As for other VMMs (such as cloud-hypervisor and firecracker), they have no interactive command like 'info registers -a'. Sorry again, I haven't had time to check their code to confirm whether they invoke KVM_SET_VCPU_EVENTS in similar cases; maybe later. :)
Have a nice day, thanks Fei
linux-stable-mirror@lists.linaro.org