On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
All of this is solved already otherwise CPU hot unplug would explode in your face every time. The software IPI send side is carefully synchronized vs. hotplug (at least in theory). May I ask you politely to make yourself familiar with all that before touting "We do need..." based on random assumptions?
I'm aware that the software send IPI side is synchronized against hotplug. But SENDUIPI is not unless we're going to have the CPU offline code IPI every other CPU to make sure that their SENDUIPIs have completed -- we don't control the SENDUIPI code.
That's correct, but on CPU hot unplug _all_ running tasks have been migrated to an online CPU _before_ the APIC is turned off. So they all went through schedule() which set the UPID->SN bit. That's obviously racy, but that has to be handled in exit to user mode anyway because that's not different from any other migration or preemption. So that's _not_ a problem at all.
The problem only exists if we can't do the #GP trick for tasks which are sitting in uintr_wait(). Because then we _have_ to be careful vs. a concurrent SENDUPI. But that'd be not any different from the problem vs. device interrupts which we have today.
If we can use #GP then there is no problem at all and we avoid all the nasty stuff vs. hotplug and avoid the list walk etc.
After reading the ISE docs again, I think it might be possible to use the ON bit to synchronize. In the schedule-out path, if we discover that ON = 1, then there is an IPI in flight to us. In theory, we could wait for it, although actually doing so could be a mess. That's why I'm asking whether there's a way to tell the APIC to literally wait for all IPIs that are *already sent* to be delivered.
You could busy poll with interrupts enabled, but that does not solve anything. What guarantees that after APIC.IRR is clear no further IPI is sent? Nothing at all. But again, that's not any different from device interrupts and we already handle that today:
cpu down() ... disable interrupts(); for_each_interrupt_affine_to_cpu(irq) { change_affinity_to_online_cpu(irq, new_target_cpu); // Did device send to the old vector? if (APIC.IRR & vector_bit(old_vector)) send_IPI(new_target_cpu, new_vector); }
So for uintr_wait() w/o #GP we'd need to do:
for_each_waiter_on_cpu(w) { move_waiter_to_new_target_cpu_wait_list(w); w->ndest = new_target_cpu; if (w->ON) send_IPI(new_target_cpu, UIWAIT_VECTOR); }
The uintr_wait() syscall creates the very same problem as we have with device interrupts. Which means we need to make that wait thing:
upid = T1->upid; upid->vector = UINTR_WAIT_VECTOR;
This is exactly what I'm suggesting we *don't* do. Instead we set a reserved bit, we decode SENDUIPI in the #GP handler, and we emulate, in-kernel, the notification process for non-running tasks.
Yes, under the assumption that we can use #GP without breaking device delivery.
Now that I read the docs some more, I'm seriously concerned about this XSAVE design. XSAVES with UINTR is destructive -- it clears UINV. If we actually use this, then the whole last_cpu "preserve the state in registers" optimization goes out the window. So does anything that happens to assume that merely saving the state doesn't destroy it on respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which makes me nervous and would need a serious audit of our XRSTORS paths.
I have no idea what you are fantasizing about. You can XRSTORS five times in a row as long as your XSTATE memory image is correct.
If you don't want to use XSAVES to manage UINTR then you have to manualy fiddle with the MSRs and UIF in schedule() and return to user space.
Also keeping UINV alive when scheduling out creates a life time issue vs. UPID:
CPU 0 CPU 1 CPU2 T1 -> schedule // UPID is live in UINTR MSRs do_stuff_in_kernel() local_irq_disable(); SENDUIPI(T1 -> CPU1) pull T1 T1 exits free UPID
local_irq_enable(); ucode handles UINV -> UAF
Clearing UINV prevents the ucode from handling the IPI and fiddling with UPID. The CPU will forward the IPI vector to the kernel which acks it and does nothing else, i.e. it's a spurious interrupt.
Coming back to state preserving. All what needs to be done for a situation where the rest of the XSTATE is live in the registers, i.e. the T -> kthread -> T scheduling scenario, is to restore UINV on exit to user mode and handle UPID.PIR which might contain newly set bits which are obviously not yet in UPID.IRR. That can be done by MSR fiddling or by issuing an self IPI on the UINV vector which will be handled in ucode on the first user space instruction after return.
When the FPU has to be restored then the state has to be updated in the XSAVE memory image before doing XRSTORS.
Thanks,
tglx