Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

1 Oct 2021


      On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
...
On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
...
All of this is solved already otherwise CPU hot unplug would explode in
your face every time. The software IPI send side is carefully
synchronized vs. hotplug (at least in theory). May I ask you politely to
make yourself familiar with all that before touting "We do need..." based
on random assumptions?
I'm aware that the software send IPI side is synchronized against
hotplug.  But SENDUIPI is not unless we're going to have the CPU
offline code IPI every other CPU to make sure that their SENDUIPIs
have completed -- we don't control the SENDUIPI code.
That's correct, but on CPU hot unplug _all_ running tasks have been
migrated to an online CPU _before_ the APIC is turned off. So they all
went through schedule() which set the UPID->SN bit. That's obviously
racy, but that has to be handled in exit to user mode anyway because
that's not different from any other migration or preemption. So that's
_not_ a problem at all.
The problem only exists if we can't do the #GP trick for tasks which are
sitting in uintr_wait(). Because then we _have_ to be careful vs. a
concurrent SENDUPI. But that'd be not any different from the problem
vs. device interrupts which we have today.
If we can use #GP then there is no problem at all and we avoid all the
nasty stuff vs. hotplug and avoid the list walk etc.
...
After reading the ISE docs again, I think it might be possible to use
the ON bit to synchronize.  In the schedule-out path, if we discover
that ON = 1, then there is an IPI in flight to us.  In theory, we
could wait for it, although actually doing so could be a mess.  That's
why I'm asking whether there's a way to tell the APIC to literally
wait for all IPIs that are *already sent* to be delivered.
You could busy poll with interrupts enabled, but that does not solve
anything. What guarantees that after APIC.IRR is clear no further IPI is
sent? Nothing at all. But again, that's not any different from device
interrupts and we already handle that today:
cpu down()
      ...
      disable interrupts();
      for_each_interrupt_affine_to_cpu(irq) {
      	change_affinity_to_online_cpu(irq, new_target_cpu);
        // Did device send to the old vector?
        if (APIC.IRR & vector_bit(old_vector))
           send_IPI(new_target_cpu, new_vector);
      }
So for uintr_wait() w/o #GP we'd need to do:
for_each_waiter_on_cpu(w) {
           move_waiter_to_new_target_cpu_wait_list(w);
           w->ndest = new_target_cpu;
           if (w->ON)
              send_IPI(new_target_cpu, UIWAIT_VECTOR);
      }
...
...
The uintr_wait() syscall creates the very same problem as we have with
device interrupts. Which means we need to make that wait thing:
 upid = T1->upid;
 upid->vector = UINTR_WAIT_VECTOR;

This is exactly what I'm suggesting we *don't* do.  Instead we set a
reserved bit, we decode SENDUIPI in the #GP handler, and we emulate,
in-kernel, the notification process for non-running tasks.
Yes, under the assumption that we can use #GP without breaking device
delivery.
...
Now that I read the docs some more, I'm seriously concerned about this
XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
we actually use this, then the whole last_cpu "preserve the state in
registers" optimization goes out the window.  So does anything that
happens to assume that merely saving the state doesn't destroy it on
respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
makes me nervous and would need a serious audit of our XRSTORS paths.
I have no idea what you are fantasizing about. You can XRSTORS five
times in a row as long as your XSTATE memory image is correct.
If you don't want to use XSAVES to manage UINTR then you have to manualy
fiddle with the MSRs and UIF in schedule() and return to user space.
Also keeping UINV alive when scheduling out creates a life time issue
vs. UPID:
CPU 0   CPU 1                   CPU2
        T1 -> schedule         // UPID is live in UINTR MSRs
        do_stuff_in_kernel()
        local_irq_disable();
                                SENDUIPI(T1 -> CPU1)
pull T1
T1 exits
free UPID
local_irq_enable();
        ucode handles UINV -> UAF
Clearing UINV prevents the ucode from handling the IPI and fiddling with
UPID. The CPU will forward the IPI vector to the kernel which acks it
and does nothing else, i.e. it's a spurious interrupt.
Coming back to state preserving. All what needs to be done for a
situation where the rest of the XSTATE is live in the registers, i.e.
the T -> kthread -> T scheduling scenario, is to restore UINV on exit to
user mode and handle UPID.PIR which might contain newly set bits which
are obviously not yet in UPID.IRR. That can be done by MSR fiddling or
by issuing an self IPI on the UINV vector which will be handled in ucode
on the first user space instruction after return.
When the FPU has to be restored then the state has to be updated in the
XSAVE memory image before doing XRSTORS.
Thanks,
tglx

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall