Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

1 Oct 2021


On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
...
On Thu, Sep 30 2021 at 15:01, Andy Lutomirski wrote:
...
On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:
...
But even with that we still need to keep track of the armed ones per CPU
so we can handle CPU hotunplug correctly. Sigh...
I don’t think any real work is needed. We will only ever have armed
UPIDs (with notification interrupts enabled) for running tasks, and
hot-unplugged CPUs don’t have running tasks.
That's not the problem. The problem is the wait for uintr case where the
task is obviously not running:
CPU 1
     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     ...
     do {
         ....
         schedule();
     }
CPU 0
    unplug CPU 1
SENDUIPI(index)
    // Hardware does:
    tblentry = &ttable[index];
    upid = tblentry->upid;
    upid->pir |= tblentry->uv;
    send_IPI(upid->vector, upid->ndst);


So SENDUIPI will send the IPI to the APIC ID provided by T1->upid.ndst
which points to the offlined CPU 1 and therefore is obviously going to
/dev/null. IOW, lost wakeup...
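
The lost-wakeup scenario described above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct upid`, `struct uitt_entry`, `cpu_online[]`, `send_ipi()` and `senduipi()` are hypothetical stand-ins, the UPID is simplified to separate fields rather than the hardware layout, and the PIR update is modelled as setting bit `uv`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2

/* Simplified, hypothetical model of a UPID; not the hardware layout. */
struct upid {
    uint64_t pir;     /* posted user-interrupt requests, one bit per vector */
    uint8_t  vector;  /* notification vector */
    uint32_t ndst;    /* destination APIC ID */
};

/* Hypothetical sender-side table entry, as in ttable[index] above. */
struct uitt_entry {
    struct upid *upid;
    uint8_t uv;       /* user vector, 0..63 */
};

static bool cpu_online[NR_CPUS] = { true, true };
static unsigned int ipis_delivered[NR_CPUS];

/* Model of send_IPI(): an IPI aimed at an offline CPU is silently dropped. */
static bool send_ipi(uint8_t vector, uint32_t ndst)
{
    (void)vector;
    assert(ndst < NR_CPUS);
    if (!cpu_online[ndst])
        return false;
    ipis_delivered[ndst]++;
    return true;
}

/* Model of SENDUIPI: post the vector in PIR, then notify upid->ndst. */
static bool senduipi(const struct uitt_entry *e)
{
    assert(e->uv < 64);
    e->upid->pir |= UINT64_C(1) << e->uv;
    return send_ipi(e->upid->vector, e->upid->ndst);
}
```

In this model, clearing `cpu_online[1]` before calling `senduipi()` on an entry whose `ndst` is 1 leaves the bit set in `pir` but delivers no IPI, which is the case where a sleeping waiter would never be woken.
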
Yes, but I don't think this is how we should structure this.
CPU 1
 upid->vector = UINV;
 upid->ndst = local_apic_id();
 exit to usermode;
 return from usermode;
 ...
 schedule();
  fpu__save_crap [see below]:
   if (this task is waiting for a uintr) {
     upid->resv0 = 1;  /* arm #GP */
   } else {
     upid->sn = 1;
   }
...
...
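
As a rough illustration of the schedule-out branch in the pseudocode above, here is a minimal C sketch. The `struct upid` and `struct task` types and the name `uintr_sched_out()` are hypothetical; the two flags only model the `resv0` and `sn` fields used in the pseudocode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical model of the two UPID flags used above. */
struct upid {
    bool sn;     /* suppress notification */
    bool resv0;  /* reserved bit, set to arm #GP on the sender */
};

struct task {
    struct upid *upid;
    bool waiting_for_uintr;
};

/* Schedule-out hook: pick the UPID state for a task leaving the CPU. */
static void uintr_sched_out(struct task *t)
{
    assert(t != NULL && t->upid != NULL);
    if (t->waiting_for_uintr)
        t->upid->resv0 = true;  /* arm #GP so the kernel sees the send */
    else
        t->upid->sn = true;     /* no notification IPI while not running */
}
```

The point of the split is that no notification vector or APIC ID has to stay valid for a task that is not running.
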
We do need a way to drain pending IPIs before we offline a CPU, but
that’s a separate problem and may be unsolvable for all I know. Is
there a magic APIC operation to wait until all initiated IPIs
targeting the local CPU arrive?  I guess we can also just mask the
notification vector so that it won’t crash us if we get a stale IPI
after going offline.
All of this is solved already otherwise CPU hot unplug would explode in
your face every time. The software IPI send side is carefully
synchronized vs. hotplug (at least in theory). May I ask you politely to
make yourself familiar with all that before touting "We do need..." based
on random assumptions?
I'm aware that the software send IPI side is synchronized against
hotplug.  But SENDUIPI is not unless we're going to have the CPU offline
code IPI every other CPU to make sure that their SENDUIPIs have
completed -- we don't control the SENDUIPI code.

After reading the ISE docs again, I think it might be possible to use
the ON bit to synchronize.  In the schedule-out path, if we discover
that ON = 1, then there is an IPI in flight to us.  In theory, we could
wait for it, although actually doing so could be a mess.  That's why I'm
asking whether there's a way to tell the APIC to literally wait for all
IPIs that are *already sent* to be delivered.
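
A sketch of what the ON check in the schedule-out path might look like, using C11 atomics in place of kernel primitives. The bit positions (ON = bit 0, SN = bit 1) are an assumption taken from the ISE description of the UPID, and `upid_suppress_and_test_on()` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UPID_ON (1u << 0)  /* outstanding notification (assumed position) */
#define UPID_SN (1u << 1)  /* suppress notification (assumed position) */

/*
 * Atomically set SN and report whether ON was already set, i.e. whether
 * a notification may have been sent that has not been processed yet.
 */
static bool upid_suppress_and_test_on(_Atomic uint32_t *status)
{
    assert(status != NULL);
    uint32_t old = atomic_fetch_or(status, UPID_SN);
    return (old & UPID_ON) != 0;
}
```

This only detects the in-flight case; what to do about it (waiting, or masking the vector) is the open question in the paragraph above.
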
...
The above SENDUIPI vs. CPU hotplug scenario is the same problem as we
have with regular device interrupts which are targeted at an outgoing
CPU. We have magic mechanisms in place to handle that to the extent
possible, but due to the insanity of X86 interrupt handling mechanics
that still leaves a very tiny hole which might cause a lost and
subsequently stale interrupt. Nothing we can fix in software.
So on CPU offline the hotplug code walks through all device interrupts
and checks whether they are targeted at the outgoing CPU. If so they are
rerouted to an online CPU with lots of care to make the possible race
window as small as it gets. That's nowadays only a problem on systems
where interrupt remapping is not available or disabled via commandline.
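
The walk over device interrupts described above can be pictured with a toy model. This is not the kernel's hotplug code: `irq_target[]` and `reroute_irqs_from()` are hypothetical, and the careful ordering that narrows the race window is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: irq_target[i] is the CPU that IRQ i is routed to.
 * Move every IRQ aimed at the outgoing CPU to a fallback online CPU and
 * return how many were moved.
 */
static size_t reroute_irqs_from(unsigned int *irq_target, size_t nr_irqs,
                                unsigned int outgoing, unsigned int fallback)
{
    size_t moved = 0;

    assert(irq_target != NULL && outgoing != fallback);
    for (size_t i = 0; i < nr_irqs; i++) {
        if (irq_target[i] == outgoing) {
            irq_target[i] = fallback;
            moved++;
        }
    }
    return moved;
}
```
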
For tasks which just have the user interrupt armed there is no problem
because SENDUIPI modifies UPID->PIR which is reevaluated when the task
which got migrated to an online CPU is going back to user space.
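
Why the merely-armed case is safe can be shown with a small model: the posted bits live in memory, so they survive even if the notification IPI goes nowhere. The function names below are hypothetical, and C11 atomics stand in for whatever the hardware and kernel actually do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Sender side: post user vector uv in the PIR (the IPI may be lost). */
static void post_user_vector(_Atomic uint64_t *pir, unsigned int uv)
{
    assert(pir != NULL && uv < 64);
    atomic_fetch_or(pir, UINT64_C(1) << uv);
}

/*
 * Receiver side, on the way back to user space: take whatever has been
 * posted so far, regardless of whether any notification IPI arrived.
 */
static uint64_t collect_pending_on_exit_to_user(_Atomic uint64_t *pir)
{
    assert(pir != NULL);
    return atomic_exchange(pir, 0);
}
```

A task blocked in a wait never reaches the collect step on its own, which is why the wait case needs something to actually wake it.
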
The uintr_wait() syscall creates the very same problem as we have with
device interrupts. Which means we need to make that wait thing:
 upid = T1->upid;
 upid->vector = UINTR_WAIT_VECTOR;

This is exactly what I'm suggesting we *don't* do.  Instead we set a
reserved bit, we decode SENDUIPI in the #GP handler, and we emulate,
in-kernel, the notification process for non-running tasks.
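
A minimal sketch of the emulation idea, assuming the #GP handler has already decoded the faulting SENDUIPI and located the target task. All names here (`struct task`, `emulate_senduipi()`, the `waiting` and `runnable` flags) are hypothetical, and a plain flag flip stands in for a real wakeup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical UPID: posted requests plus the armed bit. */
struct upid {
    uint64_t pir;
    bool resv0;   /* set at schedule-out to make SENDUIPI fault */
};

struct task {
    struct upid upid;
    bool waiting;   /* blocked waiting for a user interrupt */
    bool runnable;
};

/*
 * Called from a (hypothetical) #GP handler after decoding SENDUIPI.
 * Returns false if the UPID was not armed, i.e. the fault is not ours.
 */
static bool emulate_senduipi(struct task *target, unsigned int uv)
{
    assert(target != NULL && uv < 64);
    if (!target->upid.resv0)
        return false;

    target->upid.pir |= UINT64_C(1) << uv;  /* post the request */
    if (target->waiting) {                  /* software wakeup, no IPI */
        target->waiting = false;
        target->runnable = true;
    }
    return true;
}
```

Nothing in this path refers to a notification vector or an APIC ID, so it does not matter which CPU the sleeping task last ran on.
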
Now that I read the docs some more, I'm seriously concerned about this
XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If we
actually use this, then the whole last_cpu "preserve the state in
registers" optimization goes out the window.  So does anything that
happens to assume that merely saving the state doesn't destroy it on
respectable modern CPUs.  XRSTORS will #GP if you XRSTORS twice, which
makes me nervous and would need a serious audit of our XRSTORS paths.
This is gross.
--Andy
