Excerpts from André Almeida's message of June 5, 2021 6:01 am:
Às 08:36 de 04/06/21, Nicholas Piggin escreveu:
Excerpts from André Almeida's message of June 4, 2021 5:59 am:
Hi,
This patch series introduces the futex2 syscalls.
- What happened to the current futex()?
For some years now, developers have been trying to add new features to futex, but maintainers have been reluctant to accept then, given the multiplexed interface full of legacy features and tricky to do big changes. Some problems that people tried to address with patchsets are: NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2]. NUMA, for instance, just doesn't fit the current API in a reasonable way. Considering that, it's not possible to merge new features into the current futex.
** The NUMA problem
At the current implementation, all futex kernel side infrastructure is stored on a single node. Given that, all futex() calls issued by processors that aren't located on that node will have a memory access penalty when doing it.
** The 32bit sized futex problem
Futexes are used to implement atomic operations in userspace. Supporting 8, 16, 32 and 64 bit sized futexes allows user libraries to implement all those sizes in a performant way. Thanks Boost devs for feedback: https://lists.boost.org/Archives/boost/2021/05/251508.php
Embedded systems or anything with memory constrains could benefit of using smaller sizes for the futex userspace integer.
** The wait on multiple problem
The use case lies in the Wine implementation of the Windows NT interface WaitMultipleObjects. This Windows API function allows a thread to sleep waiting on the first of a set of event sources (mutexes, timers, signal, console input, etc) to signal. Considering this is a primitive synchronization operation for Windows applications, being able to quickly signal events on the producer side, and quickly go to sleep on the consumer side is essential for good performance of those running over Wine.
[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/ [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/ [2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.c...
- The solution
As proposed by Peter Zijlstra and Florian Weimer[3], a new interface is required to solve this, which must be designed with those features in mind. futex2() is that interface. As opposed to the current multiplexed interface, the new one should have one syscall per operation. This will allow the maintainability of the API if it gets extended, and will help users with type checking of arguments.
In particular, the new interface is extended to support the ability to wait on any of a list of futexes at a time, which could be seen as a vectored extension of the FUTEX_WAIT semantics.
[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-a...
- The interface
The new interface can be seen in details in the following patches, but this is a high level summary of what the interface can do:
- Supports wake/wait semantics, as in futex()
- Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with individual flags for each address
- Supports waiting for a vector of futexes, using a new syscall named futex_waitv()
- Supports variable sized futexes (8bits, 16bits, 32bits and 64bits)
- Supports NUMA-awareness operations, where the user can specify on which memory node would like to operate
A few comments.
- Do you need a syscall for wait and for waitv, or can wait just be a
degenerate case of waitv (which could use the stack variable)? I guess it does save the user access unlock.
Yes. waitv with one element has a overhead compared to wait, given the extra user_copy(). Given that waiting on a single futex is the common case, I think it's worth to have it.
Okay.
- Did you consider a wakev interface? An example is a read-write mutex
which has read-blocking futexes split (e.g., per-node) for scalability then the writer may unlock and wake all readers with one call. We actually have some scalability challenges of this nature with certain large database programs.
Not really, I haven't heard any use case for that until now. It should be easy to implement it, though, and I think you have an interesting use case here. Could you point me some of those database programs?
Not source code unfortunately. I know that's not a very good answer, but they are far ahead of what most open source apps are doing scalability wise today, and they end up rolling their own complex locking. Hopefully the example I give is simple enough to understand.
- Great to see 64-bit support in, it is helpful to do some interesting
things with locks without hacks (e.g., putting an address in the lock word).
- Are we really keen on squashing node ID into flags in this day and age?
I guess okay but seems like it would be nice to allow a bit more space in general for the operations. I don't want to turn it into a whole big multiplexing nightmare again with lots of such flags, or propose complexity with no code behind it, but I think we need a bit of leeway for unforeseen locking innovations to be added carefully. The pthread locking today is still fairly primitive really, I don't think we know what will work best for the next 10 years.
In the interface that I'd proposed, the node ID isn't part of the flags. You have a flag FUTEX_FLAG_NUMA, and when that is used, you pass in `void *uaddr` a pointer to a `struct futex_numa { int value, int hint }`, where hint should be the node ID you would like to work on, and value is just the userspace futex. This is documented in more details in patch 7 "docs: locking: futex2: Add documentation".
If you have any feedback about how this NUMA interface looks like, I would like to hear.
Also, did something in my writing indicated that the node ID would be part of the flags? I'll improve this it if so.
Oh I did miss this, thank you. No it wasn't your writing, I think it was me trying to read through a lot of messages and got confused with some earlier conversations.
I'll look a bit more at the NUMA interface.
One scalability issue we are starting to hit and will only get worse is futex queue spinlock contention. Perhaps this is better addressed in userspace but the kernel could play a part so I like to leave some doors open. One example is that the wait (or wake) side may like to depend not just on the memory value, but on the success of a cmpxchg to avoid unqueueing and queueing spuriously, which increases lock contention but also ends up putting the poor task on the back of the list -- yes RT priorities can help the realtime case, but non-RT cases can get bad outlier latencies if lock stealing is allowed (which can be very good for performance).
Sorry, I'm not sure what do you mean here. Are you proposing to have a cmpxchg in kernel side, so the lock would be taken by the kernel, and not by the userspace like it's now?
Yes. Only in slow paths, of course, to reduce starvation / erratic latencies and spurious wait queue manipulations.
Actually one other scalability thing while I remember it:
futex_wait currently requires that the lock word is tested under the queue spin lock (to avoid consuming a wakeup). The problem with this is that the lock word can be a very hot cache line if you have a lot of concurrency, so accessing it under the queue lock can increase queue lock hold time.
I would prefer if the new API was relaxed to avoid this restriction (e.g., any wait call may consume a wakeup so it's up to userspace to avoid that if it is a problem).
- The private global futex hash table sucks for various reasons, and
having 128 waiters per thread makes it orders of magnitude easier for userspace to DoS stuff with hash collisions. NUMA doesn't fix that, the per process hashing that Thomas suggested does fix the DoS but the non-deterministic hash collisions still seem to be a problem for real time response, and at the other end of the scale some apps (certain databases, etc) can have ten thousand futex waiters at once so birthday paradox can also lead to guaranteed (low level) variable beahviour within a single process.
I know the kernel in general is not very robust against this kind of DoS/nondeterminism, but it's a bit sad to introduce new APIs with the problem still there. Yes we could address it later, but I think it's better done first because the solution might influence what the best syscall API is.
For example the idea of using the address as the handle for the wait queue _and_ the value is very convenient but hashing is annoying for all the above reasons and the shared wait queue case is pretty clunky. It's also constraining in some corner cases to have the wait queue associated with the address 1:1. For example a type of NUMA mutex might want to have per-node waitqueues associated with a lock word, and wake local node waiters 99% of the time. Not trivial to do with futexes and seems to at least require bouncing of more cache lines, possibly more atomics, etc.
Could anything else be done?
I wasn't aware that userspace doing DoS is something to be concerned from the kernel point of view. Is this scenario considering a malicious actor? If so, there are plenty of resources to be denied, so not sure how futex could be protected of this. Or is this just a program that uses tons of futexes?
Both really. AFAIKS one of the efforts that prompted the futex modernisation work was the RT latency issues from Thomas in 2016 when the per process table was proposed.
I'll be burned at the stake for suggesting it but it would be great if we could use file descriptors. At least for the shared futex, maybe private could use a per-process futex allocator. It solves all of the above, although I'm sure has many of its own problem. It may not play so nicely with the pthread mutex API because of the whole static initialiser problem, but the first futex proposal did use fds. But it's an example of an alternate API.
FDs and futex doesn't play well, because for futex_wait() you need to tell the kernel the expected value in the futex address to avoid sleeping in a free lock. FD operations (poll, select) don't have this `value` argument, so they could sleep forever, but I'm not sure if you had taken this in consideration.
I had. The futex wait API would take a fd additional. The only difference is the waitqueue that is used when a sleep or wake is required is derived from the fd, not from an address.
I think the bigger sticking points would be if it's too heavyweight an object to use (which could be somewhat mitigated with a simpler ida allocator although that's difficult to do with shared), and whether libc could sanely use them due to the static initialiser problem of pthread mutexes.
Thanks, Nick