* Peter Zijlstra:
So how about we introduce new syscalls:
sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);
struct futex_wait {
	void *uaddr;
	unsigned long val;
	unsigned long flags;
};
sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters, unsigned long flags, ktime_t *timo);
sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);
sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake, unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);
Where flags:
- has 2 bits for size: 8,16,32,64
- has 2 more bits for size (requeue) ??
- has ... bits for clocks
- has private/shared
- has numa
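To make the flags word concrete, here is a rough sketch of one possible encoding. Every macro name and bit position below is an assumption for illustration only; the proposal does not assign any of them. The two size bits are read as log2 of the size in bytes, so they can cover 8, 16, 32 and 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag layout: names and bit positions are NOT from the proposal. */
#define FUTEX_EX_SIZE_MASK 0x3UL      /* 2 bits: log2 of the size in bytes */
#define FUTEX_EX_SIZE_U8   0x0UL
#define FUTEX_EX_SIZE_U16  0x1UL
#define FUTEX_EX_SIZE_U32  0x2UL
#define FUTEX_EX_SIZE_U64  0x3UL
#define FUTEX_EX_PRIVATE   (1UL << 2) /* private/shared */
#define FUTEX_EX_NUMA      (1UL << 3) /* node_id present at uaddr-4 */

/* The non-size flags must not overlap the size field. */
static_assert(((FUTEX_EX_PRIVATE | FUTEX_EX_NUMA) & FUTEX_EX_SIZE_MASK) == 0,
              "flag bits overlap the size field");

/* Decode the futex size in bytes from a flags word. */
static inline unsigned int futex_ex_size_bytes(unsigned long flags)
{
    return 1u << (flags & FUTEX_EX_SIZE_MASK);
}

/* Decode the futex size in bits from a flags word. */
static inline unsigned int futex_ex_size_bits(unsigned long flags)
{
    return 8u * futex_ex_size_bytes(flags);
}
```

With this encoding, the clock bits and a second size field for requeue would have to go in the bits above bit 3.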
What's the actual type of *uaddr? Does it vary with the size (which I assume is given in bits)? Are there alignment constraints?
These system calls still seem to be type-polymorphic, which is problematic for defining a really nice C interface. I would really like to have a strongly typed interface for this, with a nice struct futex wrapper type (even if it means that we need four of them).
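Roughly what I have in mind is sketched below. All type and function names are hypothetical, and futex_wait_raw is only a user-space stand-in for the polymorphic system call: it performs the value comparison and returns -ENOSYS at the point where a real call would block. A real implementation would also load the value atomically.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical wrapper types, one per futex size. */
struct futex8  { uint8_t  value; };
struct futex16 { uint16_t value; };
struct futex32 { uint32_t value; };
struct futex64 { uint64_t value; };

/*
 * Stand-in for the untyped system call (an assumption, not a real API).
 * log2_size selects the access width. Returns -EAGAIN if the futex word
 * differs from val, -ENOSYS where the real call would go to sleep.
 */
static long futex_wait_raw(const void *uaddr, uint64_t val, unsigned int log2_size)
{
    uint64_t cur;

    switch (log2_size) {
    case 0: cur = *(const uint8_t *)uaddr;  break;
    case 1: cur = *(const uint16_t *)uaddr; break;
    case 2: cur = *(const uint32_t *)uaddr; break;
    case 3: cur = *(const uint64_t *)uaddr; break;
    default: return -EINVAL;
    }
    return cur == val ? -ENOSYS : -EAGAIN;
}

/* Typed entry points: the compiler checks pointer and value types. */
static inline long futex8_wait(const struct futex8 *f, uint8_t val)
{
    return futex_wait_raw(&f->value, val, 0);
}

static inline long futex16_wait(const struct futex16 *f, uint16_t val)
{
    return futex_wait_raw(&f->value, val, 1);
}

static inline long futex32_wait(const struct futex32 *f, uint32_t val)
{
    return futex_wait_raw(&f->value, val, 2);
}

static inline long futex64_wait(const struct futex64 *f, uint64_t val)
{
    return futex_wait_raw(&f->value, val, 3);
}
```

Passing a struct futex16 * to futex32_wait is then a compile-time diagnostic instead of a silent mismatch between the pointer and the size bits in flags.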
Will all architectures support all sizes? If not, how do we probe which size/flags combinations are supported?
For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int node_id', with the following semantics:
on WAIT, node_id is read and, when 0 <= node_id <= nr_nodes, is used directly to index into the per-node hash-tables. When it is -1, it is replaced by the current node_id and an smp_mb() is issued before we load and compare the @uaddr.
on WAKE/REQUEUE, it is an immediate index.
Does this mean the first waiter determines the NUMA index, and all future waiters use the same chain even if they are on different nodes?
I think documenting this as a node index would be a mistake. It could be an arbitrary hint for locating the corresponding kernel data structures.
Any invalid value will result in EINVAL.
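To check my reading of the node_id rules, here is a small model of them in plain C. The function names and FUTEX_NODE_CURRENT are hypothetical. I am also assuming that the upper bound is exclusive (node_id < nr_nodes), since the value is used as a table index, and the write-back of the current node and the smp_mb() are not modeled.

```c
#include <errno.h>

#define FUTEX_NODE_CURRENT (-1) /* hypothetical name for the "-1" marker */

/*
 * WAIT: a valid node_id is used as-is; -1 is replaced by the current node.
 * Returns the hash-table index, or -EINVAL for any other value.
 */
static inline int futex_wait_resolve_node(int node_id, int nr_nodes, int current_node)
{
    if (node_id == FUTEX_NODE_CURRENT)
        return current_node; /* kernel side would also replace the value at uaddr-4 */
    if (node_id >= 0 && node_id < nr_nodes)
        return node_id;
    return -EINVAL;
}

/* WAKE/REQUEUE: node_id is an immediate index, with no -1 substitution. */
static inline int futex_wake_resolve_node(int node_id, int nr_nodes)
{
    if (node_id >= 0 && node_id < nr_nodes)
        return node_id;
    return -EINVAL;
}
```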
Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the need to maintain alignment and avoid padding.
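The layout problem looks like this in C. The struct names and the helper are made up for illustration. With a 32-bit value the pair packs naturally, but a naturally aligned 64-bit value needs four bytes of explicit padding in front of node_id to keep it at exactly uaddr-4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit futex: node_id sits directly in front of the value, no padding. */
struct numa_futex32 {
    int32_t  node_id;
    uint32_t value;
};

/*
 * 64-bit futex: to keep value 8-byte aligned with node_id at value-4,
 * the wrapper has to carry 4 bytes of explicit padding up front.
 */
struct numa_futex64 {
    uint32_t pad;
    int32_t  node_id;
    uint64_t value;
};

static_assert(offsetof(struct numa_futex32, value) ==
              offsetof(struct numa_futex32, node_id) + 4,
              "node_id must be at uaddr-4");
static_assert(offsetof(struct numa_futex64, value) ==
              offsetof(struct numa_futex64, node_id) + 4,
              "node_id must be at uaddr-4");
static_assert(offsetof(struct numa_futex64, value) % 8 == 0,
              "64-bit value must stay naturally aligned within the struct");
static_assert(sizeof(struct numa_futex64) == 16,
              "padding doubles the footprint");

/* Locate node_id from the futex address, the way the kernel would. */
static inline int32_t *numa_futex_node_id(void *uaddr)
{
    return (int32_t *)((char *)uaddr - 4);
}
```

So the 64-bit variant costs 16 bytes for 12 bytes of payload, unless something useful can be put into the pad field.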
Thanks, Florian