Hi,
Passing a non-blocking pidfd to waitid() currently has no effect, i.e.
is not supported. There are users which would like to use waitid() on
pidfds that are O_NONBLOCK and mix it with pidfds that are blocking and
both pass them to waitid().
The expected behavior is to have waitid() return -EAGAIN for
non-blocking pidfds and to block for blocking pidfds without needing to
perform any additional checks for flags set on the pidfd before passing
it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child
process is ready yet. Returning -EAGAIN for non-blocking pidfds makes it
easier for event loops that handle EAGAIN specially.
It also makes the API more consistent and uniform. In essence, waitid()
is treated like a read on a non-blocking pidfd or a recvmsg() on a
non-blocking socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports
MSG_DONTWAIT for pidfds waitid() supports WNOHANG. Both flags are
per-call options. In contrast non-blocking pidfds and non-blocking
sockets are a setting on an open file description affecting all threads
in the calling process as well as other processes that hold file
descriptors referring to the same open file description. Both behaviors,
per call and per open file description, have genuine use-cases.
A concrete use-case that was brought on-list (see [1]) was Josh's async
pidfd library. Ever since the introduction of pidfds and more advanced
async io various programming languages such as Rust have grown support
for async event libraries. These libraries are created to help build
epoll-based event loops around file descriptors. A common pattern is to
automatically make all file descriptors they manage to O_NONBLOCK.
For such libraries the EAGAIN error code is treated specially. When a
function is called that returns EAGAIN the function isn't called again
until the event loop indicates the the file descriptor is ready.
Supporting EAGAIN when waiting on pidfds makes such libraries just work
with little effort.
Thanks!
Christian
[1]: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Christian Brauner (4):
pidfd: support PIDFD_NONBLOCK in pidfd_open()
exit: support non-blocking pidfds
tests: port pidfd_wait to kselftest harness
tests: add waitid() tests for non-blocking pidfds
include/uapi/linux/pidfd.h | 12 +
kernel/exit.c | 19 +-
kernel/pid.c | 12 +-
tools/testing/selftests/pidfd/pidfd.h | 4 +
tools/testing/selftests/pidfd/pidfd_wait.c | 298 +++++++++------------
5 files changed, 161 insertions(+), 184 deletions(-)
create mode 100644 include/uapi/linux/pidfd.h
base-commit: d012a7190fc1fd72ed48911e77ca97ba4521bccd
--
2.28.0
Some applications, especially tracing ones, benefit from avoiding the
syscall overhead for getcpu() so it is common for architectures to have
vDSO implementations. Add one for arm64, using TPIDRRO_EL0 to pass a
pointer to per-CPU data rather than just store the immediate value in
order to allow for future extensibility.
It is questionable if something TPIDRRO_EL0 based is worthwhile at all
on current kernels, since v4.18 we have had support for restartable
sequences which can be used to provide a sched_getcpu() implementation
with generally better performance than the vDSO approach on
architectures which have that[1]. Work is ongoing to implement this for
glibc:
https://lore.kernel.org/lkml/20200527185130.5604-3-mathieu.desnoyers@effici…
but is not yet merged and will need similar work for other userspaces.
The main advantages for the vDSO implementation are the node parameter
(though this is a static mapping to CPU number so could be looked up
separately when processing data if it's needed, it shouldn't need to be
in the hot path) and ease of implementation for users.
This is currently not compatible with KPTI due to the use of TPIDRRO_EL0
by the KPTI trampoline, this could be addressed by reinitializing that
system register in the return path but I have found it hard to justify
adding that overhead for all users for something that is essentially a
profiling optimization which is likely to get superceeded by a more
modern implementation - if there are other uses for the per-CPU data
then the balance might change here.
This builds on work done by Kristina Martsenko some time ago but is a
new implementation.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
v3:
- Rebase on v5.9-rc1.
- Drop in progress portions of the series.
v2:
- Rebase on v5.8-rc3.
- Add further cleanup patches & a first draft of multi-page support.
Mark Brown (5):
arm64: vdso: Provide a define when building the vDSO
arm64: vdso: Add per-CPU data
arm64: vdso: Initialise the per-CPU vDSO data
arm64: vdso: Add getcpu() implementation
selftests: vdso: Support arm64 in getcpu() test
arch/arm64/include/asm/processor.h | 12 +----
arch/arm64/include/asm/vdso/datapage.h | 54 +++++++++++++++++++
arch/arm64/kernel/process.c | 26 ++++++++-
arch/arm64/kernel/vdso.c | 33 +++++++++++-
arch/arm64/kernel/vdso/Makefile | 4 +-
arch/arm64/kernel/vdso/vdso.lds.S | 1 +
arch/arm64/kernel/vdso/vgetcpu.c | 48 +++++++++++++++++
.../testing/selftests/vDSO/vdso_test_getcpu.c | 10 ++++
8 files changed, 172 insertions(+), 16 deletions(-)
create mode 100644 arch/arm64/include/asm/vdso/datapage.h
create mode 100644 arch/arm64/kernel/vdso/vgetcpu.c
--
2.20.1