Hi,
Passing a non-blocking pidfd to waitid() currently has no effect, i.e.
is not supported. There are users which would like to use waitid() on
pidfds that are O_NONBLOCK and mix it with pidfds that are blocking and
both pass them to waitid().
The expected behavior is to have waitid() return -EAGAIN for
non-blocking pidfds and to block for blocking pidfds without needing to
perform any additional checks for flags set on the pidfd before passing
it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child
process is ready yet. Returning -EAGAIN for non-blocking pidfds makes it
easier for event loops that handle EAGAIN specially.
It also makes the API more consistent and uniform. In essence, waitid()
is treated like a read on a non-blocking pidfd or a recvmsg() on a
non-blocking socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports
MSG_DONTWAIT for pidfds waitid() supports WNOHANG. Both flags are
per-call options. In contrast non-blocking pidfds and non-blocking
sockets are a setting on an open file description affecting all threads
in the calling process as well as other processes that hold file
descriptors referring to the same open file description. Both behaviors,
per call and per open file description, have genuine use-cases.
A concrete use-case that was brought on-list (see [1]) was Josh's async
pidfd library. Ever since the introduction of pidfds and more advanced
async io various programming languages such as Rust have grown support
for async event libraries. These libraries are created to help build
epoll-based event loops around file descriptors. A common pattern is to
automatically make all file descriptors they manage to O_NONBLOCK.
For such libraries the EAGAIN error code is treated specially. When a
function is called that returns EAGAIN the function isn't called again
until the event loop indicates the the file descriptor is ready.
Supporting EAGAIN when waiting on pidfds makes such libraries just work
with little effort.
Thanks!
Christian
[1]: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Christian Brauner (4):
pidfd: support PIDFD_NONBLOCK in pidfd_open()
exit: support non-blocking pidfds
tests: port pidfd_wait to kselftest harness
tests: add waitid() tests for non-blocking pidfds
include/uapi/linux/pidfd.h | 12 +
kernel/exit.c | 19 +-
kernel/pid.c | 12 +-
tools/testing/selftests/pidfd/pidfd.h | 4 +
tools/testing/selftests/pidfd/pidfd_wait.c | 298 +++++++++------------
5 files changed, 161 insertions(+), 184 deletions(-)
create mode 100644 include/uapi/linux/pidfd.h
base-commit: d012a7190fc1fd72ed48911e77ca97ba4521bccd
--
2.28.0
Some applications, especially tracing ones, benefit from avoiding the
syscall overhead for getcpu() so it is common for architectures to have
vDSO implementations. Add one for arm64, using TPIDRRO_EL0 to pass a
pointer to per-CPU data rather than just store the immediate value in
order to allow for future extensibility.
It is questionable if something TPIDRRO_EL0 based is worthwhile at all
on current kernels, since v4.18 we have had support for restartable
sequences which can be used to provide a sched_getcpu() implementation
with generally better performance than the vDSO approach on
architectures which have that[1]. Work is ongoing to implement this for
glibc:
https://lore.kernel.org/lkml/20200527185130.5604-3-mathieu.desnoyers@effici…
but is not yet merged and will need similar work for other userspaces.
The main advantages for the vDSO implementation are the node parameter
(though this is a static mapping to CPU number so could be looked up
separately when processing data if it's needed, it shouldn't need to be
in the hot path) and ease of implementation for users.
This is currently not compatible with KPTI due to the use of TPIDRRO_EL0
by the KPTI trampoline, this could be addressed by reinitializing that
system register in the return path but I have found it hard to justify
adding that overhead for all users for something that is essentially a
profiling optimization which is likely to get superceeded by a more
modern implementation - if there are other uses for the per-CPU data
then the balance might change here.
This builds on work done by Kristina Martsenko some time ago but is a
new implementation.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
v3:
- Rebase on v5.9-rc1.
- Drop in progress portions of the series.
v2:
- Rebase on v5.8-rc3.
- Add further cleanup patches & a first draft of multi-page support.
Mark Brown (5):
arm64: vdso: Provide a define when building the vDSO
arm64: vdso: Add per-CPU data
arm64: vdso: Initialise the per-CPU vDSO data
arm64: vdso: Add getcpu() implementation
selftests: vdso: Support arm64 in getcpu() test
arch/arm64/include/asm/processor.h | 12 +----
arch/arm64/include/asm/vdso/datapage.h | 54 +++++++++++++++++++
arch/arm64/kernel/process.c | 26 ++++++++-
arch/arm64/kernel/vdso.c | 33 +++++++++++-
arch/arm64/kernel/vdso/Makefile | 4 +-
arch/arm64/kernel/vdso/vdso.lds.S | 1 +
arch/arm64/kernel/vdso/vgetcpu.c | 48 +++++++++++++++++
.../testing/selftests/vDSO/vdso_test_getcpu.c | 10 ++++
8 files changed, 172 insertions(+), 16 deletions(-)
create mode 100644 arch/arm64/include/asm/vdso/datapage.h
create mode 100644 arch/arm64/kernel/vdso/vgetcpu.c
--
2.20.1
Currently kunit_tool does not work correctly when executed from a path
outside of the kernel tree, so make sure that the current working
directory is correct and the kunit_dir is properly initialized before
running.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
tools/testing/kunit/kunit.py | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/tools/testing/kunit/kunit.py b/tools/testing/kunit/kunit.py
index 425ef40067e7e..e2caf4e24ecb2 100755
--- a/tools/testing/kunit/kunit.py
+++ b/tools/testing/kunit/kunit.py
@@ -237,9 +237,13 @@ def main(argv, linux=None):
cli_args = parser.parse_args(argv)
+ if get_kernel_root_path():
+ os.chdir(get_kernel_root_path())
+
if cli_args.subcommand == 'run':
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -257,6 +261,7 @@ def main(argv, linux=None):
if cli_args.build_dir:
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -270,10 +275,6 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'build':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
-
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -288,10 +289,6 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'exec':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
-
if not linux:
linux = kunit_kernel.LinuxSourceTree()
base-commit: 30185b69a2d533c4ba6ca926b8390ce7de495e29
--
2.28.0.236.gb10cc79966-goog
Some tests might not be able to be run if resources like huge pages are
not available. Mark these tests as skipped instead of simply passing.
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
---
This applies to linux-mm and is for Andrew Morton's tree.
tools/testing/selftests/vm/hmm-tests.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 93fc5cadce61..0a28a6a29581 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -680,7 +680,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
n = gethugepagesizes(pagesizes, 4);
if (n <= 0)
- return;
+ SKIP(return, "Huge page size could not be determined");
for (idx = 0; --n > 0; ) {
if (pagesizes[n] < pagesizes[idx])
idx = n;
@@ -694,7 +694,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
buffer->ptr = get_hugepage_region(size, GHR_STRICT);
if (buffer->ptr == NULL) {
free(buffer);
- return;
+ SKIP(return, "Huge page could not be allocated");
}
buffer->fd = -1;
--
2.20.1
From: Mattias Nissler <mnissler(a)chromium.org>
For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.
Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.
Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.
More information on the history and motivation for this patch can be
found here:
https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-d…
Signed-off-by: Mattias Nissler <mnissler(a)chromium.org>
Signed-off-by: Ross Zwisler <zwisler(a)google.com>
Reviewed-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes since v7 [1]:
* Rebased onto v5.9-rc1.
* Added selftest in second patch.
* Added Aleska's Reviewed-By tag. Thank you for the review!
After this lands I will upstream changes to util-linux[2] and man-pages
[3].
[1]: https://lkml.org/lkml/2020/8/11/896
[2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55…
[3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e58…
---
fs/namei.c | 3 ++-
fs/namespace.c | 2 ++
fs/proc_namespace.c | 1 +
fs/statfs.c | 2 ++
include/linux/mount.h | 3 ++-
include/linux/statfs.h | 1 +
include/uapi/linux/mount.h | 1 +
7 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7d..12d92af2e2ca7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
return ERR_PTR(error);
}
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
+ unlikely(nd->path.mnt->mnt_flags & MNT_NOSYMFOLLOW))
return ERR_PTR(-ELOOP);
if (!(nd->flags & LOOKUP_RCU)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index bae0e95b3713a..6408788a649e1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3160,6 +3160,8 @@ int path_mount(const char *dev_name, struct path *path,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_NOSYMFOLLOW)
+ mnt_flags |= MNT_NOSYMFOLLOW;
/* The default atime for remount is preservation */
if ((flags & MS_REMOUNT) &&
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3059a9394c2d6..e59d4bb3a89e4 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+ { MNT_NOSYMFOLLOW, ",nosymfollow" },
{ 0, NULL }
};
const struct proc_fs_opts *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 2616424012ea7..59f33752c1311 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
flags |= ST_NODIRATIME;
if (mnt_flags & MNT_RELATIME)
flags |= ST_RELATIME;
+ if (mnt_flags & MNT_NOSYMFOLLOW)
+ flags |= ST_NOSYMFOLLOW;
return flags;
}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index de657bd211fa6..aaf343b38671c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct fs_context;
#define MNT_NODIRATIME 0x10
#define MNT_RELATIME 0x20
#define MNT_READONLY 0x40 /* does the user want this to be r/o? */
+#define MNT_NOSYMFOLLOW 0x80
#define MNT_SHRINKABLE 0x100
#define MNT_WRITE_HOLD 0x200
@@ -46,7 +47,7 @@ struct fs_context;
#define MNT_SHARED_MASK (MNT_UNBINDABLE)
#define MNT_USER_SETTABLE_MASK (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
| MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
- | MNT_READONLY)
+ | MNT_READONLY | MNT_NOSYMFOLLOW)
#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
diff --git a/include/linux/statfs.h b/include/linux/statfs.h
index 9bc69edb8f188..fac4356ea1bfc 100644
--- a/include/linux/statfs.h
+++ b/include/linux/statfs.h
@@ -40,6 +40,7 @@ struct kstatfs {
#define ST_NOATIME 0x0400 /* do not update access times */
#define ST_NODIRATIME 0x0800 /* do not update directory access times */
#define ST_RELATIME 0x1000 /* update atime relative to mtime/ctime */
+#define ST_NOSYMFOLLOW 0x2000 /* do not follow symlinks */
struct dentry;
extern int vfs_get_fsid(struct dentry *dentry, __kernel_fsid_t *fsid);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fed..dd8306ea336c1 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -16,6 +16,7 @@
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_NOSYMFOLLOW 256 /* Do not follow symlinks */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
--
2.28.0.220.ged08abb693-goog