This is something that I've been thinking about for a while. We had a discussion at LPC 2020 about this[1] but the proposals suggested there never materialised.
In short, it is quite difficult for userspace to detect the feature capability of syscalls at runtime. This is something a lot of programs want to do, but they are forced to create elaborate scenarios to try to figure out if a feature is supported without causing damage to the system. For the vast majority of cases, each individual feature also needs to be tested individually (because syscall results are all-or-nothing), so testing even a single syscall's feature set can easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this talk[2] and in some old LKML threads (though I can't find the exact references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the top bits are free for us to use as makeshift flag bits. This is preferable to using the per-syscall flag field inside the structure because seccomp can easily detect the bit in the flag and allow the probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid flags or bits. For pointer fields or fields that take an arbitrary value, the field has every bit set (0xFF... to fill the field) to indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will only ever be used for this purpose (so userspace can be sure that the request succeeded).
On older kernels, the syscall will return a different error (usually -E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by looking at the fields in the returned structure. Flags are checked by doing an AND with the flags field, and field support can checked by comparing to 0. In principle you could just AND the entire structure if you wanted to do this check generically without caring about the structure contents (this is what libraries might consider doing).
Userspace can even find out the internal kernel structure size by passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and backwards- compatible.
This allows programas to get a one-shot understanding of what features a syscall supports without having to do any elaborate setups or tricks to detect support for destructive features. Flags can simply be ANDed to check if they are in the supported set, and fields can just be checked to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to introspect the feature set of extensible struct (copy_struct_from_user) syscalls. It doesn't preclude the chance of a more generic mechanism being added later.
The intended way of using this interface to get feature information looks something like the following (imagine that openat2 has gained a new field and a new flag in the future):
static bool openat2_no_automount_supported; static bool openat2_cwd_fd_supported;
int check_openat2_support(void) { int err; struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT); openat2_cwd_fd_supported = (how.cwd_fd != 0); break; } }
This series adds CHECK_FIELDS support for the following extensible struct syscalls, as they are quite likely to grow flags in the near future:
* openat2 * clone3 * mount_setattr
[1]: https://lwn.net/Articles/830666/ [2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- Changes in v3: - Fix copy_struct_to_user() return values in case of clear_user() failure. - v2: https://lore.kernel.org/r/20240906-extensible-structs-check_fields-v2-0-0f46d2de9bad@cyphar.com Changes in v2: - Add CHECK_FIELDS support to mount_setattr(2). - Fix build failure on architectures with custom errno values. - Rework selftests to use the tools/ uAPI headers rather than custom defining EEXTSYS_NOOP. - Make sure we return -EINVAL and -E2BIG for invalid sizes even if CHECK_FIELDS is set, and add some tests for that. - v1: https://lore.kernel.org/r/20240902-extensible-structs-check_fields-v1-0-545e93ede2f2@cyphar.com
--- Aleksa Sarai (10): uaccess: add copy_struct_to_user helper sched_getattr: port to copy_struct_to_user openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) openat2: add CHECK_FIELDS flag to usize argument selftests: openat2: add 0xFF poisoned data after misaligned struct selftests: openat2: add CHECK_FIELDS selftests clone3: add CHECK_FIELDS flag to usize argument selftests: clone3: add CHECK_FIELDS selftests mount_setattr: add CHECK_FIELDS flag to usize argument selftests: mount_setattr: add CHECK_FIELDS selftest
arch/alpha/include/uapi/asm/errno.h | 3 + arch/mips/include/uapi/asm/errno.h | 3 + arch/parisc/include/uapi/asm/errno.h | 3 + arch/sparc/include/uapi/asm/errno.h | 3 + fs/namespace.c | 17 ++ fs/open.c | 18 ++ include/linux/uaccess.h | 97 ++++++++ include/uapi/asm-generic/errno.h | 3 + include/uapi/linux/openat2.h | 2 + kernel/fork.c | 30 ++- kernel/sched/syscalls.c | 42 +--- tools/arch/alpha/include/uapi/asm/errno.h | 3 + tools/arch/mips/include/uapi/asm/errno.h | 3 + tools/arch/parisc/include/uapi/asm/errno.h | 3 + tools/arch/sparc/include/uapi/asm/errno.h | 3 + tools/include/uapi/asm-generic/errno.h | 3 + tools/include/uapi/asm-generic/posix_types.h | 101 ++++++++ tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/clone3/Makefile | 4 +- .../testing/selftests/clone3/clone3_check_fields.c | 264 +++++++++++++++++++++ tools/testing/selftests/mount_setattr/Makefile | 2 +- .../selftests/mount_setattr/mount_setattr_test.c | 53 ++++- tools/testing/selftests/openat2/Makefile | 2 + tools/testing/selftests/openat2/openat2_test.c | 165 ++++++++++++- 24 files changed, 777 insertions(+), 51 deletions(-) --- base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652 change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
This is based on copy_struct_from_user(), but there is one additional case to consider when creating a syscall that returns an extensible-struct to userspace -- how should data in the struct that cannot fit into the userspace struct be handled (ksize > usize)?
There are three possibilies:
1. The interface is like sched_getattr(2), where new information will be silently not provided to userspace. This is probably what most interfaces will want to do, as it provides the most possible backwards-compatibility.
2. The interface is like lsm_list_modules(2), where you want to return an error like -EMSGSIZE if not providing information could result in the userspace program making a serious mistake (such as one that could lead to a security problem) or if you want to provide some flag to userspace so they know that they are missing some information.
3. The interface is like statx(2), where there some kind of a request mask that indicates what data userspace would like. One could imagine that statx2(2) (using extensible structs) would want to return -EMSGSIZE if the user explicitly requested a field that their structure is too small to fit, but not return an error if the field was not explicitly requested. This is kind of a mix between (1) and (2) based on the requested mask.
The copy_struct_to_user() helper includes a an extra argument that is used to return a boolean flag indicating whether there was a non-zero byte in the trailing bytes that were not copied to userspace. This can be used in the following ways to handle all three cases, respectively:
1. Just pass NULL, as you don't care about this case.
2. Return an error (say -EMSGSIZE) if the argument was set to true by copy_struct_to_user().
3. If the argument was set to true by copy_struct_to_user(), check if there is a flag that implies a field larger than usize.
This is the only case where callers of copy_struct_to_user() should check usize themselves. This will probably require scanning an array that specifies what flags were added for each version of the flags struct and returning an error if the request mask matches any of the flags that were added in versions of the struct that are larger than usize.
At the moment we don't have any users of (3), so this patch doesn't include any helpers to make the necessary scanning easier, but it should be fairly easy to add some if necessary.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- include/linux/uaccess.h | 97 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index d8e4105a2f21..cf0994605c10 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -387,6 +387,103 @@ copy_struct_from_user(void *dst, size_t ksize, const void __user *src, return 0; }
+/** + * copy_struct_to_user: copy a struct to userspace + * @dst: Destination address, in userspace. This buffer must be @ksize + * bytes long. + * @usize: (Alleged) size of @dst struct. + * @src: Source address, in kernel space. + * @ksize: Size of @src struct. + * @ignored_trailing: Set to %true if there was a non-zero byte in @src that + * userspace cannot see because they are using an smaller struct. + * + * Copies a struct from kernel space to userspace, in a way that guarantees + * backwards-compatibility for struct syscall arguments (as long as future + * struct extensions are made such that all new fields are *appended* to the + * old struct, and zeroed-out new fields have the same meaning as the old + * struct). + * + * Some syscalls may wish to make sure that userspace knows about everything in + * the struct, and if there is a non-zero value that userspce doesn't know + * about, they want to return an error (such as -EMSGSIZE) or have some other + * fallback (such as adding a "you're missing some information" flag). If + * @ignored_trailing is non-%NULL, it will be set to %true if there was a + * non-zero byte that could not be copied to userspace (ie. was past @usize). + * + * While unconditionally returning an error in this case is the simplest + * solution, for maximum backward compatibility you should try to only return + * -EMSGSIZE if the user explicitly requested the data that couldn't be copied. + * Note that structure sizes can change due to header changes and simple + * recompilations without code changes(!), so if you care about + * @ignored_trailing you probably want to make sure that any new field data is + * associated with a flag. Otherwise you might assume that a program knows + * about data it does not. + * + * @ksize is just sizeof(*src), and @usize should've been passed by userspace. + * The recommended usage is something like the following: + * + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) + * { + * int err; + * bool ignored_trailing; + * struct foo karg = {}; + * + * if (usize > PAGE_SIZE) + * return -E2BIG; + * if (usize < FOO_SIZE_VER0) + * return -EINVAL; + * + * // ... modify karg somehow ... + * + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg), + * &ignored_trailing); + * if (err) + * return err; + * if (ignored_trailing) + * return -EMSGSIZE: + * + * // ... + * } + * + * There are three cases to consider: + * * If @usize == @ksize, then it's copied verbatim. + * * If @usize < @ksize, then the kernel is trying to pass userspace a newer + * struct than it supports. Thus we only copy the interoperable portions + * (@usize) and ignore the rest (but @ignored_trailing is set to %true if + * any of the trailing (@ksize - @usize) bytes are non-zero). + * * If @usize > @ksize, then the kernel is trying to pass userspace an older + * struct than userspace supports. In order to make sure the + * unknown-to-the-kernel fields don't contain garbage values, we zero the + * trailing (@usize - @ksize) bytes. + * + * Returns (in all cases, some data may have been copied): + * * -EFAULT: access to userspace failed. + */ +static __always_inline __must_check int +copy_struct_to_user(void __user *dst, size_t usize, const void *src, + size_t ksize, bool *ignored_trailing) +{ + size_t size = min(ksize, usize); + size_t rest = max(ksize, usize) - size; + + /* Double check if ksize is larger than a known object size. */ + if (WARN_ON_ONCE(ksize > __builtin_object_size(src, 1))) + return -E2BIG; + + /* Deal with trailing bytes. */ + if (usize > ksize) { + if (clear_user(dst + size, rest)) + return -EFAULT; + } + if (ignored_trailing) + *ignored_trailing = ksize < usize && + memchr_inv(src + size, 0, rest) != NULL; + /* Copy the interoperable parts of the struct. */ + if (copy_to_user(dst, src, size)) + return -EFAULT; + return 0; +} + bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size);
long copy_from_kernel_nofault(void *dst, const void *src, size_t size);
sched_getattr(2) doesn't care about trailing non-zero bytes in the (ksize > usize) case, so just use copy_struct_to_user() without checking ignored_trailing.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- kernel/sched/syscalls.c | 42 ++---------------------------------------- 1 file changed, 2 insertions(+), 40 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index ae1b42775ef9..4ccc058bae16 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -1147,45 +1147,6 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param) return copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; }
-/* - * Copy the kernel size attribute structure (which might be larger - * than what user-space knows about) to user-space. - * - * Note that all cases are valid: user-space buffer can be larger or - * smaller than the kernel-space buffer. The usual case is that both - * have the same size. - */ -static int -sched_attr_copy_to_user(struct sched_attr __user *uattr, - struct sched_attr *kattr, - unsigned int usize) -{ - unsigned int ksize = sizeof(*kattr); - - if (!access_ok(uattr, usize)) - return -EFAULT; - - /* - * sched_getattr() ABI forwards and backwards compatibility: - * - * If usize == ksize then we just copy everything to user-space and all is good. - * - * If usize < ksize then we only copy as much as user-space has space for, - * this keeps ABI compatibility as well. We skip the rest. - * - * If usize > ksize then user-space is using a newer version of the ABI, - * which part the kernel doesn't know about. Just ignore it - tooling can - * detect the kernel's knowledge of attributes from the attr->size value - * which is set to ksize in this case. - */ - kattr->size = min(usize, ksize); - - if (copy_to_user(uattr, kattr, kattr->size)) - return -EFAULT; - - return 0; -} - /** * sys_sched_getattr - similar to sched_getparam, but with sched_attr * @pid: the pid in question. @@ -1230,7 +1191,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, #endif }
- return sched_attr_copy_to_user(uattr, &kattr, usize); + kattr.size = min(usize, sizeof(kattr)); + return copy_struct_to_user(uattr, usize, &kattr, sizeof(kattr), NULL); }
#ifdef CONFIG_SMP
* Aleksa Sarai:
sched_getattr(2) doesn't care about trailing non-zero bytes in the (ksize > usize) case, so just use copy_struct_to_user() without checking ignored_trailing.
I think this is what causes glibc's misc/tst-sched_setattr test to fail on recent kernels. The previous non-modifying behavior was documented in the manual page:
If the caller-provided attr buffer is larger than the kernel's sched_attr structure, the additional bytes in the user-space structure are not touched.
I can just drop this part of the test if the kernel deems both behaviors valid.
Thanks, Florian
On Tue, Dec 10, 2024 at 07:14:07PM +0100, Florian Weimer wrote:
- Aleksa Sarai:
sched_getattr(2) doesn't care about trailing non-zero bytes in the (ksize > usize) case, so just use copy_struct_to_user() without checking ignored_trailing.
I think this is what causes glibc's misc/tst-sched_setattr test to fail on recent kernels. The previous non-modifying behavior was documented in the manual page:
If the caller-provided attr buffer is larger than the kernel's sched_attr structure, the additional bytes in the user-space structure are not touched.
I can just drop this part of the test if the kernel deems both behaviors valid.
I think in general both behaviors are valid but I would consider zeroing the unknown parts of the provided buffer to be the safer option. And all newer extensible struct system calls do that.
But if sched_getattr(2) wants to keep its old behavior it wouldn't be a problem to just handle this case:
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 0d71fcbaf1e3..46140ec449ba 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -1126,6 +1126,15 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, }
kattr.size = min(usize, sizeof(kattr)); + /* + * If userspace passed a larger structure than the kernel knows + * we historically didn't zero the unknown bits but + * copy_struct_to_user() will. Retain the old behavior by + * limiting the copy_to_user() to the size the kernel knows + * about. + */ + if (usize > sizeof(kattr)) + usize = sizeof(kattr); return copy_struct_to_user(uattr, usize, &kattr, sizeof(kattr), NULL); }
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Cc: stable@vger.kernel.org # v5.6+ Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/open.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/open.c b/fs/open.c index 22adbef7ecc2..30bfcddd505d 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1458,6 +1458,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL; + if (unlikely(usize > PAGE_SIZE)) + return -E2BIG;
err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); if (err)
On Thu, Oct 10, 2024 at 07:40:36AM +1100, Aleksa Sarai wrote:
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Cc: stable@vger.kernel.org # v5.6+ Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/open.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/open.c b/fs/open.c index 22adbef7ecc2..30bfcddd505d 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1458,6 +1458,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename, if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); if (err)
-- 2.46.1
Why isn't this just sent as a normal fix to be included now and not burried in a RFC series?
thanks,
greg k-h
On Thu, 10 Oct 2024 07:40:36 +1100, Aleksa Sarai wrote:
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Applied to the vfs.fixes branch of the vfs/vfs.git tree. Patches in the vfs.fixes branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.fixes
[03/10] openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) https://git.kernel.org/vfs/vfs/c/f92f0a1b0569
In order for userspace to be able to know what flags and fields the kernel supports, it is currently necessary for them to do a bunch of fairly subtle self-checks where you need to get a syscall to return a non-EINVAL error or no-op for each flag you wish to use. If you get -EINVAL you know the flag is unsupported, otherwise you know it is supported.
This doesn't scale well for programs that need to check many flags, and not all syscalls can be easily checked (how would you check for new flags for umount2 or clone3 without side-effects?). To solve this problem, we can take advantage of the extensible struct API used by copy_struct_from_user() by providing a special CHECK_FIELDS flag to extensible struct syscalls (like openat2 and clone3) which will:
1. Cause the syscall to fill the structure with every valid bit the kernel understands. For flag arguments, this is the set of all valid flag bits. For pointer and file descriptor arguments, this would be all 0xFF bits (to indicate that any bits are valid). Userspace can then easily check whether the flag they wanted is supported (by doing a simple bitwise AND) or if a field itself is supported (by checking if it is non-zero / all 0xFF).
2. Return a specific no-op error (-EEXTSYS_NOOP) that is not used as an error by any other kernel code, so that userspace can be absolutely sure that the kernel supports CHECK_FIELDS.
Rather than passing CHECK_FIELDS using the standard flags arguments for the syscall, CHECK_FIELDS is instead the highest bit in the provided struct size. The high bits of the size are never going to be non-zero (we currently only allow size to be up to PAGE_SIZE, and it seems very unlikely we will ever allow several exabyte structure arguments).
By passing the flag in the structure size, we can be sure that old kernels will return a consistent error code (-EFAULT in openat2's case) and that seccomp can properly filter this syscall mode (which is guaranteed to be a no-op on all kernels -- it could even force -EEXTSYS_NOOP to make the userspace program think the kernel doesn't support any syscall features).
The intended way of using this interface to get feature information looks something like the following (imagine that openat2 has gained a new field and a new flag in the future):
static bool openat2_no_automount_supported; static bool openat2_cwd_fd_supported;
int check_openat2_support(void) { int err; struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT); openat2_cwd_fd_supported = (how.cwd_fd != 0); break; } }
Link: https://youtu.be/ggD-eb3yPVs Link: https://lwn.net/Articles/830666/ Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- arch/alpha/include/uapi/asm/errno.h | 3 +++ arch/mips/include/uapi/asm/errno.h | 3 +++ arch/parisc/include/uapi/asm/errno.h | 3 +++ arch/sparc/include/uapi/asm/errno.h | 3 +++ fs/open.c | 16 ++++++++++++++++ include/uapi/asm-generic/errno.h | 3 +++ include/uapi/linux/openat2.h | 2 ++ tools/arch/alpha/include/uapi/asm/errno.h | 3 +++ tools/arch/mips/include/uapi/asm/errno.h | 3 +++ tools/arch/parisc/include/uapi/asm/errno.h | 3 +++ tools/arch/sparc/include/uapi/asm/errno.h | 3 +++ tools/include/uapi/asm-generic/errno.h | 3 +++ 12 files changed, 48 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h index 3d265f6babaf..41157139fdad 100644 --- a/arch/alpha/include/uapi/asm/errno.h +++ b/arch/alpha/include/uapi/asm/errno.h @@ -125,4 +125,7 @@
#define EHWPOISON 139 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 140 /* Extensible syscall performed no operation */ + #endif diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h index 2fb714e2d6d8..dd1e0ba61105 100644 --- a/arch/mips/include/uapi/asm/errno.h +++ b/arch/mips/include/uapi/asm/errno.h @@ -124,6 +124,9 @@
#define EHWPOISON 168 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 169 /* Extensible syscall performed no operation */ + #define EDQUOT 1133 /* Quota exceeded */
diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h index 8d94739d75c6..e2f6998ad61c 100644 --- a/arch/parisc/include/uapi/asm/errno.h +++ b/arch/parisc/include/uapi/asm/errno.h @@ -122,4 +122,7 @@
#define EHWPOISON 257 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 258 /* Extensible syscall performed no operation */ + #endif diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h index 81a732b902ee..1b11e4651093 100644 --- a/arch/sparc/include/uapi/asm/errno.h +++ b/arch/sparc/include/uapi/asm/errno.h @@ -115,4 +115,7 @@
#define EHWPOISON 135 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 136 /* Extensible syscall performed no operation */ + #endif diff --git a/fs/open.c b/fs/open.c index 30bfcddd505d..3fbb0c82fece 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1451,16 +1451,32 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename, struct open_how __user *, how, size_t, usize) { int err; + bool check_fields; struct open_how tmp;
BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0); BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST);
+ check_fields = usize & CHECK_FIELDS; + usize &= ~CHECK_FIELDS; + if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL; if (unlikely(usize > PAGE_SIZE)) return -E2BIG;
+ if (unlikely(check_fields)) { + memset(&tmp, 0, sizeof(tmp)); + tmp = (struct open_how) { + .flags = VALID_OPEN_FLAGS, + .mode = S_IALLUGO, + .resolve = VALID_RESOLVE_FLAGS, + }; + + err = copy_struct_to_user(how, usize, &tmp, sizeof(tmp), NULL); + return err ?: -EEXTSYS_NOOP; + } + err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); if (err) return err; diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h index cf9c51ac49f9..f5bfe081e73a 100644 --- a/include/uapi/asm-generic/errno.h +++ b/include/uapi/asm-generic/errno.h @@ -120,4 +120,7 @@
#define EHWPOISON 133 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 134 /* Extensible syscall performed no operation */ + #endif diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index a5feb7604948..6052a504cfa4 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -4,6 +4,8 @@
#include <linux/types.h>
+#define CHECK_FIELDS (1ULL << 63) + /* * Arguments for how openat2(2) should open the target path. If only @flags and * @mode are non-zero, then openat2(2) operates very similarly to openat(2). diff --git a/tools/arch/alpha/include/uapi/asm/errno.h b/tools/arch/alpha/include/uapi/asm/errno.h index 3d265f6babaf..41157139fdad 100644 --- a/tools/arch/alpha/include/uapi/asm/errno.h +++ b/tools/arch/alpha/include/uapi/asm/errno.h @@ -125,4 +125,7 @@
#define EHWPOISON 139 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 140 /* Extensible syscall performed no operation */ + #endif diff --git a/tools/arch/mips/include/uapi/asm/errno.h b/tools/arch/mips/include/uapi/asm/errno.h index 2fb714e2d6d8..dd1e0ba61105 100644 --- a/tools/arch/mips/include/uapi/asm/errno.h +++ b/tools/arch/mips/include/uapi/asm/errno.h @@ -124,6 +124,9 @@
#define EHWPOISON 168 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 169 /* Extensible syscall performed no operation */ + #define EDQUOT 1133 /* Quota exceeded */
diff --git a/tools/arch/parisc/include/uapi/asm/errno.h b/tools/arch/parisc/include/uapi/asm/errno.h index 8d94739d75c6..e2f6998ad61c 100644 --- a/tools/arch/parisc/include/uapi/asm/errno.h +++ b/tools/arch/parisc/include/uapi/asm/errno.h @@ -122,4 +122,7 @@
#define EHWPOISON 257 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 258 /* Extensible syscall performed no operation */ + #endif diff --git a/tools/arch/sparc/include/uapi/asm/errno.h b/tools/arch/sparc/include/uapi/asm/errno.h index 81a732b902ee..1b11e4651093 100644 --- a/tools/arch/sparc/include/uapi/asm/errno.h +++ b/tools/arch/sparc/include/uapi/asm/errno.h @@ -115,4 +115,7 @@
#define EHWPOISON 135 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 136 /* Extensible syscall performed no operation */ + #endif diff --git a/tools/include/uapi/asm-generic/errno.h b/tools/include/uapi/asm-generic/errno.h index cf9c51ac49f9..f5bfe081e73a 100644 --- a/tools/include/uapi/asm-generic/errno.h +++ b/tools/include/uapi/asm-generic/errno.h @@ -120,4 +120,7 @@
#define EHWPOISON 133 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 134 /* Extensible syscall performed no operation */ + #endif
We should also verify that poisoned data after a misaligned struct is also handled correctly by is_zeroed_user(). This test passes with no kernel changes needed, so is_zeroed_user() was correct already.
Fixes: b28a10aedcd4 ("selftests: add openat2(2) selftests") Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/openat2/openat2_test.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c index 5790ab446527..4ca175a16ad6 100644 --- a/tools/testing/selftests/openat2/openat2_test.c +++ b/tools/testing/selftests/openat2/openat2_test.c @@ -112,9 +112,9 @@ void test_openat2_struct(void) * * This is effectively to check that is_zeroed_user() works. */ - copy = malloc(misalign + sizeof(how_ext)); + copy = malloc(misalign*2 + sizeof(how_ext)); how_copy = copy + misalign; - memset(copy, 0xff, misalign); + memset(copy, 0xff, misalign*2 + sizeof(how_ext)); memcpy(how_copy, &how_ext, sizeof(how_ext)); }
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/openat2/Makefile | 2 + tools/testing/selftests/openat2/openat2_test.c | 161 ++++++++++++++++++++++++- 2 files changed, 161 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile index 185dc76ebb5f..239a7fd0cb35 100644 --- a/tools/testing/selftests/openat2/Makefile +++ b/tools/testing/selftests/openat2/Makefile @@ -12,6 +12,8 @@ ifeq ($(LLVM),) endif
LOCAL_HDRS += helpers.h +# Needed for EEXTSYS_NOOP definition. +CFLAGS += $(TOOLS_INCLUDES)
include ../lib.mk
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c index 4ca175a16ad6..9d3382bdfce9 100644 --- a/tools/testing/selftests/openat2/openat2_test.c +++ b/tools/testing/selftests/openat2/openat2_test.c @@ -1,11 +1,12 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* * Author: Aleksa Sarai cyphar@cyphar.com - * Copyright (C) 2018-2019 SUSE LLC. + * Copyright (C) 2018-2024 SUSE LLC. */
#define _GNU_SOURCE #define __SANE_USERSPACE_TYPES__ // Use ll64 +#include <errno.h> #include <fcntl.h> #include <sched.h> #include <sys/stat.h> @@ -29,6 +30,10 @@ #define O_LARGEFILE 0x8000 #endif
+#ifndef CHECK_FIELDS +#define CHECK_FIELDS (1ULL << 63) +#endif + struct open_how_ext { struct open_how inner; uint32_t extra1; @@ -45,6 +50,152 @@ struct struct_test { int err; };
+static bool check(bool *failed, bool pred) +{ + *failed |= pred; + return pred; +} + +#define NUM_OPENAT2_CHECK_FIELDS_BAD_TESTS 2 + +static void test_openat2_check_fields_bad(const char *name, size_t struct_size) +{ + int fd, expected_err; + bool failed = false; + struct open_how how = {}; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + if (struct_size < sizeof(struct open_how)) + expected_err = -EINVAL; + if (struct_size > 4096) + expected_err = -E2BIG; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + fd = raw_openat2(AT_FDCWD, "", &how, CHECK_FIELDS | struct_size); + if (check(&failed, fd != expected_err)) + ksft_print_msg("openat2(CHECK_FIELDS) returned wrong error code: %d (%s) != %d (%s)\n", + fd, strerror(-fd), + expected_err, strerror(-expected_err)); + + if (fd >= 0) + close(fd); + + if (failed) + resultfn = ksft_test_result_fail; + +skip: + resultfn("openat2(CHECK_FIELDS) with %s returns %d (%s)\n", name, + expected_err, strerror(-expected_err)); + fflush(stdout); +} + +#define NUM_OPENAT2_CHECK_FIELDS_TESTS 1 +#define NUM_OPENAT2_CHECK_FIELDS_VARIATIONS 13 + +static void test_openat2_check_fields(void) +{ + int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 }; + + for (int i = 0; i < ARRAY_LEN(misalignments); i++) { + int fd, misalign = misalignments[i]; + bool failed = false; + char *fdpath = NULL; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + struct open_how_ext how_ext = {}, *how_copy = &how_ext; + void *copy = NULL; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + if (misalign) { + /* + * Explicitly misalign the structure copying it with + * the given (mis)alignment offset. The other data is + * set to zero and we verify this afterwards to make + * sure CHECK_FIELDS doesn't write outside the buffer. + */ + copy = malloc(misalign*2 + sizeof(how_ext)); + how_copy = copy + misalign; + memset(copy, 0x00, misalign*2 + sizeof(how_ext)); + memcpy(how_copy, &how_ext, sizeof(how_ext)); + } + + fd = raw_openat2(AT_FDCWD, ".", how_copy, CHECK_FIELDS | sizeof(*how_copy)); + if (check(&failed, (fd != -EEXTSYS_NOOP))) + ksft_print_msg("openat2(CHECK_FIELDS) returned wrong error code: %d (%s) != %d\n", + fd, strerror(-fd), -EEXTSYS_NOOP); + if (fd >= 0) { + fdpath = fdreadlink(fd); + close(fd); + } + + if (failed) { + ksft_print_msg("openat2(CHECK_FIELDS) unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s']\n", fd, fdpath); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + + if (check(&failed, !(how_copy->inner.flags & O_PATH))) + ksft_print_msg("openat2(CHECK_FIELDS) returned flags is missing O_PATH (0x%.16x): 0x%.16llx\n", + O_PATH, how_copy->inner.flags); + + if (check(&failed, (how_copy->inner.mode != 07777))) + ksft_print_msg("openat2(CHECK_FIELDS) returned mode is invalid (0o%o): 0o%.4llo\n", + 07777, how_copy->inner.mode); + + if (check(&failed, !(how_copy->inner.resolve & RESOLVE_IN_ROOT))) + ksft_print_msg("openat2(CHECK_FIELDS) returned resolve flags is missing RESOLVE_IN_ROOT (0x%.16x): 0x%.16llx\n", + RESOLVE_IN_ROOT, how_copy->inner.resolve); + + /* Verify that the buffer space outside the struct wasn't written to. */ + if (check(&failed, how_copy->extra1 != 0)) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside open_how (extra1): 0x%x\n", + how_copy->extra1); + if (check(&failed, how_copy->extra2 != 0)) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside open_how (extra2): 0x%x\n", + how_copy->extra2); + if (check(&failed, how_copy->extra3 != 0)) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside open_how (extra3): 0x%x\n", + how_copy->extra3); + + if (misalign) { + for (size_t i = 0; i < misalign; i++) { + char *p = copy + i; + if (check(&failed, *p != '\x00')) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside the size: buffer[%ld] = 0x%.2x\n", + p - (char *) copy, *p); + } + for (size_t i = 0; i < misalign; i++) { + char *p = copy + misalign + sizeof(how_ext) + i; + if (check(&failed, *p != '\x00')) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside the size: buffer[%ld] = 0x%.2x\n", + p - (char *) copy, *p); + } + } + + if (failed) + resultfn = ksft_test_result_fail; + +skip: + resultfn("openat2(CHECK_FIELDS) [misalign=%d]\n", misalign); + + free(copy); + free(fdpath); + fflush(stdout); + } +} + #define NUM_OPENAT2_STRUCT_TESTS 7 #define NUM_OPENAT2_STRUCT_VARIATIONS 13
@@ -320,7 +471,9 @@ void test_openat2_flags(void) } }
-#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \ +#define NUM_TESTS (NUM_OPENAT2_CHECK_FIELDS_TESTS * NUM_OPENAT2_CHECK_FIELDS_VARIATIONS + \ + NUM_OPENAT2_CHECK_FIELDS_BAD_TESTS + \ + NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \ NUM_OPENAT2_FLAG_TESTS)
int main(int argc, char **argv) @@ -328,6 +481,10 @@ int main(int argc, char **argv) ksft_print_header(); ksft_set_plan(NUM_TESTS);
+ test_openat2_check_fields(); + test_openat2_check_fields_bad("small struct (< v0)", 0); + test_openat2_check_fields_bad("large struct (> PAGE_SIZE)", 0xF000); + test_openat2_struct(); test_openat2_flags();
As with openat2(2), this allows userspace to easily figure out what flags and fields are supported by clone3(2). For fields which are not flag-based, we simply set every bit in the field so that a naive bitwise-and would show that any value of the field is valid.
For args->exit_signal, since we have an explicit bitmask for the field defined already (CSIGNAL) we can indicate that only those bits are supported by current kernels. If we add some extra bits to exit_signal in the future, being able to detect them as new features would be quite useful.
The intended way of using this interface to get feature information looks something like the following:
static bool clone3_clear_sighand_supported; static bool clone3_cgroup_supported;
int check_clone3_support(void) { int err; struct clone_args args = {};
err = clone3(&args, CHECK_FIELDS | sizeof(args)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: clone3_clear_sighand_supported = (how.flags & CLONE_CLEAR_SIGHAND); clone3_cgroup_supported = (how.flags & CLONE_INTO_CGROUP) && (how.cgroup != 0); break; } }
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- kernel/fork.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c index cc760491f201..bded93f4f5f4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2925,11 +2925,15 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, } #endif
+ +#define CLONE3_VALID_FLAGS (CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP) + noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, struct clone_args __user *uargs, size_t usize) { int err; + bool check_fields; struct clone_args args; pid_t *kset_tid = kargs->set_tid;
@@ -2941,11 +2945,34 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, CLONE_ARGS_SIZE_VER2); BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+ check_fields = usize & CHECK_FIELDS; + usize &= ~CHECK_FIELDS; + if (unlikely(usize > PAGE_SIZE)) return -E2BIG; if (unlikely(usize < CLONE_ARGS_SIZE_VER0)) return -EINVAL;
+ if (unlikely(check_fields)) { + memset(&args, 0, sizeof(args)); + args = (struct clone_args) { + .flags = CLONE3_VALID_FLAGS, + .pidfd = 0xFFFFFFFFFFFFFFFF, + .child_tid = 0xFFFFFFFFFFFFFFFF, + .parent_tid = 0xFFFFFFFFFFFFFFFF, + .exit_signal = (u64) CSIGNAL, + .stack = 0xFFFFFFFFFFFFFFFF, + .stack_size = 0xFFFFFFFFFFFFFFFF, + .tls = 0xFFFFFFFFFFFFFFFF, + .set_tid = 0xFFFFFFFFFFFFFFFF, + .set_tid_size = 0xFFFFFFFFFFFFFFFF, + .cgroup = 0xFFFFFFFFFFFFFFFF, + }; + + err = copy_struct_to_user(uargs, usize, &args, sizeof(args), NULL); + return err ?: -EEXTSYS_NOOP; + } + err = copy_struct_from_user(&args, sizeof(args), uargs, usize); if (err) return err; @@ -3025,8 +3052,7 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs) static bool clone3_args_valid(struct kernel_clone_args *kargs) { /* Verify that no unknown flags are passed along. */ - if (kargs->flags & - ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP)) + if (kargs->flags & ~CLONE3_VALID_FLAGS) return false;
/*
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/clone3/Makefile | 4 +- .../testing/selftests/clone3/clone3_check_fields.c | 264 +++++++++++++++++++++ 3 files changed, 267 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore index 83c0f6246055..4ec3e1ecd273 100644 --- a/tools/testing/selftests/clone3/.gitignore +++ b/tools/testing/selftests/clone3/.gitignore @@ -3,3 +3,4 @@ clone3 clone3_clear_sighand clone3_set_tid clone3_cap_checkpoint_restore +clone3_check_fields diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile index 84832c369a2e..37141ca13f7c 100644 --- a/tools/testing/selftests/clone3/Makefile +++ b/tools/testing/selftests/clone3/Makefile @@ -1,8 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 -CFLAGS += -g -std=gnu99 $(KHDR_INCLUDES) +CFLAGS += -g -std=gnu99 $(KHDR_INCLUDES) $(TOOLS_INCLUDES) LDLIBS += -lcap
TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \ - clone3_cap_checkpoint_restore + clone3_cap_checkpoint_restore clone3_check_fields
include ../lib.mk diff --git a/tools/testing/selftests/clone3/clone3_check_fields.c b/tools/testing/selftests/clone3/clone3_check_fields.c new file mode 100644 index 000000000000..477604f9a3fb --- /dev/null +++ b/tools/testing/selftests/clone3/clone3_check_fields.c @@ -0,0 +1,264 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2024 SUSE LLC + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <inttypes.h> +#include <linux/types.h> +#include <linux/sched.h> +#include <stdbool.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/un.h> +#include <sys/wait.h> +#include <unistd.h> +#include <sched.h> + +#include "../kselftest.h" +#include "clone3_selftests.h" + +#ifndef CHECK_FIELDS +#define CHECK_FIELDS (1ULL << 63) +#endif + +struct __clone_args_v0 { + __aligned_u64 flags; + __aligned_u64 pidfd; + __aligned_u64 child_tid; + __aligned_u64 parent_tid; + __aligned_u64 exit_signal; + __aligned_u64 stack; + __aligned_u64 stack_size; + __aligned_u64 tls; +}; + +struct __clone_args_v1 { + __aligned_u64 flags; + __aligned_u64 pidfd; + __aligned_u64 child_tid; + __aligned_u64 parent_tid; + __aligned_u64 exit_signal; + __aligned_u64 stack; + __aligned_u64 stack_size; + __aligned_u64 tls; + __aligned_u64 set_tid; + __aligned_u64 set_tid_size; +}; + +struct __clone_args_v2 { + __aligned_u64 flags; + __aligned_u64 pidfd; + __aligned_u64 child_tid; + __aligned_u64 parent_tid; + __aligned_u64 exit_signal; + __aligned_u64 stack; + __aligned_u64 stack_size; + __aligned_u64 tls; + __aligned_u64 set_tid; + __aligned_u64 set_tid_size; + __aligned_u64 cgroup; +}; + +static int call_clone3(void *clone_args, size_t size) +{ + int status; + pid_t pid; + + pid = sys_clone3(clone_args, size); + if (pid < 0) + return -errno; + + if (pid == 0) { + ksft_print_msg("I am the child, my PID is %d\n", getpid()); + _exit(EXIT_SUCCESS); + } + + ksft_print_msg("I am the parent (%d). My child's pid is %d\n", + getpid(), pid); + + if (waitpid(-1, &status, __WALL) < 0) { + ksft_print_msg("waitpid() returned %s\n", strerror(errno)); + return -errno; + } + if (!WIFEXITED(status)) { + ksft_print_msg("Child did not exit normally, status 0x%x\n", + status); + return EXIT_FAILURE; + } + if (WEXITSTATUS(status)) + return WEXITSTATUS(status); + + return 0; +} + +static bool check(bool *failed, bool pred) +{ + *failed |= pred; + return pred; +} + +static void test_clone3_check_fields(const char *test_name, size_t struct_size) +{ + size_t bufsize; + void *buffer; + pid_t pid; + bool failed = false; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + /* Allocate some bytes after clone_args to verify they are cleared. */ + bufsize = struct_size + 16; + buffer = malloc(bufsize); + /* Set the structure to dummy values. */ + memset(buffer, 0, bufsize); + memset(buffer, 0xAB, struct_size); + + pid = call_clone3(buffer, CHECK_FIELDS | struct_size); + if (check(&failed, (pid != -EEXTSYS_NOOP))) + ksft_print_msg("clone3(CHECK_FIELDS) returned the wrong error code: %d (%s) != %d\n", + pid, strerror(-pid), -EEXTSYS_NOOP); + + switch (struct_size) { + case sizeof(struct __clone_args_v2): { + struct __clone_args_v2 *args = buffer; + + if (check(&failed, (args->cgroup != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong cgroup field: 0x%.16llx != 0x%.16llx\n", + args->cgroup, 0xFFFFFFFFFFFFFFFF); + + /* fallthrough; */ + } + case sizeof(struct __clone_args_v1): { + struct __clone_args_v1 *args = buffer; + + if (check(&failed, (args->set_tid != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong set_tid field: 0x%.16llx != 0x%.16llx\n", + args->set_tid, 0xFFFFFFFFFFFFFFFF); + if (check(&failed, (args->set_tid_size != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong set_tid_size field: 0x%.16llx != 0x%.16llx\n", + args->set_tid_size, 0xFFFFFFFFFFFFFFFF); + + /* fallthrough; */ + } + case sizeof(struct __clone_args_v0): { + struct __clone_args_v0 *args = buffer; + + if (check(&failed, !(args->flags & CLONE_NEWUSER))) + ksft_print_msg("clone3(CHECK_FIELDS) is missing CLONE_NEWUSER in flags: 0x%.16llx (0x%.16llx)\n", + args->flags, CLONE_NEWUSER); + if (check(&failed, !(args->flags & CLONE_THREAD))) + ksft_print_msg("clone3(CHECK_FIELDS) is missing CLONE_THREAD in flags: 0x%.16llx (0x%.16llx)\n", + args->flags, CLONE_THREAD); + /* + * CLONE_INTO_CGROUP was added in v2, but it will be set even + * with smaller structure sizes. + */ + if (check(&failed, !(args->flags & CLONE_INTO_CGROUP))) + ksft_print_msg("clone3(CHECK_FIELDS) is missing CLONE_INTO_CGROUP in flags: 0x%.16llx (0x%.16llx)\n", + args->flags, CLONE_INTO_CGROUP); + + if (check(&failed, (args->exit_signal != 0xFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong exit_signal field: 0x%.16llx != 0x%.16llx\n", + args->exit_signal, 0xFF); + + if (check(&failed, (args->stack != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong stack field: 0x%.16llx != 0x%.16llx\n", + args->stack, 0xFFFFFFFFFFFFFFFF); + if (check(&failed, (args->stack_size != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong stack_size field: 0x%.16llx != 0x%.16llx\n", + args->stack_size, 0xFFFFFFFFFFFFFFFF); + if (check(&failed, (args->tls != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong tls field: 0x%.16llx != 0x%.16llx\n", + args->tls, 0xFFFFFFFFFFFFFFFF); + + break; + } + default: + fprintf(stderr, "INVALID STRUCTURE SIZE: %d\n", struct_size); + abort(); + } + + /* Verify that the trailing parts of the buffer are still 0. */ + for (size_t i = struct_size; i < bufsize; i++) { + char ch = ((char *)buffer)[i]; + if (check(&failed, (ch != '\x00'))) + ksft_print_msg("clone3(CHECK_FIELDS) touched a byte outside the size: buffer[%d] = 0x%.2x\n", + i, ch); + } + + if (failed) + resultfn = ksft_test_result_fail; + + resultfn("clone3(CHECK_FIELDS) with %s\n", test_name); + free(buffer); +} + +struct check_fields_test { + const char *name; + size_t struct_size; +}; + +static struct check_fields_test check_fields_tests[] = { + {"struct v0", sizeof(struct __clone_args_v0)}, + {"struct v1", sizeof(struct __clone_args_v1)}, + {"struct v2", sizeof(struct __clone_args_v2)}, +}; + +static int test_clone3_check_fields_badsize(const char *test_name, + size_t struct_size) +{ + void *buffer; + pid_t pid; + bool failed = false; + int expected_err; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + buffer = malloc(struct_size); + memset(buffer, 0xAB, struct_size); + + if (struct_size < sizeof(struct __clone_args_v0)) + expected_err = -EINVAL; + if (struct_size > 4096) + expected_err = -E2BIG; + + pid = call_clone3(buffer, CHECK_FIELDS | struct_size); + if (check(&failed, (pid != expected_err))) + ksft_print_msg("clone3(CHECK_FIELDS) returned the wrong error code: %d (%s) != %d (%s)\n", + pid, strerror(-pid), + expected_err, strerror(-expected_err)); + + if (failed) + resultfn = ksft_test_result_fail; + + resultfn("clone3(CHECK_FIELDS) with %s returns %d (%s)\n", test_name, + expected_err, strerror(-expected_err)); + free(buffer); +} + +static struct check_fields_test bad_size_tests[] = { + {"short struct (< v0)", 1}, + {"long struct (> PAGE_SIZE)", 0xF000}, +}; + +int main(void) +{ + ksft_print_header(); + ksft_set_plan(ARRAY_SIZE(check_fields_tests) + ARRAY_SIZE(bad_size_tests)); + test_clone3_supported(); + + for (int i = 0; i < ARRAY_SIZE(check_fields_tests); i++) { + struct check_fields_test *test = &check_fields_tests[i]; + test_clone3_check_fields(test->name, test->struct_size); + } + for (int i = 0; i < ARRAY_SIZE(bad_size_tests); i++) { + struct check_fields_test *test = &bad_size_tests[i]; + test_clone3_check_fields_badsize(test->name, test->struct_size); + } + + ksft_finished(); +}
As with openat2(2), this allows userspace to easily figure out what flags and fields are supported by mount_setattr(2). As with clone3(2), for fields which are not flag-based, we simply set every bit in the field so that a naive bitwise-and would show that any value of the field is valid.
The intended way of using this interface to get feature information looks something like the following:
static bool mountattr_nosymfollow_supported; static bool mountattr_idmap_supported;
int check_clone3_support(void) { int err; struct mount_attr attr = {};
err = mount_attr(-EBADF, "", 0, &args, CHECK_FIELDS | sizeof(args)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: mountattr_nosymfollow_supported = ((attr.attr_clr | attr.attr_set) & MOUNT_ATTR_NOSYMFOLLOW); mountattr_idmap_supported = ((attr.attr_clr | attr.attr_set) & MOUNT_ATTR_IDMAP) && (attr.userns_fd != 0); break; } }
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namespace.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
diff --git a/fs/namespace.c b/fs/namespace.c index 328087a4df8a..c7ae8d96b7b7 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4771,6 +4771,7 @@ SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path, size_t, usize) { int err; + bool check_fields; struct path target; struct mount_attr attr; struct mount_kattr kattr; @@ -4783,11 +4784,27 @@ SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path, AT_NO_AUTOMOUNT)) return -EINVAL;
+ check_fields = usize & CHECK_FIELDS; + usize &= ~CHECK_FIELDS; + if (unlikely(usize > PAGE_SIZE)) return -E2BIG; if (unlikely(usize < MOUNT_ATTR_SIZE_VER0)) return -EINVAL;
+ if (unlikely(check_fields)) { + memset(&attr, 0, sizeof(attr)); + attr = (struct mount_attr) { + .attr_set = MOUNT_SETATTR_VALID_FLAGS, + .attr_clr = MOUNT_SETATTR_VALID_FLAGS, + .propagation = MOUNT_SETATTR_PROPAGATION_FLAGS, + .userns_fd = 0xFFFFFFFFFFFFFFFF, + }; + + err = copy_struct_to_user(uattr, usize, &attr, sizeof(attr), NULL); + return err ?: -EEXTSYS_NOOP; + } + if (!may_mount()) return -EPERM;
While we're at it -- to make testing for error codes with ASSERT_* easier, make the sys_* wrappers return -errno in case of an error.
For some reason, the <sys/sysinfo.h> include doesn't correct import <asm/posix_types.h>, leading to compilation errors -- so just import <asm/posix_types.h> manually.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/include/uapi/asm-generic/posix_types.h | 101 +++++++++++++++++++++ tools/testing/selftests/mount_setattr/Makefile | 2 +- .../selftests/mount_setattr/mount_setattr_test.c | 53 ++++++++++- 3 files changed, 153 insertions(+), 3 deletions(-)
diff --git a/tools/include/uapi/asm-generic/posix_types.h b/tools/include/uapi/asm-generic/posix_types.h new file mode 100644 index 000000000000..b5f7594eee7a --- /dev/null +++ b/tools/include/uapi/asm-generic/posix_types.h @@ -0,0 +1,101 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef __ASM_GENERIC_POSIX_TYPES_H +#define __ASM_GENERIC_POSIX_TYPES_H + +#include <asm/bitsperlong.h> +/* + * This file is generally used by user-level software, so you need to + * be a little careful about namespace pollution etc. + * + * First the types that are often defined in different ways across + * architectures, so that you can override them. + */ + +#ifndef __kernel_long_t +typedef long __kernel_long_t; +typedef unsigned long __kernel_ulong_t; +#endif + +#ifndef __kernel_ino_t +typedef __kernel_ulong_t __kernel_ino_t; +#endif + +#ifndef __kernel_mode_t +typedef unsigned int __kernel_mode_t; +#endif + +#ifndef __kernel_pid_t +typedef int __kernel_pid_t; +#endif + +#ifndef __kernel_ipc_pid_t +typedef int __kernel_ipc_pid_t; +#endif + +#ifndef __kernel_uid_t +typedef unsigned int __kernel_uid_t; +typedef unsigned int __kernel_gid_t; +#endif + +#ifndef __kernel_suseconds_t +typedef __kernel_long_t __kernel_suseconds_t; +#endif + +#ifndef __kernel_daddr_t +typedef int __kernel_daddr_t; +#endif + +#ifndef __kernel_uid32_t +typedef unsigned int __kernel_uid32_t; +typedef unsigned int __kernel_gid32_t; +#endif + +#ifndef __kernel_old_uid_t +typedef __kernel_uid_t __kernel_old_uid_t; +typedef __kernel_gid_t __kernel_old_gid_t; +#endif + +#ifndef __kernel_old_dev_t +typedef unsigned int __kernel_old_dev_t; +#endif + +/* + * Most 32 bit architectures use "unsigned int" size_t, + * and all 64 bit architectures use "unsigned long" size_t. + */ +#ifndef __kernel_size_t +#if __BITS_PER_LONG != 64 +typedef unsigned int __kernel_size_t; +typedef int __kernel_ssize_t; +typedef int __kernel_ptrdiff_t; +#else +typedef __kernel_ulong_t __kernel_size_t; +typedef __kernel_long_t __kernel_ssize_t; +typedef __kernel_long_t __kernel_ptrdiff_t; +#endif +#endif + +#ifndef __kernel_fsid_t +typedef struct { + int val[2]; +} __kernel_fsid_t; +#endif + +/* + * anything below here should be completely generic + */ +typedef __kernel_long_t __kernel_off_t; +typedef long long __kernel_loff_t; +typedef __kernel_long_t __kernel_old_time_t; +#ifndef __KERNEL__ +typedef __kernel_long_t __kernel_time_t; +#endif +typedef long long __kernel_time64_t; +typedef __kernel_long_t __kernel_clock_t; +typedef int __kernel_timer_t; +typedef int __kernel_clockid_t; +typedef char * __kernel_caddr_t; +typedef unsigned short __kernel_uid16_t; +typedef unsigned short __kernel_gid16_t; + +#endif /* __ASM_GENERIC_POSIX_TYPES_H */ diff --git a/tools/testing/selftests/mount_setattr/Makefile b/tools/testing/selftests/mount_setattr/Makefile index 0c0d7b1234c1..3f260506fe60 100644 --- a/tools/testing/selftests/mount_setattr/Makefile +++ b/tools/testing/selftests/mount_setattr/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 # Makefile for mount selftests. -CFLAGS = -g $(KHDR_INCLUDES) -Wall -O2 -pthread +CFLAGS = -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES) -Wall -O2 -pthread
TEST_GEN_PROGS := mount_setattr_test
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c index c6a8c732b802..36b8286e5f43 100644 --- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c +++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c @@ -11,6 +11,7 @@ #include <sys/wait.h> #include <sys/vfs.h> #include <sys/statvfs.h> +#include <asm/posix_types.h> /* get __kernel_* typedefs */ #include <sys/sysinfo.h> #include <stdlib.h> #include <unistd.h> @@ -137,7 +138,8 @@ static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flags, struct mount_attr *attr, size_t size) { - return syscall(__NR_mount_setattr, dfd, path, flags, attr, size); + int ret = syscall(__NR_mount_setattr, dfd, path, flags, attr, size); + return ret >= 0 ? ret : -errno; }
#ifndef OPEN_TREE_CLONE @@ -152,9 +154,24 @@ static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flag #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ #endif
+#define FSMOUNT_VALID_FLAGS \ + (MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | \ + MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME | MOUNT_ATTR_NODIRATIME | \ + MOUNT_ATTR_NOSYMFOLLOW) + +#define MOUNT_SETATTR_VALID_FLAGS (FSMOUNT_VALID_FLAGS | MOUNT_ATTR_IDMAP) + +#define MOUNT_SETATTR_PROPAGATION_FLAGS \ + (MS_UNBINDABLE | MS_PRIVATE | MS_SLAVE | MS_SHARED) + +#ifndef CHECK_FIELDS +#define CHECK_FIELDS (1ULL << 63) +#endif + static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags) { - return syscall(__NR_open_tree, dfd, filename, flags); + int ret = syscall(__NR_open_tree, dfd, filename, flags); + return ret >= 0 ? ret : -errno; }
static ssize_t write_nointr(int fd, const void *buf, size_t count) @@ -1497,4 +1514,36 @@ TEST_F(mount_setattr, mount_attr_nosymfollow) ASSERT_EQ(close(fd), 0); }
+TEST_F(mount_setattr, check_fields) +{ + struct mount_attr_ext { + struct mount_attr inner; + uint64_t extra1; + uint64_t extra2; + uint64_t extra3; + } attr; + + /* Set the structure to dummy values. */ + memset(&attr, 0xAB, sizeof(attr)); + + if (!mount_setattr_supported()) + SKIP(return, "mount_setattr syscall not supported"); + + /* Make sure that invalid sizes are detected even with CHECK_FIELDS. */ + EXPECT_EQ(sys_mount_setattr(-1, "", 0, (struct mount_attr *) &attr, CHECK_FIELDS), -EINVAL); + EXPECT_EQ(sys_mount_setattr(-1, "", 0, (struct mount_attr *) &attr, CHECK_FIELDS | 0xF000), -E2BIG); + + /* Actually get the CHECK_FIELDS results. */ + ASSERT_EQ(sys_mount_setattr(-1, "", 0, (struct mount_attr *) &attr, CHECK_FIELDS | sizeof(attr)), -EEXTSYS_NOOP); + + EXPECT_EQ(attr.inner.attr_set, MOUNT_SETATTR_VALID_FLAGS); + EXPECT_EQ(attr.inner.attr_clr, MOUNT_SETATTR_VALID_FLAGS); + EXPECT_EQ(attr.inner.propagation, MOUNT_SETATTR_PROPAGATION_FLAGS); + EXPECT_EQ(attr.inner.userns_fd, 0xFFFFFFFFFFFFFFFF); + /* Trailing fields outside ksize must be zeroed. */ + EXPECT_EQ(attr.extra1, 0); + EXPECT_EQ(attr.extra2, 0); + EXPECT_EQ(attr.extra3, 0); +} + TEST_HARNESS_MAIN
* Aleksa Sarai:
This is something that I've been thinking about for a while. We had a discussion at LPC 2020 about this[1] but the proposals suggested there never materialised.
In short, it is quite difficult for userspace to detect the feature capability of syscalls at runtime. This is something a lot of programs want to do, but they are forced to create elaborate scenarios to try to figure out if a feature is supported without causing damage to the system. For the vast majority of cases, each individual feature also needs to be tested individually (because syscall results are all-or-nothing), so testing even a single syscall's feature set can easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this talk[2] and in some old LKML threads (though I can't find the exact references ATM). The general flow looks like:
By the way, I have recently tried to document things from a glibc perspective (which is a bit broader because we also have purely userspace types):
[PATCH RFC] manual: Document how types change https://inbox.sourceware.org/libc-alpha/8734m4n1ij.fsf@oldenburg3.str.redhat.com/
(This patch has not yet been reviewed.)
On Thu, 10 Oct 2024 07:40:33 +1100, Aleksa Sarai wrote:
This is something that I've been thinking about for a while. We had a discussion at LPC 2020 about this[1] but the proposals suggested there never materialised.
In short, it is quite difficult for userspace to detect the feature capability of syscalls at runtime. This is something a lot of programs want to do, but they are forced to create elaborate scenarios to try to figure out if a feature is supported without causing damage to the system. For the vast majority of cases, each individual feature also needs to be tested individually (because syscall results are all-or-nothing), so testing even a single syscall's feature set can easily inflate the startup time of programs.
[...]
I think the copy_struct_to_user() is useful especially now that we'll gain another user with pidfd_info.
---
Applied to the vfs.usercopy branch of the vfs/vfs.git tree. Patches in the vfs.usercopy branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.usercopy
[01/10] uaccess: add copy_struct_to_user helper https://git.kernel.org/vfs/vfs/c/424a55a4a908 [02/10] sched_getattr: port to copy_struct_to_user https://git.kernel.org/vfs/vfs/c/112cca098a70
On 2024-10-21, Christian Brauner brauner@kernel.org wrote:
On Thu, 10 Oct 2024 07:40:33 +1100, Aleksa Sarai wrote:
This is something that I've been thinking about for a while. We had a discussion at LPC 2020 about this[1] but the proposals suggested there never materialised.
In short, it is quite difficult for userspace to detect the feature capability of syscalls at runtime. This is something a lot of programs want to do, but they are forced to create elaborate scenarios to try to figure out if a feature is supported without causing damage to the system. For the vast majority of cases, each individual feature also needs to be tested individually (because syscall results are all-or-nothing), so testing even a single syscall's feature set can easily inflate the startup time of programs.
[...]
I think the copy_struct_to_user() is useful especially now that we'll gain another user with pidfd_info.
Once we start extending pidfd_info, it might be necessary to add some more helpers to make it easier to figure out what bits to set in the returned request mask.
Applied to the vfs.usercopy branch of the vfs/vfs.git tree. Patches in the vfs.usercopy branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.usercopy
[01/10] uaccess: add copy_struct_to_user helper https://git.kernel.org/vfs/vfs/c/424a55a4a908 [02/10] sched_getattr: port to copy_struct_to_user https://git.kernel.org/vfs/vfs/c/112cca098a70
linux-kselftest-mirror@lists.linaro.org