This is something that I've been thinking about for a while. We had a discussion at LPC 2020 about this[1] but the proposals suggested there never materialised.
In short, it is quite difficult for userspace to detect the feature capability of syscalls at runtime. This is something a lot of programs want to do, but they are forced to create elaborate scenarios to try to figure out if a feature is supported without causing damage to the system. For the vast majority of cases, each individual feature also needs to be tested individually (because syscall results are all-or-nothing), so testing even a single syscall's feature set can easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this talk[2] and in some old LKML threads (though I can't find the exact references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the top bits are free for us to use as makeshift flag bits. This is preferable to using the per-syscall flag field inside the structure because seccomp can easily detect the bit in the flag and allow the probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid flags or bits. For pointer fields or fields that take an arbitrary value, the field has every bit set (0xFF... to fill the field) to indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will only ever be used for this purpose (so userspace can be sure that the request succeeded).
On older kernels, the syscall will return a different error (usually -E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by looking at the fields in the returned structure. Flags are checked by doing an AND with the flags field, and field support can checked by comparing to 0. In principle you could just AND the entire structure if you wanted to do this check generically without caring about the structure contents (this is what libraries might consider doing).
Userspace can even find out the internal kernel structure size by passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and backwards- compatible.
This allows programas to get a one-shot understanding of what features a syscall supports without having to do any elaborate setups or tricks to detect support for destructive features. Flags can simply be ANDed to check if they are in the supported set, and fields can just be checked to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to introspect the feature set of extensible struct (copy_struct_from_user) syscalls. It doesn't preclude the chance of a more generic mechanism being added later.
The intended way of using this interface to get feature information looks something like the following (imagine that openat2 has gained a new field and a new flag in the future):
static bool openat2_no_automount_supported; static bool openat2_cwd_fd_supported;
int check_openat2_support(void) { int err; struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT); openat2_cwd_fd_supported = (how.cwd_fd != 0); break; } }
[1]: https://lwn.net/Articles/830666/ [2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- Aleksa Sarai (8): uaccess: add copy_struct_to_user helper sched_getattr: port to copy_struct_to_user openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) openat2: add CHECK_FIELDS flag to usize argument clone3: add CHECK_FIELDS flag to usize argument selftests: openat2: add 0xFF poisoned data after misaligned struct selftests: openat2: add CHECK_FIELDS selftests selftests: clone3: add CHECK_FIELDS selftests
fs/open.c | 17 ++ include/linux/uaccess.h | 98 +++++++++ include/uapi/asm-generic/errno.h | 3 + include/uapi/linux/openat2.h | 2 + kernel/fork.c | 33 ++- kernel/sched/syscalls.c | 42 +--- tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/clone3/Makefile | 2 +- .../testing/selftests/clone3/clone3_check_fields.c | 229 +++++++++++++++++++++ tools/testing/selftests/openat2/openat2_test.c | 126 +++++++++++- 10 files changed, 504 insertions(+), 49 deletions(-) --- base-commit: 431c1646e1f86b949fa3685efc50b660a364c2b6 change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
This is based on copy_struct_from_user(), but there is one additional case to consider when creating a syscall that returns an extensible-struct to userspace -- how should data in the struct that cannot fit into the userspace struct be handled (ksize > usize)?
There are three possibilies:
1. The interface is like sched_getattr(2), where new information will be silently not provided to userspace. This is probably what most interfaces will want to do, as it provides the most possible backwards-compatibility.
2. The interface is like lsm_list_modules(2), where you want to return an error like -EMSGSIZE if not providing information could result in the userspace program making a serious mistake (such as one that could lead to a security problem) or if you want to provide some flag to userspace so they know that they are missing some information.
3. The interface is like statx(2), where there some kind of a request mask that indicates what data userspace would like. One could imagine that statx2(2) (using extensible structs) would want to return -EMSGSIZE if the user explicitly requested a field that their structure is too small to fit, but not return an error if the field was not explicitly requested. This is kind of a mix between (1) and (2) based on the requested mask.
The copy_struct_to_user() helper includes a an extra argument that is used to return a boolean flag indicating whether there was a non-zero byte in the trailing bytes that were not copied to userspace. This can be used in the following ways to handle all three cases, respectively:
1. Just pass NULL, as you don't care about this case.
2. Return an error (say -EMSGSIZE) if the argument was set to true by copy_struct_to_user().
3. If the argument was set to true by copy_struct_to_user(), check if there is a flag that implies a field larger than usize.
This is the only case where callers of copy_struct_to_user() should check usize themselves. This will probably require scanning an array that specifies what flags were added for each version of the flags struct and returning an error if the request mask matches any of the flags that were added in versions of the struct that are larger than usize.
At the moment we don't have any users of (3), so this patch doesn't include any helpers to make the necessary scanning easier, but it should be fairly easy to add some if necessary.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- include/linux/uaccess.h | 98 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index d8e4105a2f21..5d0a590ef65d 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -387,6 +387,104 @@ copy_struct_from_user(void *dst, size_t ksize, const void __user *src, return 0; }
+/** + * copy_struct_to_user: copy a struct to userspace + * @dst: Destination address, in userspace. This buffer must be @ksize + * bytes long. + * @usize: (Alleged) size of @dst struct. + * @src: Source address, in kernel space. + * @ksize: Size of @src struct. + * @ignored_trailing: Set to %true if there was a non-zero byte in @src that + * userspace cannot see because they are using an smaller struct. + * + * Copies a struct from kernel space to userspace, in a way that guarantees + * backwards-compatibility for struct syscall arguments (as long as future + * struct extensions are made such that all new fields are *appended* to the + * old struct, and zeroed-out new fields have the same meaning as the old + * struct). + * + * Some syscalls may wish to make sure that userspace knows about everything in + * the struct, and if there is a non-zero value that userspce doesn't know + * about, they want to return an error (such as -EMSGSIZE) or have some other + * fallback (such as adding a "you're missing some information" flag). If + * @ignored_trailing is non-%NULL, it will be set to %true if there was a + * non-zero byte that could not be copied to userspace (ie. was past @usize). + * + * While unconditionally returning an error in this case is the simplest + * solution, for maximum backward compatibility you should try to only return + * -EMSGSIZE if the user explicitly requested the data that couldn't be copied. + * Note that structure sizes can change due to header changes and simple + * recompilations without code changes(!), so if you care about + * @ignored_trailing you probably want to make sure that any new field data is + * associated with a flag. Otherwise you might assume that a program knows + * about data it does not. + * + * @ksize is just sizeof(*src), and @usize should've been passed by userspace. + * The recommended usage is something like the following: + * + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) + * { + * int err; + * bool ignored_trailing; + * struct foo karg = {}; + * + * if (usize > PAGE_SIZE) + * return -E2BIG; + * if (usize < FOO_SIZE_VER0) + * return -EINVAL; + * + * // ... modify karg somehow ... + * + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg), + * &ignored_trailing); + * if (err) + * return err; + * if (ignored_trailing) + * return -EMSGSIZE: + * + * // ... + * } + * + * There are three cases to consider: + * * If @usize == @ksize, then it's copied verbatim. + * * If @usize < @ksize, then the kernel is trying to pass userspace a newer + * struct than it supports. Thus we only copy the interoperable portions + * (@usize) and ignore the rest (but @ignored_trailing is set to %true if + * any of the trailing (@ksize - @usize) bytes are non-zero). + * * If @usize > @ksize, then the kernel is trying to pass userspace an older + * struct than userspace supports. In order to make sure the + * unknown-to-the-kernel fields don't contain garbage values, we zero the + * trailing (@usize - @ksize) bytes. + * + * Returns (in all cases, some data may have been copied): + * * -EFAULT: access to userspace failed. + */ +static __always_inline __must_check int +copy_struct_to_user(void __user *dst, size_t usize, const void *src, + size_t ksize, bool *ignored_trailing) +{ + size_t size = min(ksize, usize); + size_t rest = max(ksize, usize) - size; + + /* Double check if ksize is larger than a known object size. */ + if (WARN_ON_ONCE(ksize > __builtin_object_size(src, 1))) + return -E2BIG; + + /* Deal with trailing bytes. */ + if (usize > ksize) { + int ret = clear_user(dst + size, rest); + if (ret) + return ret; + } + if (ignored_trailing) + *ignored_trailing = ksize < usize && + memchr_inv(src + size, 0, rest) != NULL; + /* Copy the interoperable parts of the struct. */ + if (copy_to_user(dst, src, size)) + return -EFAULT; + return 0; +} + bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size);
long copy_from_kernel_nofault(void *dst, const void *src, size_t size);
On Mon, Sep 2, 2024, at 07:06, Aleksa Sarai wrote:
This is based on copy_struct_from_user(), but there is one additional case to consider when creating a syscall that returns an extensible-struct to userspace -- how should data in the struct that cannot fit into the userspace struct be handled (ksize > usize)?
There are three possibilies:
The interface is like sched_getattr(2), where new information will be silently not provided to userspace. This is probably what most interfaces will want to do, as it provides the most possible backwards-compatibility.
The interface is like lsm_list_modules(2), where you want to return an error like -EMSGSIZE if not providing information could result in the userspace program making a serious mistake (such as one that could lead to a security problem) or if you want to provide some flag to userspace so they know that they are missing some information.
I'm not sure if EMSGSIZE is the best choice here, my feeling is that the kernel should instead try to behave the same way as an older kernel that did not know about the extra fields:
- if the structure has always been longer than the provided buffer, -EMSGSIZE should likely have been returned all along. If older kernels just provided a short buffer, changing it now is an ABI change that would only affect intentionally broken callers, and I think keeping the behavior unchanged is more consistent.
- if an extra flag was added along with the larger buffer size, the old kernel would likely have rejected the new flag with -EINVAL, so I think returning the same thing for userspace built against the old kernel headers is more consistent.
+static __always_inline __must_check int +copy_struct_to_user(void __user *dst, size_t usize, const void *src,
size_t ksize, bool *ignored_trailing)
This feels like the kind of function that doesn't need to be inline at all and could be moved to lib/usercopy.c instead. It should clearly stay in the same place as copy_struct_from_user(), but we could move that as well.
+{
- size_t size = min(ksize, usize);
- size_t rest = max(ksize, usize) - size;
- /* Double check if ksize is larger than a known object size. */
- if (WARN_ON_ONCE(ksize > __builtin_object_size(src, 1)))
return -E2BIG;
I guess the __builtin_object_size() check is the reason for making it __always_inline, but that could be done in a trivial inline wrapper around the extern function. If ksize is always expected to be a constant for all callers, the check could even become a compile-time check instead of a WARN_ON_ONCE() that feels wrong here: if there is a code path where this can get triggered, there is clearly a kernel bug, but the only way to find out is to have a userspace caller that triggers the code path.
Again, the same code is already in copy_struct_from_user(), so this is not something you are adding but rather something we may want to change for both.
Arnd
On 2024-09-02, Arnd Bergmann arnd@arndb.de wrote:
On Mon, Sep 2, 2024, at 07:06, Aleksa Sarai wrote:
This is based on copy_struct_from_user(), but there is one additional case to consider when creating a syscall that returns an extensible-struct to userspace -- how should data in the struct that cannot fit into the userspace struct be handled (ksize > usize)?
There are three possibilies:
The interface is like sched_getattr(2), where new information will be silently not provided to userspace. This is probably what most interfaces will want to do, as it provides the most possible backwards-compatibility.
The interface is like lsm_list_modules(2), where you want to return an error like -EMSGSIZE if not providing information could result in the userspace program making a serious mistake (such as one that could lead to a security problem) or if you want to provide some flag to userspace so they know that they are missing some information.
I'm not sure if EMSGSIZE is the best choice here, my feeling is that the kernel should instead try to behave the same way as an older kernel that did not know about the extra fields:
I agree this API is not ideal for syscalls because it can lead to backward-compatibility issues, but that is how lsm_list_modules(2) works. I suspect most syscalls will go with designs (1) or (3).
if the structure has always been longer than the provided buffer, -EMSGSIZE should likely have been returned all along. If older kernels just provided a short buffer, changing it now is an ABI change that would only affect intentionally broken callers, and I think keeping the behavior unchanged is more consistent.
if an extra flag was added along with the larger buffer size, the old kernel would likely have rejected the new flag with -EINVAL, so I think returning the same thing for userspace built against the old kernel headers is more consistent.
+static __always_inline __must_check int +copy_struct_to_user(void __user *dst, size_t usize, const void *src,
size_t ksize, bool *ignored_trailing)
This feels like the kind of function that doesn't need to be inline at all and could be moved to lib/usercopy.c instead. It should clearly stay in the same place as copy_struct_from_user(), but we could move that as well.
IIRC Kees suggested copy_struct_from_user() be inline when I first included it, though I would have to dig through the old threads to find the reasoning. __builtin_object_size() was added some time after it was merged so that wasn't the original reason.
+{
- size_t size = min(ksize, usize);
- size_t rest = max(ksize, usize) - size;
- /* Double check if ksize is larger than a known object size. */
- if (WARN_ON_ONCE(ksize > __builtin_object_size(src, 1)))
return -E2BIG;
I guess the __builtin_object_size() check is the reason for making it __always_inline, but that could be done in a trivial inline wrapper around the extern function. If ksize is always expected to be a constant for all callers, the check could even become a compile-time check instead of a WARN_ON_ONCE() that feels wrong here: if there is a code path where this can get triggered, there is clearly a kernel bug, but the only way to find out is to have a userspace caller that triggers the code path.
Again, the same code is already in copy_struct_from_user(), so this is not something you are adding but rather something we may want to change for both.
Arnd
sched_getattr(2) doesn't care about trailing non-zero bytes in the (ksize > usize) case, so just use copy_struct_to_user() without checking ignored_trailing.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- kernel/sched/syscalls.c | 42 ++---------------------------------------- 1 file changed, 2 insertions(+), 40 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index ae1b42775ef9..4ccc058bae16 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -1147,45 +1147,6 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param) return copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; }
-/* - * Copy the kernel size attribute structure (which might be larger - * than what user-space knows about) to user-space. - * - * Note that all cases are valid: user-space buffer can be larger or - * smaller than the kernel-space buffer. The usual case is that both - * have the same size. - */ -static int -sched_attr_copy_to_user(struct sched_attr __user *uattr, - struct sched_attr *kattr, - unsigned int usize) -{ - unsigned int ksize = sizeof(*kattr); - - if (!access_ok(uattr, usize)) - return -EFAULT; - - /* - * sched_getattr() ABI forwards and backwards compatibility: - * - * If usize == ksize then we just copy everything to user-space and all is good. - * - * If usize < ksize then we only copy as much as user-space has space for, - * this keeps ABI compatibility as well. We skip the rest. - * - * If usize > ksize then user-space is using a newer version of the ABI, - * which part the kernel doesn't know about. Just ignore it - tooling can - * detect the kernel's knowledge of attributes from the attr->size value - * which is set to ksize in this case. - */ - kattr->size = min(usize, ksize); - - if (copy_to_user(uattr, kattr, kattr->size)) - return -EFAULT; - - return 0; -} - /** * sys_sched_getattr - similar to sched_getparam, but with sched_attr * @pid: the pid in question. @@ -1230,7 +1191,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, #endif }
- return sched_attr_copy_to_user(uattr, &kattr, usize); + kattr.size = min(usize, sizeof(kattr)); + return copy_struct_to_user(uattr, usize, &kattr, sizeof(kattr), NULL); }
#ifdef CONFIG_SMP
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Cc: stable@vger.kernel.org # v5.6+ Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/open.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/open.c b/fs/open.c index 22adbef7ecc2..30bfcddd505d 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1458,6 +1458,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL; + if (unlikely(usize > PAGE_SIZE)) + return -E2BIG;
err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); if (err)
On Mon, Sep 2, 2024, at 07:06, Aleksa Sarai wrote:
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Right, it's probably a good idea to have a limit there rather than having a busy loop with a user-provided length when the only bound is the available virtual memory.
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
Is PAGE_SIZE significant here? If there is a need to enforce a limit, I would expect this to be the same regardless of kernel configuration, since the structure layout is also independent of the configuration.
Where is the current -EFAULT for users passing more than a page? I only see it for reads beyond the VMA, but not e.g. when checking terabytes of zero pages from an anonymous mapping.
Arnd
On 2024-09-02, Arnd Bergmann arnd@arndb.de wrote:
On Mon, Sep 2, 2024, at 07:06, Aleksa Sarai wrote:
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Right, it's probably a good idea to have a limit there rather than having a busy loop with a user-provided length when the only bound is the available virtual memory.
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
Is PAGE_SIZE significant here? If there is a need to enforce a limit, I would expect this to be the same regardless of kernel configuration, since the structure layout is also independent of the configuration.
PAGE_SIZE is what clone3, perf_event_open, sched_setattr, bpf, etc all use. The idea was that PAGE_SIZE is the absolute limit of any reasonable extensible structure size because we are never going to have argument structures that are larger than a page (I think this was discussed in the original copy_struct_from_user() patchset thread in late 2019, but I can't find the reference at the moment.)
I simply forgot to add this when I first submitted openat2, the original intention was to just match the other syscalls.
Where is the current -EFAULT for users passing more than a page? I only see it for reads beyond the VMA, but not e.g. when checking terabytes of zero pages from an anonymous mapping.
I meant that we in practice return -EFAULT if you pass a really large size (because you end up running off the end of mapped memory). There is no explicit -EFAULT for large sizes, which is exactly the problem. :P
Arnd
On Mon, Sep 2, 2024, at 16:08, Aleksa Sarai wrote:
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
Is PAGE_SIZE significant here? If there is a need to enforce a limit, I would expect this to be the same regardless of kernel configuration, since the structure layout is also independent of the configuration.
PAGE_SIZE is what clone3, perf_event_open, sched_setattr, bpf, etc all use. The idea was that PAGE_SIZE is the absolute limit of any reasonable extensible structure size because we are never going to have argument structures that are larger than a page (I think this was discussed in the original copy_struct_from_user() patchset thread in late 2019, but I can't find the reference at the moment.)
I simply forgot to add this when I first submitted openat2, the original intention was to just match the other syscalls.
Ok, I see. I guess it makes sense to keep this one consistent with the other ones, but we may want to revisit this in the future and come up with something that is independent of CONFIG_PAGE_SIZE.
Where is the current -EFAULT for users passing more than a page? I only see it for reads beyond the VMA, but not e.g. when checking terabytes of zero pages from an anonymous mapping.
I meant that we in practice return -EFAULT if you pass a really large size (because you end up running off the end of mapped memory). There is no explicit -EFAULT for large sizes, which is exactly the problem. :P
Got it, thanks.
Arnd
In order for userspace to be able to know what flags and fields the kernel supports, it is currently necessary for them to do a bunch of fairly subtle self-checks where you need to get a syscall to return a non-EINVAL error or no-op for each flag you wish to use. If you get -EINVAL you know the flag is unsupported, otherwise you know it is supported.
This doesn't scale well for programs that need to check many flags, and not all syscalls can be easily checked (how would you check for new flags for umount2 or clone3 without side-effects?). To solve this problem, we can take advantage of the extensible struct API used by copy_struct_from_user() by providing a special CHECK_FIELDS flag to extensible struct syscalls (like openat2 and clone3) which will:
1. Cause the syscall to fill the structure with every valid bit the kernel understands. For flag arguments, this is the set of all valid flag bits. For pointer and file descriptor arguments, this would be all 0xFF bits (to indicate that any bits are valid). Userspace can then easily check whether the flag they wanted is supported (by doing a simple bitwise AND) or if a field itself is supported (by checking if it is non-zero / all 0xFF).
2. Return a specific no-op error (-EEXTSYS_NOOP) that is not used as an error by any other kernel code, so that userspace can be absolutely sure that the kernel supports CHECK_FIELDS.
Rather than passing CHECK_FIELDS using the standard flags arguments for the syscall, CHECK_FIELDS is instead the highest bit in the provided struct size. The high bits of the size are never going to be non-zero (we currently only allow size to be up to PAGE_SIZE, and it seems very unlikely we will ever allow several exabyte structure arguments).
By passing the flag in the structure size, we can be sure that old kernels will return a consistent error code (-EFAULT in openat2's case) and that seccomp can properly filter this syscall mode (which is guaranteed to be a no-op on all kernels -- it could even force -EEXTSYS_NOOP to make the userspace program think the kernel doesn't support any syscall features).
The intended way of using this interface to get feature information looks something like the following (imagine that openat2 has gained a new field and a new flag in the future):
static bool openat2_no_automount_supported; static bool openat2_cwd_fd_supported;
int check_openat2_support(void) { int err; struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT); openat2_cwd_fd_supported = (how.cwd_fd != 0); break; } }
Link: https://youtu.be/ggD-eb3yPVs Link: https://lwn.net/Articles/830666/ Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/open.c | 15 +++++++++++++++ include/uapi/asm-generic/errno.h | 3 +++ include/uapi/linux/openat2.h | 2 ++ 3 files changed, 20 insertions(+)
diff --git a/fs/open.c b/fs/open.c index 30bfcddd505d..10bfc8d6555c 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1458,6 +1458,21 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL; + + if (unlikely(usize & CHECK_FIELDS)) { + usize &= ~CHECK_FIELDS; + + memset(&tmp, 0, sizeof(tmp)); + tmp = (struct open_how) { + .flags = VALID_OPEN_FLAGS, + .mode = S_IALLUGO, + .resolve = VALID_RESOLVE_FLAGS, + }; + + err = copy_struct_to_user(how, usize, &tmp, sizeof(tmp), NULL); + return err ?: -EEXTSYS_NOOP; + } + if (unlikely(usize > PAGE_SIZE)) return -E2BIG;
diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h index cf9c51ac49f9..f5bfe081e73a 100644 --- a/include/uapi/asm-generic/errno.h +++ b/include/uapi/asm-generic/errno.h @@ -120,4 +120,7 @@
#define EHWPOISON 133 /* Memory page has hardware error */
+/* For extensible syscalls. */ +#define EEXTSYS_NOOP 134 /* Extensible syscall performed no operation */ + #endif diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index a5feb7604948..6052a504cfa4 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -4,6 +4,8 @@
#include <linux/types.h>
+#define CHECK_FIELDS (1ULL << 63) + /* * Arguments for how openat2(2) should open the target path. If only @flags and * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
As with openat2(2), this allows userspace to easily figure out what flags and fields are supported by clone3(2). For fields which are not flag-based, we simply set every bit in the field so that a naive bitwise-and would show that any value of the field is valid.
For args->exit_signal, since we have an explicit bitmask for the field defined already (CSIGNAL) we can indicate that only those bits are supported by current kernels. If we add some extra bits to exit_signal in the future, being able to detect them as new features would be quite useful.
The intended way of using this interface to get feature information looks something like the following:
static bool clone3_clear_sighand_supported; static bool clone3_cgroup_supported;
int check_clone3_support(void) { int err; struct clone_args args = {};
err = clone3(&args, CHECK_FIELDS | sizeof(args)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: clone3_clear_sighand_supported = (how.flags & CLONE_CLEAR_SIGHAND); clone3_cgroup_supported = (how.flags & CLONE_INTO_CGROUP) && (how.cgroup != 0); break; } }
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- kernel/fork.c | 33 +++++++++++++++++++++++++++++---- 1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c index cc760491f201..1a170098a1c5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2925,6 +2925,9 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, } #endif
+ +#define CLONE3_VALID_FLAGS (CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP) + noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, struct clone_args __user *uargs, size_t usize) @@ -2941,11 +2944,34 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, CLONE_ARGS_SIZE_VER2); BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
- if (unlikely(usize > PAGE_SIZE)) - return -E2BIG; if (unlikely(usize < CLONE_ARGS_SIZE_VER0)) return -EINVAL;
+ if (unlikely(usize & CHECK_FIELDS)) { + usize &= ~CHECK_FIELDS; + + memset(&args, 0, sizeof(args)); + args = (struct clone_args) { + .flags = CLONE3_VALID_FLAGS, + .pidfd = 0xFFFFFFFFFFFFFFFF, + .child_tid = 0xFFFFFFFFFFFFFFFF, + .parent_tid = 0xFFFFFFFFFFFFFFFF, + .exit_signal = (u64) CSIGNAL, + .stack = 0xFFFFFFFFFFFFFFFF, + .stack_size = 0xFFFFFFFFFFFFFFFF, + .tls = 0xFFFFFFFFFFFFFFFF, + .set_tid = 0xFFFFFFFFFFFFFFFF, + .set_tid_size = 0xFFFFFFFFFFFFFFFF, + .cgroup = 0xFFFFFFFFFFFFFFFF, + }; + + err = copy_struct_to_user(uargs, usize, &args, sizeof(args), NULL); + return err ?: -EEXTSYS_NOOP; + } + + if (unlikely(usize > PAGE_SIZE)) + return -E2BIG; + err = copy_struct_from_user(&args, sizeof(args), uargs, usize); if (err) return err; @@ -3025,8 +3051,7 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs) static bool clone3_args_valid(struct kernel_clone_args *kargs) { /* Verify that no unknown flags are passed along. */ - if (kargs->flags & - ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP)) + if (kargs->flags & ~CLONE3_VALID_FLAGS) return false;
/*
We should also verify that poisoned data after a misaligned struct is also handled correctly by is_zeroed_user(). This test passes with no kernel changes needed, so is_zeroed_user() was correct already.
Fixes: b28a10aedcd4 ("selftests: add openat2(2) selftests") Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/openat2/openat2_test.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c index 5790ab446527..4ca175a16ad6 100644 --- a/tools/testing/selftests/openat2/openat2_test.c +++ b/tools/testing/selftests/openat2/openat2_test.c @@ -112,9 +112,9 @@ void test_openat2_struct(void) * * This is effectively to check that is_zeroed_user() works. */ - copy = malloc(misalign + sizeof(how_ext)); + copy = malloc(misalign*2 + sizeof(how_ext)); how_copy = copy + misalign; - memset(copy, 0xff, misalign); + memset(copy, 0xff, misalign*2 + sizeof(how_ext)); memcpy(how_copy, &how_ext, sizeof(how_ext)); }
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/openat2/openat2_test.c | 122 ++++++++++++++++++++++++- 1 file changed, 120 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c index 4ca175a16ad6..8afb41d0958a 100644 --- a/tools/testing/selftests/openat2/openat2_test.c +++ b/tools/testing/selftests/openat2/openat2_test.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-or-later /* * Author: Aleksa Sarai cyphar@cyphar.com - * Copyright (C) 2018-2019 SUSE LLC. + * Copyright (C) 2018-2024 SUSE LLC. */
#define _GNU_SOURCE @@ -29,6 +29,14 @@ #define O_LARGEFILE 0x8000 #endif
+#ifndef CHECK_FIELDS +#define CHECK_FIELDS (1ULL << 63) +#endif + +#ifndef EEXTSYS_NOOP +#define EEXTSYS_NOOP 134 +#endif + struct open_how_ext { struct open_how inner; uint32_t extra1; @@ -45,6 +53,114 @@ struct struct_test { int err; };
+#define NUM_OPENAT2_CHECK_FIELDS_TESTS 1 +#define NUM_OPENAT2_CHECK_FIELDS_VARIATIONS 13 + +static bool check(bool *failed, bool pred) +{ + *failed |= pred; + return pred; +} + +static void test_openat2_check_fields(void) +{ + int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 }; + + for (int i = 0; i < ARRAY_LEN(misalignments); i++) { + int fd, misalign = misalignments[i]; + bool failed = false; + char *fdpath = NULL; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + struct open_how_ext how_ext = {}, *how_copy = &how_ext; + void *copy = NULL; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + if (misalign) { + /* + * Explicitly misalign the structure copying it with + * the given (mis)alignment offset. The other data is + * set to zero and we verify this afterwards to make + * sure CHECK_FIELDS doesn't write outside the buffer. + */ + copy = malloc(misalign*2 + sizeof(how_ext)); + how_copy = copy + misalign; + memset(copy, 0x00, misalign*2 + sizeof(how_ext)); + memcpy(how_copy, &how_ext, sizeof(how_ext)); + } + + fd = raw_openat2(AT_FDCWD, ".", how_copy, CHECK_FIELDS | sizeof(*how_copy)); + if (check(&failed, (fd != -EEXTSYS_NOOP))) + ksft_print_msg("openat2(CHECK_FIELDS) returned wrong error code: %d (%s)", + fd, strerror(-fd)); + if (fd >= 0) { + fdpath = fdreadlink(fd); + close(fd); + } + + if (failed) { + ksft_print_msg("openat2(CHECK_FIELDS) unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s']\n", fd, fdpath); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + + if (check(&failed, !(how_copy->inner.flags & O_PATH))) + ksft_print_msg("openat2(CHECK_FIELDS) returned flags is missing O_PATH (0x%.16x): 0x%.16llx\n", + O_PATH, how_copy->inner.flags); + + if (check(&failed, (how_copy->inner.mode != 07777))) + ksft_print_msg("openat2(CHECK_FIELDS) returned mode is invalid (0o%o): 0o%.4llo\n", + 07777, how_copy->inner.mode); + + if (check(&failed, !(how_copy->inner.resolve & RESOLVE_IN_ROOT))) + ksft_print_msg("openat2(CHECK_FIELDS) returned resolve flags is missing RESOLVE_IN_ROOT (0x%.16x): 0x%.16llx\n", + RESOLVE_IN_ROOT, how_copy->inner.resolve); + + /* Verify that the buffer space outside the struct wasn't written to. */ + if (check(&failed, how_copy->extra1 != 0)) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside open_how (extra1): 0x%x\n", + how_copy->extra1); + if (check(&failed, how_copy->extra2 != 0)) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside open_how (extra2): 0x%x\n", + how_copy->extra2); + if (check(&failed, how_copy->extra3 != 0)) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside open_how (extra3): 0x%x\n", + how_copy->extra3); + + if (misalign) { + for (size_t i = 0; i < misalign; i++) { + char *p = copy + i; + if (check(&failed, *p != '\x00')) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside the size: buffer[%ld] = 0x%.2x\n", + p - (char *) copy, *p); + } + for (size_t i = 0; i < misalign; i++) { + char *p = copy + misalign + sizeof(how_ext) + i; + if (check(&failed, *p != '\x00')) + ksft_print_msg("openat2(CHECK_FIELDS) touched a byte outside the size: buffer[%ld] = 0x%.2x\n", + p - (char *) copy, *p); + } + } + + if (failed) + resultfn = ksft_test_result_fail; + +skip: + resultfn("openat2(CHECK_FIELDS) [misalign=%d]\n", misalign); + + free(copy); + free(fdpath); + fflush(stdout); + } +} + #define NUM_OPENAT2_STRUCT_TESTS 7 #define NUM_OPENAT2_STRUCT_VARIATIONS 13
@@ -320,7 +436,8 @@ void test_openat2_flags(void) } }
-#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \ +#define NUM_TESTS (NUM_OPENAT2_CHECK_FIELDS_TESTS * NUM_OPENAT2_CHECK_FIELDS_VARIATIONS + \ + NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \ NUM_OPENAT2_FLAG_TESTS)
int main(int argc, char **argv) @@ -328,6 +445,7 @@ int main(int argc, char **argv) ksft_print_header(); ksft_set_plan(NUM_TESTS);
+ test_openat2_check_fields(); test_openat2_struct(); test_openat2_flags();
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/clone3/Makefile | 2 +- .../testing/selftests/clone3/clone3_check_fields.c | 229 +++++++++++++++++++++ 3 files changed, 231 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore index 83c0f6246055..4ec3e1ecd273 100644 --- a/tools/testing/selftests/clone3/.gitignore +++ b/tools/testing/selftests/clone3/.gitignore @@ -3,3 +3,4 @@ clone3 clone3_clear_sighand clone3_set_tid clone3_cap_checkpoint_restore +clone3_check_fields diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile index 84832c369a2e..d310f2268066 100644 --- a/tools/testing/selftests/clone3/Makefile +++ b/tools/testing/selftests/clone3/Makefile @@ -3,6 +3,6 @@ CFLAGS += -g -std=gnu99 $(KHDR_INCLUDES) LDLIBS += -lcap
TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \ - clone3_cap_checkpoint_restore + clone3_cap_checkpoint_restore clone3_check_fields
include ../lib.mk diff --git a/tools/testing/selftests/clone3/clone3_check_fields.c b/tools/testing/selftests/clone3/clone3_check_fields.c new file mode 100644 index 000000000000..78b5cbf807a6 --- /dev/null +++ b/tools/testing/selftests/clone3/clone3_check_fields.c @@ -0,0 +1,229 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2024 SUSE LLC + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <inttypes.h> +#include <linux/types.h> +#include <linux/sched.h> +#include <stdbool.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/un.h> +#include <sys/wait.h> +#include <unistd.h> +#include <sched.h> + +#include "../kselftest.h" +#include "clone3_selftests.h" + +#ifndef CHECK_FIELDS +#define CHECK_FIELDS (1ULL << 63) +#endif + +#ifndef EEXTSYS_NOOP +#define EEXTSYS_NOOP 134 +#endif + +struct __clone_args_v0 { + __aligned_u64 flags; + __aligned_u64 pidfd; + __aligned_u64 child_tid; + __aligned_u64 parent_tid; + __aligned_u64 exit_signal; + __aligned_u64 stack; + __aligned_u64 stack_size; + __aligned_u64 tls; +}; + +struct __clone_args_v1 { + __aligned_u64 flags; + __aligned_u64 pidfd; + __aligned_u64 child_tid; + __aligned_u64 parent_tid; + __aligned_u64 exit_signal; + __aligned_u64 stack; + __aligned_u64 stack_size; + __aligned_u64 tls; + __aligned_u64 set_tid; + __aligned_u64 set_tid_size; +}; + +struct __clone_args_v2 { + __aligned_u64 flags; + __aligned_u64 pidfd; + __aligned_u64 child_tid; + __aligned_u64 parent_tid; + __aligned_u64 exit_signal; + __aligned_u64 stack; + __aligned_u64 stack_size; + __aligned_u64 tls; + __aligned_u64 set_tid; + __aligned_u64 set_tid_size; + __aligned_u64 cgroup; +}; + +static int call_clone3(void *clone_args, size_t size) +{ + int status; + pid_t pid; + + pid = sys_clone3(clone_args, size); + if (pid < 0) { + ksft_print_msg("%d (%s) - Failed to create new process\n", + errno, strerror(errno)); + return -errno; + } + + if (pid == 0) { + ksft_print_msg("I am the child, my PID is %d\n", getpid()); + _exit(EXIT_SUCCESS); + } + + ksft_print_msg("I am the parent (%d). My child's pid is %d\n", + getpid(), pid); + + if (waitpid(-1, &status, __WALL) < 0) { + ksft_print_msg("waitpid() returned %s\n", strerror(errno)); + return -errno; + } + if (!WIFEXITED(status)) { + ksft_print_msg("Child did not exit normally, status 0x%x\n", + status); + return EXIT_FAILURE; + } + if (WEXITSTATUS(status)) + return WEXITSTATUS(status); + + return 0; +} + +static bool check(bool *failed, bool pred) +{ + *failed |= pred; + return pred; +} + +static void test_clone3_check_fields(const char *test_name, size_t struct_size) +{ + size_t bufsize; + void *buffer; + pid_t pid; + bool failed = false; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + /* Allocate some bytes after clone_args to verify that the . */ + bufsize = struct_size + 16; + buffer = malloc(bufsize); + memset(buffer, 0, bufsize); + + pid = call_clone3(buffer, CHECK_FIELDS | struct_size); + if (check(&failed, (pid != -EEXTSYS_NOOP))) + ksft_print_msg("clone3(CHECK_FIELDS) returned the wrong error code: %d (%s)\n", + pid, strerror(-pid)); + + switch (struct_size) { + case sizeof(struct __clone_args_v2): { + struct __clone_args_v2 *args = buffer; + + if (check(&failed, (args->cgroup != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong cgroup field: 0x%.16llx != 0x%.16llx\n", + args->cgroup, 0xFFFFFFFFFFFFFFFF); + + /* fallthrough; */ + } + case sizeof(struct __clone_args_v1): { + struct __clone_args_v1 *args = buffer; + + if (check(&failed, (args->set_tid != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong set_tid field: 0x%.16llx != 0x%.16llx\n", + args->set_tid, 0xFFFFFFFFFFFFFFFF); + if (check(&failed, (args->set_tid_size != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong set_tid_size field: 0x%.16llx != 0x%.16llx\n", + args->set_tid_size, 0xFFFFFFFFFFFFFFFF); + + /* fallthrough; */ + } + case sizeof(struct __clone_args_v0): { + struct __clone_args_v0 *args = buffer; + + if (check(&failed, !(args->flags & CLONE_NEWUSER))) + ksft_print_msg("clone3(CHECK_FIELDS) is missing CLONE_NEWUSER in flags: 0x%.16llx (0x%.16llx)\n", + args->flags, CLONE_NEWUSER); + if (check(&failed, !(args->flags & CLONE_THREAD))) + ksft_print_msg("clone3(CHECK_FIELDS) is missing CLONE_THREAD in flags: 0x%.16llx (0x%.16llx)\n", + args->flags, CLONE_THREAD); + /* + * CLONE_INTO_CGROUP was added in v2, but it will be set even + * with smaller structure sizes. + */ + if (check(&failed, !(args->flags & CLONE_INTO_CGROUP))) + ksft_print_msg("clone3(CHECK_FIELDS) is missing CLONE_INTO_CGROUP in flags: 0x%.16llx (0x%.16llx)\n", + args->flags, CLONE_INTO_CGROUP); + + if (check(&failed, (args->exit_signal != 0xFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong exit_signal field: 0x%.16llx != 0x%.16llx\n", + args->exit_signal, 0xFF); + + if (check(&failed, (args->stack != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong stack field: 0x%.16llx != 0x%.16llx\n", + args->stack, 0xFFFFFFFFFFFFFFFF); + if (check(&failed, (args->stack_size != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong stack_size field: 0x%.16llx != 0x%.16llx\n", + args->stack_size, 0xFFFFFFFFFFFFFFFF); + if (check(&failed, (args->tls != 0xFFFFFFFFFFFFFFFF))) + ksft_print_msg("clone3(CHECK_FIELDS) has wrong tls field: 0x%.16llx != 0x%.16llx\n", + args->tls, 0xFFFFFFFFFFFFFFFF); + + break; + } + default: + fprintf(stderr, "INVALID STRUCTURE SIZE: %d\n", struct_size); + abort(); + } + + /* Verify that the trailing parts of the buffer are still 0. */ + for (size_t i = struct_size; i < bufsize; i++) { + char ch = ((char *)buffer)[i]; + if (check(&failed, (ch != '\x00'))) + ksft_print_msg("clone3(CHECK_FIELDS) touched a byte outside the size: buffer[%d] = 0x%.2x\n", + i, ch); + } + + if (failed) + resultfn = ksft_test_result_fail; + + resultfn("clone3(CHECK_FIELDS) with %s\n", test_name); + free(buffer); +} + +struct check_fields_test { + const char *name; + size_t struct_size; +}; + +static struct check_fields_test check_fields_tests[] = { + {"struct v0", sizeof(struct __clone_args_v0)}, + {"struct v1", sizeof(struct __clone_args_v1)}, + {"struct v2", sizeof(struct __clone_args_v2)}, +}; + +int main(void) +{ + ksft_print_header(); + ksft_set_plan(ARRAY_SIZE(check_fields_tests)); + test_clone3_supported(); + + for (int i = 0; i < ARRAY_SIZE(check_fields_tests); i++) { + struct check_fields_test *test = &check_fields_tests[i]; + test_clone3_check_fields(test->name, test->struct_size); + } + + ksft_finished(); +}
linux-kselftest-mirror@lists.linaro.org