This is something that I've been thinking about for a while. We had a discussion at LPC 2020 about this[1] but the proposals suggested there never materialised.
In short, it is quite difficult for userspace to detect the feature capability of syscalls at runtime. This is something a lot of programs want to do, but they are forced to create elaborate scenarios to try to figure out if a feature is supported without causing damage to the system. For the vast majority of cases, each individual feature also needs to be tested individually (because syscall results are all-or-nothing), so testing even a single syscall's feature set can easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this talk[2] and in some old LKML threads (though I can't find the exact references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the top bits are free for us to use as makeshift flag bits. This is preferable to using the per-syscall flag field inside the structure because seccomp can easily detect the bit in the flag and allow the probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid flags or bits. For pointer fields or fields that take an arbitrary value, the field has every bit set (0xFF... to fill the field) to indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will only ever be used for this purpose (so userspace can be sure that the request succeeded).
On older kernels, the syscall will return a different error (usually -E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by looking at the fields in the returned structure. Flags are checked by doing an AND with the flags field, and field support can checked by comparing to 0. In principle you could just AND the entire structure if you wanted to do this check generically without caring about the structure contents (this is what libraries might consider doing).
Userspace can even find out the internal kernel structure size by passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and backwards- compatible.
This allows programas to get a one-shot understanding of what features a syscall supports without having to do any elaborate setups or tricks to detect support for destructive features. Flags can simply be ANDed to check if they are in the supported set, and fields can just be checked to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to introspect the feature set of extensible struct (copy_struct_from_user) syscalls. It doesn't preclude the chance of a more generic mechanism being added later.
The intended way of using this interface to get feature information looks something like the following (imagine that openat2 has gained a new field and a new flag in the future):
static bool openat2_no_automount_supported; static bool openat2_cwd_fd_supported;
int check_openat2_support(void) { int err; struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how)); assert(err < 0); switch (errno) { case EFAULT: case E2BIG: /* Old kernel... */ check_support_the_old_way(); break; case EEXTSYS_NOOP: openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT); openat2_cwd_fd_supported = (how.cwd_fd != 0); break; } }
[1]: https://lwn.net/Articles/830666/ [2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- Aleksa Sarai (8): uaccess: add copy_struct_to_user helper sched_getattr: port to copy_struct_to_user openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) openat2: add CHECK_FIELDS flag to usize argument clone3: add CHECK_FIELDS flag to usize argument selftests: openat2: add 0xFF poisoned data after misaligned struct selftests: openat2: add CHECK_FIELDS selftests selftests: clone3: add CHECK_FIELDS selftests
fs/open.c | 17 ++ include/linux/uaccess.h | 98 +++++++++ include/uapi/asm-generic/errno.h | 3 + include/uapi/linux/openat2.h | 2 + kernel/fork.c | 33 ++- kernel/sched/syscalls.c | 42 +--- tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/clone3/Makefile | 2 +- .../testing/selftests/clone3/clone3_check_fields.c | 229 +++++++++++++++++++++ tools/testing/selftests/openat2/openat2_test.c | 126 +++++++++++- 10 files changed, 504 insertions(+), 49 deletions(-) --- base-commit: 431c1646e1f86b949fa3685efc50b660a364c2b6 change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Cc: stable@vger.kernel.org # v5.6+ Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/open.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/open.c b/fs/open.c index 22adbef7ecc2..30bfcddd505d 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1458,6 +1458,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL; + if (unlikely(usize > PAGE_SIZE)) + return -E2BIG;
err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); if (err)
On Mon, Sep 2, 2024, at 07:06, Aleksa Sarai wrote:
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Right, it's probably a good idea to have a limit there rather than having a busy loop with a user-provided length when the only bound is the available virtual memory.
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
Is PAGE_SIZE significant here? If there is a need to enforce a limit, I would expect this to be the same regardless of kernel configuration, since the structure layout is also independent of the configuration.
Where is the current -EFAULT for users passing more than a page? I only see it for reads beyond the VMA, but not e.g. when checking terabytes of zero pages from an anonymous mapping.
Arnd
On 2024-09-02, Arnd Bergmann arnd@arndb.de wrote:
On Mon, Sep 2, 2024, at 07:06, Aleksa Sarai wrote:
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue.
Right, it's probably a good idea to have a limit there rather than having a busy loop with a user-provided length when the only bound is the available virtual memory.
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
Is PAGE_SIZE significant here? If there is a need to enforce a limit, I would expect this to be the same regardless of kernel configuration, since the structure layout is also independent of the configuration.
PAGE_SIZE is what clone3, perf_event_open, sched_setattr, bpf, etc all use. The idea was that PAGE_SIZE is the absolute limit of any reasonable extensible structure size because we are never going to have argument structures that are larger than a page (I think this was discussed in the original copy_struct_from_user() patchset thread in late 2019, but I can't find the reference at the moment.)
I simply forgot to add this when I first submitted openat2, the original intention was to just match the other syscalls.
Where is the current -EFAULT for users passing more than a page? I only see it for reads beyond the VMA, but not e.g. when checking terabytes of zero pages from an anonymous mapping.
I meant that we in practice return -EFAULT if you pass a really large size (because you end up running off the end of mapped memory). There is no explicit -EFAULT for large sizes, which is exactly the problem. :P
Arnd
On Mon, Sep 2, 2024, at 16:08, Aleksa Sarai wrote:
if (unlikely(usize < OPEN_HOW_SIZE_VER0)) return -EINVAL;
- if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
Is PAGE_SIZE significant here? If there is a need to enforce a limit, I would expect this to be the same regardless of kernel configuration, since the structure layout is also independent of the configuration.
PAGE_SIZE is what clone3, perf_event_open, sched_setattr, bpf, etc all use. The idea was that PAGE_SIZE is the absolute limit of any reasonable extensible structure size because we are never going to have argument structures that are larger than a page (I think this was discussed in the original copy_struct_from_user() patchset thread in late 2019, but I can't find the reference at the moment.)
I simply forgot to add this when I first submitted openat2, the original intention was to just match the other syscalls.
Ok, I see. I guess it makes sense to keep this one consistent with the other ones, but we may want to revisit this in the future and come up with something that is independent of CONFIG_PAGE_SIZE.
Where is the current -EFAULT for users passing more than a page? I only see it for reads beyond the VMA, but not e.g. when checking terabytes of zero pages from an anonymous mapping.
I meant that we in practice return -EFAULT if you pass a really large size (because you end up running off the end of mapped memory). There is no explicit -EFAULT for large sizes, which is exactly the problem. :P
Got it, thanks.
Arnd
linux-stable-mirror@lists.linaro.org