This patchset is being developed here: https://github.com/cyphar/linux/tree/openat2/master
Patch changelog:
 v15:
  * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
  * Split out patches for each individual LOOKUP flag.
  * Reword commit messages to give more background information about the
    series, as well as mention the semantics of each flag in more detail.
 v14: https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/
      https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com
 v13: https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/
 v12: https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/
 v11: https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/
      https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/
 v10: https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/
 v09: https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/
 v08: https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/
 v07: https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/
 v06: https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/
 v05: https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/
 v04: https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/
 v03: https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/
 v02: https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/
 v01: https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/
For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1].
This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum project[5]) with a few additions and changes made based on the previous discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides several other improvements to the openat(2) interface (see the patch description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. Magic-link traversal which implies a vfsmount jump is also blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name.
It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it will *not* fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional work to protect against various races that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because magic-links can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT.
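To make the difference between LOOKUP_BENEATH and LOOKUP_IN_ROOT concrete, here is a purely lexical toy model in userspace C. It is only a sketch: the names sk_scope and sk_scoped_depth are hypothetical, and it ignores symlinks, mounts and the rename races that the real implementation has to handle. It only shows that an escaping ".." or an absolute path is an error (-EXDEV) under BENEATH but is clamped to the starting point under IN_ROOT.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical mode names for this toy model; they are not kernel identifiers. */
enum sk_scope { SK_BENEATH, SK_IN_ROOT };

/*
 * Lexically walk a '/'-separated path and return the resulting depth below
 * the starting directory (>= 0), or -EXDEV when SK_BENEATH sees an escape.
 * Symlinks, mounts and rename races are deliberately not modelled.
 */
int sk_scoped_depth(const char *path, enum sk_scope scope)
{
	const char *p = path;
	int depth = 0;

	assert(path != NULL);

	/* An absolute path is an escape for BENEATH; IN_ROOT re-roots it. */
	if (*p == '/' && scope == SK_BENEATH)
		return -EXDEV;

	while (*p != '\0') {
		size_t len;

		while (*p == '/')
			p++;
		len = strcspn(p, "/");
		if (len == 0)
			break;

		if (len == 2 && p[0] == '.' && p[1] == '.') {
			if (depth > 0)
				depth--;
			else if (scope == SK_BENEATH)
				return -EXDEV;
			/* SK_IN_ROOT: ".." at the root stays at the root. */
		} else if (!(len == 1 && p[0] == '.')) {
			depth++;
		}
		p += len;
	}
	return depth;
}
```

For example, "a/../../etc" is rejected in the BENEATH model, while in the IN_ROOT model it resolves to "etc" directly under the starting directory.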
The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[7] when opening paths in a potentially malicious container. There is a long list of CVEs that could have been mitigated by having RESOLVE_IN_ROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on libpathrs[8] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes through stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZ...
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goog...
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goog...
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<--------------------------------------------------------------------------- OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME openat2 - open and possibly create a file (extended)
SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION The openat2() system call opens the file specified by pathname. If the specified file does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (or the current working directory of the calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its functionality. Rather than taking a single flag argument, an extensible structure (how) is passed instead to allow for future extensions. size must be set to sizeof(struct open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES for more detail on how extensions are handled.)
The open_how structure The following structure indicates how pathname should be opened, and acts as a superset of the flag and mode arguments to openat(2).
    struct open_how {
        __aligned_u64 flags;         /* O_* flags. */
        __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
        __u16         __padding[3];  /* Must be zeroed. */
        __aligned_u64 resolve;       /* RESOLVE_* flags. */
    };
Any future extensions to openat2() will be implemented as new fields appended to the above structure (or through reuse of pre-existing padding space), with the zero value of the new fields acting as though the extension were not present.
The meaning of each field is as follows:
flags The file creation and status flags to use for this operation. All of the O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting flags in flags.
mode File mode for the new file, with identical semantics to the mode argument to openat(2). However, unlike openat(2), it is an error to provide openat2() with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not contain O_CREAT or O_TMPFILE.
resolve Change how the components of pathname will be resolved (see path_resolution(7) for background information.) The primary use case for these flags is to allow trusted programs to restrict how untrusted paths (or paths inside untrusted directories) are resolved. The full list of resolve flags is given below.
RESOLVE_NO_XDEV Disallow traversal of mount points during path resolution (including all bind mounts).
Users of this flag are encouraged to make its use configurable (unless it is used for a specific security purpose), as bind mounts are very widely used by end-users. Setting this flag indiscriminately for all uses of openat2() may result in spurious errors on previously-functional systems.
RESOLVE_NO_SYMLINKS Disallow resolution of symbolic links during path resolution. This option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (unless it is used for a specific security purpose), as symbolic links are very widely used by end-users. Setting this flag indiscriminately for all uses of openat2() may result in spurious errors on previously-functional systems.
RESOLVE_NO_MAGICLINKS Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the magic link will be returned.
Magic-links are symbolic link-like objects that are most notably found in proc(5) (examples include /proc/[pid]/exe and /proc/[pid]/fd/*.) Due to the potential danger of unknowingly opening these magic links, it may be preferable for users to disable their resolution entirely (see symlink(7) for more details.)
RESOLVE_BENEATH Do not permit the path resolution to succeed if any component of the resolution is not a descendant of the directory indicated by dirfd. This results in absolute symbolic links (and absolute values of pathname) being rejected.
Currently, this flag also disables magic link resolution. However, this may change in the future. The caller should explicitly specify RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT Treat dirfd as the root directory while resolving pathname (as though the user called chroot(2) with dirfd as the argument.) Absolute symbolic links and ".." path components will be scoped to dirfd. If pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root permanently for a process), RESOLVE_IN_ROOT allows a program to efficiently restrict path resolution for only certain operations. It also has several hardening features (such as detecting escape attempts during ".." resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However, this may change in the future. The caller should explicitly specify RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE On success, a new file descriptor is returned. On error, -1 is returned, and errno is set appropriately.
ERRORS The set of errors returned by openat2() includes all of the errors returned by openat(2), as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see the "Extensibility" section of the NOTES for more detail on how extensions are handled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could not ensure that a ".." component didn't escape (due to a race condition or potential attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic link.
VERSIONS openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES Glibc does not provide a wrapper for this system call; call it using syscall(2).
Extensibility In order to allow for struct open_how to be extended in future kernel revisions, openat2() requires userspace to specify the size of the struct open_how structure they are passing. By providing this information, it is possible for openat2() to provide both forwards- and backwards-compatibility -- with size acting as an implicit version number (because new extension fields will always be appended, the size will always increase.) This extensibility design is very similar to other system calls such as sched_setattr(2), perf_event_open(2), and clone3(2).
If we let usize be the size of the structure according to userspace and ksize be the size of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used verbatim.
* If ksize is larger than usize, then there are some extensions the kernel supports which the userspace program is unaware of. Because all extensions must have their zero values be a no-op, the kernel treats all of the extension fields not set by userspace to have zero values. This provides backwards-compatibility.
* If ksize is smaller than usize, then there are some extensions which the userspace program is aware of but the kernel does not support. Because all extensions must have their zero values be a no-op, the kernel can safely ignore the unsupported extension fields if they are all-zero. If any unsupported extension fields are non-zero, then -1 is returned and errno is set to E2BIG. This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of extensions. However, if a userspace program wishes to determine what extensions the running kernel supports, it may conduct a binary search on size (to find the largest value which doesn't produce an error of E2BIG.)
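The three usize/ksize cases can be sketched as a plain userspace function. This is a model of the rule described above, not the kernel's copy routine: sk_copy_struct is a hypothetical name, both buffers are ordinary memory, and the separate EINVAL check for a size smaller than any known version of the struct is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-provided struct of usize bytes into a buffer of ksize bytes.
 *  - usize == ksize: copied verbatim.
 *  - usize <  ksize: the fields the caller does not know about are zeroed.
 *  - usize >  ksize: allowed only if every byte past ksize is zero,
 *                    otherwise -E2BIG.
 */
int sk_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t common = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Unknown trailing fields must all be zero to be safely ignored. */
	for (i = common; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	memcpy(dst, src, common);
	/* Extension fields the caller did not set behave as zero. */
	memset((unsigned char *)dst + common, 0, ksize - common);
	return 0;
}
```

A caller built against a newer header can therefore pass its larger struct to an older kernel without special handling, as long as it leaves the newer fields zeroed.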
SEE ALSO openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2) --8<---------------------------------------------------------------------------
Aleksa Sarai (9):
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  open: introduce openat2(2) syscall
  selftests: add openat2(2) selftests
  Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED

 CREDITS                                      |   4 +-
 Documentation/filesystems/path-lookup.rst    |  18 +-
 arch/alpha/kernel/syscalls/syscall.tbl       |   1 +
 arch/arm/tools/syscall.tbl                   |   1 +
 arch/arm64/include/asm/unistd.h              |   2 +-
 arch/arm64/include/asm/unistd32.h            |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl        |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl        |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl  |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl    |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl    |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl    |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl      |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl     |   1 +
 arch/s390/kernel/syscalls/syscall.tbl        |   1 +
 arch/sh/kernel/syscalls/syscall.tbl          |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl       |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl       |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl       |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl      |   1 +
 fs/namei.c                                   | 176 +++++-
 fs/open.c                                    | 149 +++--
 include/linux/fcntl.h                        |  12 +-
 include/linux/namei.h                        |  11 +
 include/linux/syscalls.h                     |   3 +
 include/uapi/asm-generic/unistd.h            |   5 +-
 include/uapi/linux/fcntl.h                   |  41 ++
 tools/testing/selftests/Makefile             |   1 +
 tools/testing/selftests/openat2/.gitignore   |   1 +
 tools/testing/selftests/openat2/Makefile     |   8 +
 tools/testing/selftests/openat2/helpers.c    | 109 ++++
 tools/testing/selftests/openat2/helpers.h    | 107 ++++
 .../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
 .../selftests/openat2/rename_attack_test.c   | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
 35 files changed, 1591 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9
/* Background. */ Userspace cannot easily resolve a path without resolving symlinks, and would have to manually resolve each path component with O_PATH and O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw up (resulting in possible security bugs). Linus has mentioned that Git has a particular need for this kind of flag[1]. It also resolves a fairly long-standing perceived deficiency in O_NOFOLLOW -- that it only blocks the opening of trailing symlinks.
This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a variation on David Drysdale's O_BENEATH patchset[3], which in turn was based on the Capsicum project[4]).
/* Userspace API. */ LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).
/* Semantics. */ Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_NO_SYMLINKS applies to all components of the path.
With LOOKUP_NO_SYMLINKS, any symlink path component encountered during path resolution will yield -ELOOP. If the trailing component is a symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW will not error out and will instead provide a handle to the trailing symlink -- without resolving it.
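The semantics above can be illustrated with a toy userspace model in which each path component is already known to be a symlink or not. The struct and function names are hypothetical, and trailing_nofollow stands in for opening with O_PATH|O_NOFOLLOW; nothing here touches a real filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* One already-looked-up path component in this toy model. */
struct sk_component {
	const char *name;
	bool is_symlink;
};

/*
 * Return 0 if the walk is permitted under the no-symlinks rule, -ELOOP
 * otherwise. With trailing_nofollow set, a symlink in the final position
 * is handed back as-is instead of being followed, so it is not an error.
 */
int sk_walk_no_symlinks(const struct sk_component *c, size_t n,
			bool trailing_nofollow)
{
	size_t i;

	assert(c != NULL || n == 0);

	for (i = 0; i < n; i++) {
		bool trailing = (i + 1 == n);

		if (!c[i].is_symlink)
			continue;
		if (trailing && trailing_nofollow)
			continue;
		return -ELOOP;
	}
	return 0;
}
```

This mirrors why the check in the patch can live in get_link(): a trailing symlink that is not being followed never reaches that function.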
/* Testing. */ LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7...
[2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goog...
[4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goog...
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 3 +++
 include/linux/namei.h | 3 +++
 2 files changed, 6 insertions(+)
diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..4e85d6fa4048 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1045,6 +1045,9 @@ const char *get_link(struct nameidata *nd)
 	int error;
 	const char *res;
 
+	if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+		return ERR_PTR(-ELOOP);
+
 	if (!(nd->flags & LOOKUP_RCU)) {
 		touch_atime(&last->link);
 		cond_resched();
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..ee2e35af387f 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_ROOT_GRABBED	0x0008
 
+/* Scoping flags for lookup. */
+#define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
/* Background. */ There has always been a special class of symlink-like objects in procfs (and a few other pseudo-filesystems) which allow for non-lexical resolution of paths using nd_jump_link(). These "magic-links" do not follow traditional mount namespace boundaries, and have been used consistently in container escape attacks because they can be used to trick unsuspecting privileged processes into resolving unexpected paths.
It is also non-trivial for userspace to unambiguously avoid resolving magic-links, because they do not have a reliable indication that they are a magic-link (in order to verify them you'd have to manually open the path given by readlink(2) and then verify that the two file descriptors reference the same underlying file, which is plagued with possible race conditions or supplementary attack scenarios).
It would therefore be very helpful for userspace to be able to avoid these symlinks easily, thus hopefully removing a tool from attackers' toolboxes.
This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation on David Drysdale's O_BENEATH patchset[2], which in turn was based on the Capsicum project[3]).
/* Userspace API. */ LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).
/* Semantics. */ Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_NO_MAGICLINKS applies to all components of the path.
With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.
LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.
/* Testing. */ LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goog...
[3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goog...

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 7 ++++++-
 include/linux/namei.h | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 4e85d6fa4048..1f0d871199e5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -866,7 +866,7 @@ void nd_jump_link(struct path *path)
 
 	nd->path = *path;
 	nd->inode = nd->path.dentry->d_inode;
-	nd->flags |= LOOKUP_JUMPED;
+	nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
 }
 
 static inline void put_link(struct nameidata *nd)
@@ -1063,6 +1063,7 @@ const char *get_link(struct nameidata *nd)
 		return ERR_PTR(error);
 
 	nd->last_type = LAST_BIND;
+	nd->flags &= ~LOOKUP_MAGICLINK_JUMPED;
 	res = READ_ONCE(inode->i_link);
 	if (!res) {
 		const char * (*get)(struct dentry *, struct inode *,
@@ -1078,6 +1079,10 @@ const char *get_link(struct nameidata *nd)
 		} else {
 			res = get(dentry, inode, &last->done);
 		}
+		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
+			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+				return ERR_PTR(-ELOOP);
+		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
 	}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ee2e35af387f..a8b3f93338da 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -38,9 +38,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_JUMPED		0x1000
 #define LOOKUP_ROOT		0x2000
 #define LOOKUP_ROOT_GRABBED	0x0008
+#define LOOKUP_MAGICLINK_JUMPED	0x10000
 
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
+#define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
 
 extern int path_pts(struct path *path);
On Tue, Nov 05, 2019 at 08:05:46PM +1100, Aleksa Sarai wrote:
> @@ -1078,6 +1079,10 @@ const char *get_link(struct nameidata *nd)
>  		} else {
>  			res = get(dentry, inode, &last->done);
>  		}
> +		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> +			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> +				return ERR_PTR(-ELOOP);
> +		}
Minor nit - the first check probably wants unlikely() more than the second one; it's probably noise anyway, but most of the symlinks traversed are not going to be procfs ones, so you get test + branch taken most of the time.
OTOH, that just might compile into
	fetch nd->flags
	and with LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS
	compare with the same constant
	unlikely branch when equal
Anyway, that's no more than a minor nit and can be dealt with later (if at all)
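The single-compare form described in the reply can be written out in plain C to check that it is equivalent to the nested test. This is a userspace sketch only: the two flag values are copied from the include/linux/namei.h hunk in this patch, while the function names are hypothetical and unlikely() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the include/linux/namei.h hunk in this patch. */
#define LOOKUP_MAGICLINK_JUMPED	0x10000
#define LOOKUP_NO_MAGICLINKS	0x040000

/* The combined test below relies on the two flags being distinct bits. */
_Static_assert((LOOKUP_MAGICLINK_JUMPED & LOOKUP_NO_MAGICLINKS) == 0,
	       "flags must not overlap");

/* Nested form, as written in the patch. */
bool sk_blocked_nested(unsigned int flags)
{
	if (flags & LOOKUP_MAGICLINK_JUMPED) {
		if (flags & LOOKUP_NO_MAGICLINKS)
			return true;
	}
	return false;
}

/* Single and-then-compare form, as the compiler might emit it. */
bool sk_blocked_masked(unsigned int flags)
{
	const unsigned int both = LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS;

	return (flags & both) == both;
}
```

Both functions return true only when both bits are set, so the choice between them is purely about which branch the compiler should treat as the common case.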
/* Background. */ The need to contain path operations within a mountpoint has been a long-standing usecase that userspace has historically implemented manually with liberal usage of stat(). find, rsync, tar and many other programs implement these semantics -- but it'd be much simpler to have a fool-proof way of refusing to open a path if it crosses a mountpoint.
This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation on David Drysdale's O_BENEATH patchset[2], which in turn was based on the Capsicum project[3]).
/* Userspace API. */ LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).
/* Semantics. */ Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_NO_XDEV applies to all components of the path.
With LOOKUP_NO_XDEV, any path component which crosses a mount-point during path resolution (including "..") will yield an -EXDEV. Absolute paths, absolute symlinks, and magic-links will only yield an -EXDEV if the jump involved changing mount-points.
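As a rough illustration of this rule, the walk can be modelled as a sequence of mount identifiers, one per resolution step. The function name and the idea of representing mounts as small integers are assumptions made for this sketch; the real code compares vfsmount pointers during the walk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * mnt_ids[0] is the mount the walk starts on; mnt_ids[i] is the mount the
 * walk is on after step i (a normal component, "..", or a link jump).
 * Any step that lands on a different mount than the previous one is a
 * mount-point crossing and yields -EXDEV.
 */
int sk_walk_no_xdev(const int *mnt_ids, size_t n)
{
	size_t i;

	assert(mnt_ids != NULL || n == 0);

	for (i = 1; i < n; i++)
		if (mnt_ids[i] != mnt_ids[i - 1])
			return -EXDEV;
	return 0;
}
```

Note that a jump (absolute symlink or magic-link) which stays on the same mount is just another step with an unchanged identifier, which is why it is permitted.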
/* Testing. */ LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goog...
[3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goog...

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 34 ++++++++++++++++++++++++++++++----
 include/linux/namei.h |  1 +
 2 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 1f0d871199e5..b73ee1601bd4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -504,6 +504,9 @@ struct nameidata {
 	struct filename	*name;
 	struct nameidata *saved;
 	struct inode	*link_inode;
+	struct {
+		bool same_mnt;
+	} last_magiclink;
 	unsigned	root_seq;
 	int		dfd;
 } __randomize_layout;
@@ -837,6 +840,11 @@ static inline void path_to_nameidata(const struct path *path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+	if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+		/* Absolute path arguments to path_init() are allowed. */
+		if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+			return -EXDEV;
+	}
 	if (nd->flags & LOOKUP_RCU) {
 		struct dentry *d;
 		nd->path = nd->root;
@@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
 void nd_jump_link(struct path *path)
 {
 	struct nameidata *nd = current->nameidata;
+
+	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
 	path_put(&nd->path);
 
 	nd->path = *path;
@@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
 		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
 			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
 				return ERR_PTR(-ELOOP);
+			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+				if (!nd->last_magiclink.same_mnt)
+					return ERR_PTR(-EXDEV);
+			}
 		}
 		if (IS_ERR_OR_NULL(res))
 			return res;
@@ -1271,12 +1285,16 @@ static int follow_managed(struct path *path, struct nameidata *nd)
 		break;
 	}
 
-	if (need_mntput && path->mnt == mnt)
-		mntput(path->mnt);
+	if (need_mntput) {
+		if (path->mnt == mnt)
+			mntput(path->mnt);
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			ret = -EXDEV;
+		else
+			nd->flags |= LOOKUP_JUMPED;
+	}
 	if (ret == -EISDIR || !ret)
 		ret = 1;
-	if (need_mntput)
-		nd->flags |= LOOKUP_JUMPED;
 	if (unlikely(ret < 0))
 		path_put_conditional(path, nd);
 	return ret;
@@ -1333,6 +1351,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		mounted = __lookup_mnt(path->mnt, path->dentry);
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			return false;
 		path->mnt = &mounted->mnt;
 		path->dentry = mounted->mnt.mnt_root;
 		nd->flags |= LOOKUP_JUMPED;
@@ -1379,6 +1399,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 				return -ECHILD;
 			if (&mparent->mnt == nd->path.mnt)
 				break;
+			if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+				return -EXDEV;
 			/* we know that mountpoint was pinned */
 			nd->path.dentry = mountpoint;
 			nd->path.mnt = &mparent->mnt;
@@ -1393,6 +1415,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			return -ECHILD;
 		if (!mounted)
 			break;
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			return -EXDEV;
 		nd->path.mnt = &mounted->mnt;
 		nd->path.dentry = mounted->mnt.mnt_root;
 		inode = nd->path.dentry->d_inode;
@@ -1491,6 +1515,8 @@ static int follow_dotdot(struct nameidata *nd)
 		}
 		if (!follow_up(&nd->path))
 			break;
+		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+			return -EXDEV;
 	}
 	follow_mount(&nd->path);
 	nd->inode = nd->path.dentry->d_inode;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a8b3f93338da..6105c8a59fc8 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -43,6 +43,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS	0x020000 /* No symlink crossing. */
 #define LOOKUP_NO_MAGICLINKS	0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */
+#define LOOKUP_NO_XDEV		0x080000 /* No mountpoint crossing. */
 
 extern int path_pts(struct path *path);
On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
>  void nd_jump_link(struct path *path)
>  {
>  	struct nameidata *nd = current->nameidata;
> +
> +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
>  	path_put(&nd->path);
>  
>  	nd->path = *path;
> @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
>  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
>  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
>  				return ERR_PTR(-ELOOP);
> +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> +				if (!nd->last_magiclink.same_mnt)
> +					return ERR_PTR(-EXDEV);
> +			}
>  		}
Ugh... Wouldn't it be better to take that logics (some equivalent thereof) into nd_jump_link()? Or just have nd_jump_link() return an error...
I mean, look at the callers of nd_jump_link().

static const char *policy_get_link(struct dentry *dentry,
				   struct inode *inode,
				   struct delayed_call *done)
{
	struct aa_ns *ns;
	struct path path;

	if (!dentry)
		return ERR_PTR(-ECHILD);
	ns = aa_get_current_ns();
	path.mnt = mntget(aafs_mnt);
	path.dentry = dget(ns_dir(ns));
	nd_jump_link(&path);
	aa_put_ns(ns);

	return NULL;
}

- very close to the end of ->get_link() instance.

static const char *proc_pid_get_link(struct dentry *dentry,
				     struct inode *inode,
				     struct delayed_call *done)
{
	struct path path;
	int error = -EACCES;

	if (!dentry)
		return ERR_PTR(-ECHILD);

	/* Are we allowed to snoop on the tasks file descriptors? */
	if (!proc_fd_access_allowed(inode))
		goto out;

	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
	if (error)
		goto out;

	nd_jump_link(&path);
	return NULL;
out:
	return ERR_PTR(error);
}

Ditto.

static const char *proc_ns_get_link(struct dentry *dentry,
				    struct inode *inode,
				    struct delayed_call *done)
{
	const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
	struct task_struct *task;
	struct path ns_path;
	void *error = ERR_PTR(-EACCES);

	if (!dentry)
		return ERR_PTR(-ECHILD);

	task = get_proc_task(inode);
	if (!task)
		return error;

	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
		error = ns_get_path(&ns_path, task, ns_ops);
		if (!error)
			nd_jump_link(&ns_path);
	}
	put_task_struct(task);
	return error;
}

The same. And that's it - there's no more of them. So how about this in the beginning of the series, then having your magiclink error handling done in nd_jump_link()?
diff --git a/fs/namei.c b/fs/namei.c index 671c3c1a3425..8ec924813c30 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -859,7 +859,7 @@ static int nd_jump_root(struct nameidata *nd) * Helper to directly jump to a known parsed path from ->get_link, * caller must have taken a reference to path beforehand. */ -void nd_jump_link(struct path *path) +const char *nd_jump_link(struct path *path) { struct nameidata *nd = current->nameidata; path_put(&nd->path); @@ -867,6 +867,7 @@ void nd_jump_link(struct path *path) nd->path = *path; nd->inode = nd->path.dentry->d_inode; nd->flags |= LOOKUP_JUMPED; + return NULL; }
static inline void put_link(struct nameidata *nd) diff --git a/fs/proc/base.c b/fs/proc/base.c index ebea9501afb8..ac4e57a3dfa5 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1626,8 +1626,7 @@ static const char *proc_pid_get_link(struct dentry *dentry, if (error) goto out;
- nd_jump_link(&path); - return NULL; + return nd_jump_link(&path); out: return ERR_PTR(error); } diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c index dd2b35f78b09..dde0c501b2f3 100644 --- a/fs/proc/namespaces.c +++ b/fs/proc/namespaces.c @@ -54,7 +54,7 @@ static const char *proc_ns_get_link(struct dentry *dentry, if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) { error = ns_get_path(&ns_path, task, ns_ops); if (!error) - nd_jump_link(&ns_path); + error = nd_jump_link(&ns_path); } put_task_struct(task); return error; diff --git a/include/linux/namei.h b/include/linux/namei.h index 397a08ade6a2..f3e8438e5631 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -68,7 +68,7 @@ extern int follow_up(struct path *); extern struct dentry *lock_rename(struct dentry *, struct dentry *); extern void unlock_rename(struct dentry *, struct dentry *);
-extern void nd_jump_link(struct path *path); +extern const char *nd_jump_link(struct path *path);
static inline void nd_terminate_link(void *name, size_t len, size_t maxlen) { diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c index 45d13b6462aa..98aef94b4777 100644 --- a/security/apparmor/apparmorfs.c +++ b/security/apparmor/apparmorfs.c @@ -2453,18 +2453,16 @@ static const char *policy_get_link(struct dentry *dentry, struct inode *inode, struct delayed_call *done) { - struct aa_ns *ns; - struct path path; - - if (!dentry) - return ERR_PTR(-ECHILD); - ns = aa_get_current_ns(); - path.mnt = mntget(aafs_mnt); - path.dentry = dget(ns_dir(ns)); - nd_jump_link(&path); - aa_put_ns(ns); - - return NULL; + const char *err = ERR_PTR(-ECHILD); + + if (dentry) { + struct aa_ns *ns = aa_get_current_ns(); + struct path path = {.mnt = mntget(aafs_mnt), + .dentry = ns_dir(ns)}; + err = nd_jump_link(&path); + aa_put_ns(ns); + } + return err; }
static int policy_readlink(struct dentry *dentry, char __user *buffer,
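For readers following along outside the kernel tree, the reason a `const char *` return value is enough to carry an error can be sketched in userspace. The helpers below are stand-ins modelled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() from <linux/err.h>; model_jump_link() and model_get_link() are hypothetical names used only for illustration and are not the kernel functions themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	/* Only small negative errno values can be encoded in a pointer. */
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical model of an nd_jump_link() that may fail.
 * refuse_errno == 0 means the jump succeeded (NULL is returned);
 * otherwise the positive errno is returned as an encoded pointer.
 */
static const char *model_jump_link(int refuse_errno)
{
	if (refuse_errno)
		return ERR_PTR(-refuse_errno);
	return NULL;
}

/* Hypothetical ->get_link() instance: it simply forwards the result. */
static const char *model_get_link(int have_dentry, int refuse_errno)
{
	if (!have_dentry)
		return ERR_PTR(-ECHILD);	/* RCU-walk: ask for a retry */
	return model_jump_link(refuse_errno);
}
```

Because every caller listed above calls nd_jump_link() immediately before returning, forwarding its return value is close to a mechanical change in each ->get_link() instance.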
On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> >  void nd_jump_link(struct path *path)
> >  {
> >  	struct nameidata *nd = current->nameidata;
> > +
> > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> >  	path_put(&nd->path);
> >
> >  	nd->path = *path;
> > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> >  				return ERR_PTR(-ELOOP);
> > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > +				if (!nd->last_magiclink.same_mnt)
> > +					return ERR_PTR(-EXDEV);
> > +			}
> >  		}
>
> Ugh... Wouldn't it be better to take that logics (some equivalent thereof) into nd_jump_link()? Or just have nd_jump_link() return an error...
This could be done, but the reason for stashing it away in last_magiclink is because of the future magic-link re-opening patches which can't be implemented like that without putting the open_flags inside nameidata (which was decided to be too ugly a while ago).
My point being that I could implement it this way for this series, but I'd have to implement something like last_magiclink when I end up re-posting the magic-link stuff in a few weeks.
Looking at all the nd_jump_link() users, the other option is to just disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only thing allowing them permits is to resolve file descriptors that are pointing to the same procfs mount -- and it's unclear to me how useful that really is (apparmorfs and nsfs will always give -EXDEV because aafs_mnt and nsfs_mnt are internal kernel vfsmounts).
On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > >  void nd_jump_link(struct path *path)
> > >  {
> > >  	struct nameidata *nd = current->nameidata;
> > > +
> > > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > >  	path_put(&nd->path);
> > >
> > >  	nd->path = *path;
> > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > >  				return ERR_PTR(-ELOOP);
> > > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > +				if (!nd->last_magiclink.same_mnt)
> > > +					return ERR_PTR(-EXDEV);
> > > +			}
> > >  		}
> >
> > Ugh... Wouldn't it be better to take that logics (some equivalent thereof) into nd_jump_link()? Or just have nd_jump_link() return an error...
>
> This could be done, but the reason for stashing it away in last_magiclink is because of the future magic-link re-opening patches which can't be implemented like that without putting the open_flags inside nameidata (which was decided to be too ugly a while ago).
>
> My point being that I could implement it this way for this series, but I'd have to implement something like last_magiclink when I end up re-posting the magic-link stuff in a few weeks.
>
> Looking at all the nd_jump_link() users, the other option is to just disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only thing allowing them permits is to resolve file descriptors that are pointing to the same procfs mount -- and it's unclear to me how useful that really is (apparmorfs and nsfs will always give -EXDEV because aafs_mnt and nsfs_mnt are internal kernel vfsmounts).
I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED) out of the get_link(). If you want to generate some error if nd_jump_link() has been called, just do it right there. The fewer pieces of state need to be carried around, the better...
And as for opening them... Why would you need full open_flags in there? Details, please...
On 2019-11-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> > On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > > >  void nd_jump_link(struct path *path)
> > > >  {
> > > >  	struct nameidata *nd = current->nameidata;
> > > > +
> > > > +	nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > > >  	path_put(&nd->path);
> > > >
> > > >  	nd->path = *path;
> > > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > > >  		if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > > >  			if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > > >  				return ERR_PTR(-ELOOP);
> > > > +			if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > > +				if (!nd->last_magiclink.same_mnt)
> > > > +					return ERR_PTR(-EXDEV);
> > > > +			}
> > > >  		}
> > >
> > > Ugh... Wouldn't it be better to take that logics (some equivalent thereof) into nd_jump_link()? Or just have nd_jump_link() return an error...
> >
> > This could be done, but the reason for stashing it away in last_magiclink is because of the future magic-link re-opening patches which can't be implemented like that without putting the open_flags inside nameidata (which was decided to be too ugly a while ago).
> >
> > My point being that I could implement it this way for this series, but I'd have to implement something like last_magiclink when I end up re-posting the magic-link stuff in a few weeks.
> >
> > Looking at all the nd_jump_link() users, the other option is to just disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only thing allowing them permits is to resolve file descriptors that are pointing to the same procfs mount -- and it's unclear to me how useful that really is (apparmorfs and nsfs will always give -EXDEV because aafs_mnt and nsfs_mnt are internal kernel vfsmounts).
>
> I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED) out of the get_link(). If you want to generate some error if nd_jump_link() has been called, just do it right there. The fewer pieces of state need to be carried around, the better...
Sure, I can make nd_jump_link() give -ELOOP and drop the current need for LOOKUP_MAGICLINK_JUMPED -- if necessary we can re-add it for the magic-link reopening patches.
> And as for opening them... Why would you need full open_flags in there? Details, please...
I was referring to [1] which has been dropped from this series. I misspoke -- you don't need the full open_flags, you just need acc_mode in nameidata -- but from memory you (understandably) weren't in favour of that either because it further muddled the open semantics with namei.
So the solution I went with was to stash away the i_mode of the magiclink in nd->last_magiclink.mode (though to avoid a race which Jann found, you actually need to recalculate it when you call nd_jump_link() but that's a different topic) and then check it in trailing_magiclink().
However, I've since figured out that we need to restrict things like bind-mounts and truncate() because they can be used to get around the restrictions. I dropped that patch from this series so that I could work on implementing the restrictions for the other relevant VFS syscalls separately from openat2 (upgrade_mask will be re-added to open_how with those patches).
My point was that AFAICS we will either have to have nd->acc_mode (or something similar) or have nd->last_magiclink in order to implement the magic-link reopening hardening.
[1]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
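To make the direction agreed on above concrete, here is a minimal userspace sketch of a jump helper that performs the policy checks itself instead of stashing state for get_link() to inspect later. Everything in it is an assumption made for illustration: the MODEL_* flags, struct model_mount, struct model_nd and model_jump_link() are hypothetical stand-ins, and reference counting on the path being jumped to is deliberately not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in flag bits; the real LOOKUP_* values live in include/linux/namei.h. */
#define MODEL_NO_MAGICLINKS 0x1u
#define MODEL_NO_XDEV       0x2u

/* Hypothetical stand-in for a vfsmount; only identity matters here. */
struct model_mount {
	int id;
};

/* Hypothetical stand-in for nameidata. */
struct model_nd {
	unsigned int flags;
	const struct model_mount *mnt;	/* mount of the current position */
};

/*
 * Model of a jump helper that checks policy at the point of the jump.
 * Returns 0 and moves nd on success, or a negative errno with nd untouched.
 */
static int model_jump_link(struct model_nd *nd, const struct model_mount *target)
{
	assert(nd != NULL && target != NULL);

	/* Magic-link jumps refused outright. */
	if (nd->flags & MODEL_NO_MAGICLINKS)
		return -ELOOP;

	/* Jumps that would land on a different mount are refused. */
	if ((nd->flags & MODEL_NO_XDEV) && nd->mnt != target)
		return -EXDEV;

	nd->mnt = target;
	return 0;
}
```

With the check done at the jump itself, no same_mnt-style field needs to outlive the call. That is the "fewer pieces of state" argument, though it does not by itself address the later re-opening work discussed above.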
/* Background. */
There are many circumstances when userspace wants to resolve a path and ensure that it doesn't go outside of a particular root directory during resolution. Obvious examples include archive extraction tools, as well as other security-conscious userspace programs. FreeBSD spun out O_BENEATH from their Capsicum project[1,2], so it also seems reasonable to implement similar functionality for Linux.

This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a variation on David Drysdale's O_BENEATH patchset[4], which in turn was based on the Capsicum project[5]).

/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_BENEATH applies to all components of the path.

With LOOKUP_BENEATH, any path component which attempts to "escape" the starting point of the filesystem lookup (the dirfd passed to openat) will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.

Due to a security concern brought up by Jann[6], any ".." path components are also blocked. This restriction will be lifted in a future patch, but requires more work to ensure that permitting ".." is done safely.

Magic-link jumps are also blocked, because they can beam the path lookup across the starting point. It would be possible to detect and block only the "bad" crossings with path_is_under() checks, but it's unclear whether it makes sense to permit magic-links at all. However, userspace is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that magic-link crossing is entirely disabled.
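The rules in the Semantics section can be approximated by a purely lexical check, which may help when reasoning about which pathnames would be refused. This is only a sketch under loud assumptions: model_beneath_check() is a hypothetical helper, it never touches the filesystem, and so it says nothing about symlinks or magic-links, which the kernel has to handle during the actual walk.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Purely lexical model of the LOOKUP_BENEATH rules described in this
 * commit message. Returns 0 if the pathname would be allowed by these
 * rules, -EXDEV otherwise.
 */
static int model_beneath_check(const char *path)
{
	assert(path != NULL);

	/* An absolute path tries to jump to the root: refused. */
	if (path[0] == '/')
		return -EXDEV;

	while (*path) {
		/* Length of the current component, up to the next '/'. */
		size_t len = strcspn(path, "/");

		/* Any ".." component is refused in this version of the series. */
		if (len == 2 && path[0] == '.' && path[1] == '.')
			return -EXDEV;

		/* Step over the component and any run of slashes after it. */
		path += len;
		while (*path == '/')
			path++;
	}
	return 0;
}
```

Note that names which merely contain dots, such as "..b" or "...", are ordinary components and pass the check.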
/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.

[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goog...
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goog...
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6x...
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 91 +++++++++++++++++++++++++++++++++++--------
 include/linux/namei.h |  4 ++
 2 files changed, 79 insertions(+), 16 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index b73ee1601bd4..54fdbdfbeb94 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -644,6 +644,14 @@ static bool legitimize_links(struct nameidata *nd)
static bool legitimize_root(struct nameidata *nd) { + /* + * For scoped-lookups (where nd->root has been zeroed), we need to + * restart the whole lookup from scratch -- because set_root() is wrong + * for these lookups (nd->dfd is the root, not the filesystem root). + */ + if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)) + return false; + /* Nothing to do if nd->root is zero or is managed by the VFS user. */ if (!nd->root.mnt || (nd->flags & LOOKUP_ROOT)) return true; nd->flags |= LOOKUP_ROOT_GRABBED; @@ -779,7 +787,11 @@ static int complete_walk(struct nameidata *nd) int status;
if (nd->flags & LOOKUP_RCU) { - if (!(nd->flags & LOOKUP_ROOT)) + /* + * We don't want to zero nd->root for scoped-lookups or + * externally-managed nd->root. + */ + if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED))) nd->root.mnt = NULL; if (unlikely(unlazy_walk(nd))) return -ECHILD; @@ -801,10 +813,18 @@ static int complete_walk(struct nameidata *nd) return status; }
-static void set_root(struct nameidata *nd) +static int set_root(struct nameidata *nd) { struct fs_struct *fs = current->fs;
+ /* + * Jumping to the real root in a scoped-lookup is a BUG in namei, but we + * still have to ensure it doesn't happen because it will cause a breakout + * from the dirfd. + */ + if (WARN_ON(nd->flags & LOOKUP_IS_SCOPED)) + return -ENOTRECOVERABLE; + if (nd->flags & LOOKUP_RCU) { unsigned seq;
@@ -817,6 +837,7 @@ static void set_root(struct nameidata *nd) get_fs_root(fs, &nd->root); nd->flags |= LOOKUP_ROOT_GRABBED; } + return 0; }
static void path_put_conditional(struct path *path, struct nameidata *nd) @@ -840,11 +861,18 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; if (unlikely(nd->flags & LOOKUP_NO_XDEV)) { /* Absolute path arguments to path_init() are allowed. */ if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt) return -EXDEV; } + if (!nd->root.mnt) { + int error = set_root(nd); + if (error) + return error; + } if (nd->flags & LOOKUP_RCU) { struct dentry *d; nd->path = nd->root; @@ -1096,15 +1124,17 @@ const char *get_link(struct nameidata *nd) if (!nd->last_magiclink.same_mnt) return ERR_PTR(-EXDEV); } + /* Not currently safe for scoped-lookups. */ + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) + return ERR_PTR(-EXDEV); } if (IS_ERR_OR_NULL(res)) return res; } if (*res == '/') { - if (!nd->root.mnt) - set_root(nd); - if (unlikely(nd_jump_root(nd))) - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); while (unlikely(*++res == '/')) ; } @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode;
while (1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1505,8 +1538,11 @@ static int path_parent_directory(struct path *path) static int follow_dotdot(struct nameidata *nd) { while(1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -1731,8 +1767,20 @@ static inline int may_lookup(struct nameidata *nd) static inline int handle_dots(struct nameidata *nd, int type) { if (type == LAST_DOTDOT) { - if (!nd->root.mnt) - set_root(nd); + int error = 0; + + /* + * Scoped-lookup flags resolving ".." is not currently safe -- + * races can cause our parent to have moved outside of the root + * and us to skip over it. + */ + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) + return -EXDEV; + if (!nd->root.mnt) { + error = set_root(nd); + if (error) + return error; + } if (nd->flags & LOOKUP_RCU) { return follow_dotdot_rcu(nd); } else @@ -2195,6 +2243,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) /* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) { + int error; const char *s = nd->name->name;
if (!*s) @@ -2227,11 +2276,12 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock); + + /* Figure out the starting path and root (if needed). */ if (*s == '/') { - set_root(nd); - if (likely(!nd_jump_root(nd))) - return s; - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); } else if (nd->dfd == AT_FDCWD) { if (flags & LOOKUP_RCU) { struct fs_struct *fs = current->fs; @@ -2247,7 +2297,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } - return s; } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); @@ -2272,8 +2321,18 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->inode = nd->path.dentry->d_inode; } fdput(f); - return s; } + /* For scoped-lookups we need to set the root to the dirfd as well. */ + if (flags & LOOKUP_IS_SCOPED) { + nd->root = nd->path; + if (flags & LOOKUP_RCU) { + nd->root_seq = nd->seq; + } else { + path_get(&nd->root); + nd->flags |= LOOKUP_ROOT_GRABBED; + } + } + return s; }
static const char *trailing_symlink(struct nameidata *nd) diff --git a/include/linux/namei.h b/include/linux/namei.h index 6105c8a59fc8..12f4f36835c2 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -2,6 +2,7 @@ #ifndef _LINUX_NAMEI_H #define _LINUX_NAMEI_H
+#include <linux/fs.h> #include <linux/kernel.h> #include <linux/path.h> #include <linux/fcntl.h> @@ -44,6 +45,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_SYMLINKS 0x020000 /* No symlink crossing. */ #define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_XDEV 0x080000 /* No mountpoint crossing. */ +#define LOOKUP_BENEATH 0x100000 /* No escaping from starting point. */ +/* LOOKUP_* flags which do scope-related checks based on the dirfd. */ +#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
extern int path_pts(struct path *path);
On Tue, Nov 05, 2019 at 08:05:48PM +1100, Aleksa Sarai wrote:
Minor nit here - I'd split "move the conditional call of set_root() into nd_jump_root()" into a separate patch before that one. Makes for fewer distractions in this one. I'd probably fold "and be ready for errors other than -ECHILD" into the same preliminary patch.
> +			/* Not currently safe for scoped-lookups. */
> +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> +				return ERR_PTR(-EXDEV);

Also a candidate for doing in nd_jump_link()...

> @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>  	struct inode *inode = nd->inode;
>
>  	while (1) {
> -		if (path_equal(&nd->path, &nd->root))
> +		if (path_equal(&nd->path, &nd->root)) {
> +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> +				return -EXDEV;
Umm... Are you sure it's not -ECHILD?
On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Minor nit here - I'd split "move the conditional call of set_root() into nd_jump_root()" into a separate patch before that one. Makes for fewer distractions in this one. I'd probably fold "and be ready for errors other than -ECHILD" into the same preliminary patch.

Will do.

> > +			/* Not currently safe for scoped-lookups. */
> > +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > +				return ERR_PTR(-EXDEV);
>
> Also a candidate for doing in nd_jump_link()...
>
> > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> >  	struct inode *inode = nd->inode;
> >
> >  	while (1) {
> > -		if (path_equal(&nd->path, &nd->root))
> > +		if (path_equal(&nd->path, &nd->root)) {
> > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > +				return -EXDEV;
>
> Umm... Are you sure it's not -ECHILD?
It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely a success would be in REF-walk if the parent components didn't already trigger an unlazy_walk() in RCU-walk.
I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in follow_dotdot_rcu()?
On 2019-11-13, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Minor nit here - I'd split "move the conditional call of set_root() into nd_jump_root()" into a separate patch before that one. Makes for fewer distractions in this one. I'd probably fold "and be ready for errors other than -ECHILD" into the same preliminary patch.
>
> Will do.
>
> > > +			/* Not currently safe for scoped-lookups. */
> > > +			if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > > +				return ERR_PTR(-EXDEV);
> >
> > Also a candidate for doing in nd_jump_link()...
> >
> > > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > >  	struct inode *inode = nd->inode;
> > >
> > >  	while (1) {
> > > -		if (path_equal(&nd->path, &nd->root))
> > > +		if (path_equal(&nd->path, &nd->root)) {
> > > +			if (unlikely(nd->flags & LOOKUP_BENEATH))
> > > +				return -EXDEV;
> >
> > Umm... Are you sure it's not -ECHILD?
>
> It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely a success would be in REF-walk if the parent components didn't already trigger an unlazy_walk() in RCU-walk.
>
> I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in follow_dotdot_rcu()?
Scratch the last question -- AFAICS we don't need to do that for LOOKUP_NO_XDEV because we check against mount_lock so it's very unlikely that -ECHILD will have any benefit.
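The -ECHILD versus -EXDEV question is easier to weigh with the retry pattern in front of you. The sketch below is a userspace model, not kernel code: model_lookup(), the model_walk_fn signature and the two demo walkers are hypothetical names. It only mirrors the general shape of the lookup entry points in fs/namei.c, which first attempt RCU-walk and repeat the walk in REF-walk when they get -ECHILD back (the additional -ESTALE retry is left out).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_LOOKUP_RCU 0x1u	/* stand-in for LOOKUP_RCU */

/* Hypothetical walker signature: returns 0 or a negative errno. */
typedef int (*model_walk_fn)(unsigned int flags, void *ctx);

/* Try the lockless walk first; fall back to REF-walk only on -ECHILD. */
static int model_lookup(model_walk_fn walk, void *ctx)
{
	int err;

	assert(walk != NULL);
	err = walk(MODEL_LOOKUP_RCU, ctx);
	if (err == -ECHILD)
		err = walk(0, ctx);	/* second pass, holding references */
	return err;
}

/* Demo walker: RCU-walk bails out with -ECHILD, REF-walk reports -EXDEV. */
static int demo_walk_soft(unsigned int flags, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return (flags & MODEL_LOOKUP_RCU) ? -ECHILD : -EXDEV;
}

/* Demo walker: reports -EXDEV directly from RCU-walk, so no retry happens. */
static int demo_walk_hard(unsigned int flags, void *ctx)
{
	int *calls = ctx;

	(void)flags;
	(*calls)++;
	return -EXDEV;
}
```

Returning -ECHILD from the RCU path costs a second walk but lets the decision be re-made with references held, whereas returning -EXDEV there makes the RCU-walk verdict final. Which one is appropriate depends on whether the condition being checked can be a false positive under RCU.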
/* Background. */
Container runtimes or other administrative management processes will often interact with root filesystems while in the host mount namespace, because the cost of doing a chroot(2) on every operation is too prohibitive (especially in Go, which cannot safely use vfork). However, a malicious program can trick the management process into doing operations on files outside of the root filesystem through careful crafting of symlinks.

Most programs that need this feature have attempted to make this process safe, by doing all of the path resolution in userspace (with symlinks being scoped to the root of the malicious root filesystem). Unfortunately, this method is prone to foot-guns and usually such implementations have subtle security bugs.

Thus, what userspace needs is a way to resolve a path as though it were in a chroot(2) -- with all absolute symlinks being resolved relative to the dirfd root (and ".." components being stuck under the dirfd root). It is much simpler and more straightforward to provide this functionality in-kernel (because it can be done far more cheaply and correctly).

More classical applications that also have this problem (which have their own potentially buggy userspace path sanitisation code) include web servers, archive extraction tools, network file servers, and so on.

/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_IN_ROOT applies to all components of the path.

With LOOKUP_IN_ROOT, any path component which attempts to cross the starting point of the pathname lookup (the dirfd passed to openat) will remain at the starting point. Thus, all absolute paths and symlinks will be scoped within the starting point.

There is a slight change in behaviour regarding pathnames -- if the pathname is absolute then the dirfd is still used as the root of resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious foot-guns, at the cost of a minor API inconsistency).

As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction will be lifted in a future patch, but requires more work to ensure that permitting ".." is done safely.

Magic-link jumps are also blocked, because they can beam the path lookup across the starting point. It would be possible to detect and block only the "bad" crossings with path_is_under() checks, but it's unclear whether it makes sense to permit magic-links at all. However, userspace is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that magic-link crossing is entirely disabled.

/* Testing. */
LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6x...
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c            | 15 ++++++++++++---
 include/linux/namei.h |  3 ++-
 2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 54fdbdfbeb94..a3d199a60708 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
- /* Figure out the starting path and root (if needed). */ - if (*s == '/') { + /* Absolute pathname -- fetch the root. */ + if (flags & LOOKUP_IN_ROOT) { + /* With LOOKUP_IN_ROOT, act as a relative path. */ + while (*s == '/') + s++; + } else if (*s == '/') { error = nd_jump_root(nd); if (unlikely(error)) return ERR_PTR(error); - } else if (nd->dfd == AT_FDCWD) { + return s; + } + + /* Relative pathname -- get the starting-point it is relative to. */ + if (nd->dfd == AT_FDCWD) { if (flags & LOOKUP_RCU) { struct fs_struct *fs = current->fs; unsigned seq; @@ -2322,6 +2330,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags) } fdput(f); } + /* For scoped-lookups we need to set the root to the dirfd as well. */ if (flags & LOOKUP_IS_SCOPED) { nd->root = nd->path; diff --git a/include/linux/namei.h b/include/linux/namei.h index 12f4f36835c2..96b374e08230 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -46,8 +46,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_XDEV 0x080000 /* No mountpoint crossing. */ #define LOOKUP_BENEATH 0x100000 /* No escaping from starting point. */ +#define LOOKUP_IN_ROOT 0x200000 /* Treat dirfd as %current->fs->root. */ /* LOOKUP_* flags which do scope-related checks based on the dirfd. */ -#define LOOKUP_IS_SCOPED LOOKUP_BENEATH +#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
extern int path_pts(struct path *path);
On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>
>  	nd->m_seq = read_seqbegin(&mount_lock);
>
> -	/* Figure out the starting path and root (if needed). */
> -	if (*s == '/') {
> +	/* Absolute pathname -- fetch the root. */
> +	if (flags & LOOKUP_IN_ROOT) {
> +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> +		while (*s == '/')
> +			s++;

Er... Why bother skipping slashes? I mean, not only link_path_walk() will skip them just fine, you are actually risking breakage in this:
		if (*s && unlikely(!d_can_lookup(dentry))) {
			fdput(f);
			return ERR_PTR(-ENOTDIR);
		}
which is downstream from there with your patch, AFAICS.
On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >
> >  	nd->m_seq = read_seqbegin(&mount_lock);
> >
> > -	/* Figure out the starting path and root (if needed). */
> > -	if (*s == '/') {
> > +	/* Absolute pathname -- fetch the root. */
> > +	if (flags & LOOKUP_IN_ROOT) {
> > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > +		while (*s == '/')
> > +			s++;
>
> Er... Why bother skipping slashes? I mean, not only link_path_walk() will skip them just fine, you are actually risking breakage in this:
> 		if (*s && unlikely(!d_can_lookup(dentry))) {
> 			fdput(f);
> 			return ERR_PTR(-ENOTDIR);
> 		}
> which is downstream from there with your patch, AFAICS.
I switched to stripping the slashes at your suggestion a few revisions ago[1], and had (wrongly) assumed we needed to handle "/" somehow in path_init(). But you're quite right about link_path_walk() -- and I'd be more than happy to drop it.
[1]: https://lore.kernel.org/lkml/20190712125552.GL17978@ZenIV.linux.org.uk/
On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >
> > >  	nd->m_seq = read_seqbegin(&mount_lock);
> > >
> > > -	/* Figure out the starting path and root (if needed). */
> > > -	if (*s == '/') {
> > > +	/* Absolute pathname -- fetch the root. */
> > > +	if (flags & LOOKUP_IN_ROOT) {
> > > +		/* With LOOKUP_IN_ROOT, act as a relative path. */
> > > +		while (*s == '/')
> > > +			s++;
> >
> > Er... Why bother skipping slashes? I mean, not only link_path_walk() will skip them just fine, you are actually risking breakage in this:
> > 		if (*s && unlikely(!d_can_lookup(dentry))) {
> > 			fdput(f);
> > 			return ERR_PTR(-ENOTDIR);
> > 		}
> > which is downstream from there with your patch, AFAICS.
>
> I switched to stripping the slashes at your suggestion a few revisions ago[1], and had (wrongly) assumed we needed to handle "/" somehow in path_init(). But you're quite right about link_path_walk() -- and I'd be more than happy to drop it.
That, IIRC, was about untangling the weirdness around multiple calls of dirfd_path_init() and basically went "we might want just strip the slashes in case of that flag very early in the entire thing, so that later the normal logics for absolute/relative would DTRT". Since your check is right next to checking for absolute pathnames (and not in the very beginning of path_init()), we might as well turn the check for absolute pathname into *s == '/' && !(flags & LOOKUP_IN_ROOT) and be done with that.
On 2019-11-13, Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
On 2019-11-13, Al Viro viro@zeniv.linux.org.uk wrote:
On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
@@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->m_seq = read_seqbegin(&mount_lock);
- /* Figure out the starting path and root (if needed). */
- if (*s == '/') {
- /* Absolute pathname -- fetch the root. */
- if (flags & LOOKUP_IN_ROOT) {
/* With LOOKUP_IN_ROOT, act as a relative path. */
while (*s == '/')
s++;
Er... Why bother skipping slashes? I mean, not only link_path_walk() will skip them just fine, you are actually risking breakage in this:
	if (*s && unlikely(!d_can_lookup(dentry))) {
		fdput(f);
		return ERR_PTR(-ENOTDIR);
	}
which is downstream from there with your patch, AFAICS.
I switched to stripping the slashes at your suggestion a few revisions ago[1], and had (wrongly) assumed we needed to handle "/" somehow in path_init(). But you're quite right about link_path_walk() -- and I'd be more than happy to drop it.
That, IIRC, was about untangling the weirdness around multiple calls of dirfd_path_init() and basically went "we might want just strip the slashes in case of that flag very early in the entire thing, so that later the normal logics for absolute/relative would DTRT".
Ah okay, I'd misunderstood the point you were making in that thread.
Since your check is right next to checking for absolute pathnames (and not in the very beginning of path_init()), we might as well turn the check for absolute pathname into *s == '/' && !(flags & LOOKUP_IN_ROOT) and be done with that.
Yup, agreed.
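The simplification agreed on above can be modelled outside the kernel roughly as follows. This is only a sketch: DEMO_LOOKUP_IN_ROOT and is_absolute_lookup() are hypothetical names invented for illustration, and the flag value is arbitrary rather than the kernel's.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's LOOKUP_IN_ROOT; the value is arbitrary. */
#define DEMO_LOOKUP_IN_ROOT 0x1u

/*
 * Returns true when the lookup should start from the process root.
 * With the in-root flag set, a leading '/' is deliberately not treated as
 * absolute: the walk starts at the dirfd, and the later path walk skips
 * the leading slashes on its own.
 */
static bool is_absolute_lookup(const char *s, unsigned int flags)
{
	return *s == '/' && !(flags & DEMO_LOOKUP_IN_ROOT);
}
```

With this shape there is no separate slash-stripping loop, so a path consisting only of "/" still reaches the normal relative-path logic when the flag is set.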
Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution (in the case of LOOKUP_BENEATH the resolution will still fail if ".." resolution would resolve a path outside of the root -- while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are still disallowed entirely[*].
As Jann explains[1,2], this patch (and the original no-".." restriction) is needed because there is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
    for (;;)
      renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
    for (;;)
      openat2(dirb, "b/c/../../etc/shadow",
              { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar (though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution and will return -EAGAIN for userspace to decide to either retry or abort the lookup. It should be noted that ".." is the weak point of chroot(2) -- walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root (except through a bind-mount or magic-link). There is also no other way for a directory's parent to change (which is the primary worry with ".." resolution here) other than a rename or MS_MOVE.
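The detection described above boils down to "sample a counter when the walk starts, compare it after each '..'". A toy, single-threaded model of that idea is sketched below. The demo_* names are hypothetical and this is not the kernel's seqlock_t (which also handles in-progress writers and memory ordering); it only illustrates why any intervening rename or mount forces -EAGAIN.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy sequence counter; a stand-in for mount_lock / rename_lock. */
struct demo_seq {
	unsigned int seq;
};

/* Reader: remember the counter value at the start of the walk. */
static unsigned int demo_read_begin(const struct demo_seq *s)
{
	return s->seq;
}

/* Reader: true if any writer completed since demo_read_begin(). */
static bool demo_read_retry(const struct demo_seq *s, unsigned int start)
{
	return s->seq != start;
}

/* Writer: a completed rename or mount bumps the counter. */
static void demo_write(struct demo_seq *s)
{
	s->seq++;
}

/*
 * Model of the check made after resolving "..": if either counter moved
 * since the walk began, the result cannot be trusted, so report -EAGAIN.
 */
static int demo_check_dotdot(const struct demo_seq *mount, unsigned int m_seq,
			     const struct demo_seq *rename, unsigned int r_seq)
{
	if (demo_read_retry(mount, m_seq) || demo_read_retry(rename, r_seq))
		return -EAGAIN;
	return 0;
}
```

Note that the model, like the patch, cannot tell whether the racing write touched the path being walked; any bump anywhere is treated as a possible escape.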
This is a first-pass implementation, where -EAGAIN will be returned if any rename or mount occurs anywhere on the host (in any namespace). This will result in spurious errors, but there isn't a satisfactory alternative (other than denying ".." altogether).
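Since the spurious -EAGAIN is left for userspace to handle, a caller would typically wrap the open in a bounded retry loop. The sketch below assumes a hypothetical callback (demo_open_fn) that performs one openat2() attempt and returns an fd or a negative errno; demo_flaky_open() is only a stand-in used to exercise the loop, not a real syscall wrapper.

```c
#include <errno.h>

/* Hypothetical callback type: returns an fd >= 0, or a negative errno. */
typedef int (*demo_open_fn)(void *ctx);

/*
 * Call open_once() until it stops failing with -EAGAIN or the attempt
 * budget runs out. Any other result is returned to the caller unchanged.
 */
static int demo_open_retry(demo_open_fn open_once, void *ctx, int max_tries)
{
	int ret = -EAGAIN;

	for (int i = 0; i < max_tries && ret == -EAGAIN; i++)
		ret = open_once(ctx);
	return ret;
}

/*
 * Example stand-in for a real openat2() attempt: fails with -EAGAIN
 * *ctx times, then "succeeds" with a made-up fd of 3.
 */
static int demo_flaky_open(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -EAGAIN;
	}
	return 3;
}
```

The bound matters: on a host with constant rename or mount activity an unbounded loop could spin indefinitely, so the caller should be prepared to abort or fall back once the budget is exhausted.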
One other possible alternative (which previous versions of this patch used) would be to check with path_is_under() if there was a racing rename or mount (after re-taking the relevant seqlocks). While this does work, it results in possible O(n*m) behaviour if there are many renames or mounts occurring *anywhere on the system*.
A variant of the above attack is included in the selftests for openat2(2) later in this patch series. I've run this test on several machines for several days and no instances of a breakout were detected. While this is not concrete proof that this is safe, when combined with the above argument it should lend some trustworthiness to this construction.
[*] It may be acceptable in the future to do a path_is_under() check (as with the alternative solution for "..") for magic-links after they are resolved. However this seems unlikely to be a feature that people *really* need -- it can be added later if it turns out a lot of people want it.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6x...
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAOR...

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namei.c | 42 +++++++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 13 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index a3d199a60708..174d69cf9084 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
 	struct path root;
 	struct inode *inode; /* path.dentry.d_inode */
 	unsigned int flags;
-	unsigned seq, m_seq;
+	unsigned seq, m_seq, r_seq;
 	int last_type;
 	unsigned depth;
 	int total_link_count;
@@ -1769,22 +1769,35 @@ static inline int handle_dots(struct nameidata *nd, int type)
 	if (type == LAST_DOTDOT) {
 		int error = 0;
 
-		/*
-		 * Scoped-lookup flags resolving ".." is not currently safe --
-		 * races can cause our parent to have moved outside of the root
-		 * and us to skip over it.
-		 */
-		if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
-			return -EXDEV;
 		if (!nd->root.mnt) {
 			error = set_root(nd);
 			if (error)
 				return error;
 		}
-		if (nd->flags & LOOKUP_RCU) {
-			return follow_dotdot_rcu(nd);
-		} else
-			return follow_dotdot(nd);
+		if (nd->flags & LOOKUP_RCU)
+			error = follow_dotdot_rcu(nd);
+		else
+			error = follow_dotdot(nd);
+		if (error)
+			return error;
+
+		if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+			bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+			bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+			/*
+			 * If there was a racing rename or mount along our
+			 * path, then we can't be sure that ".." hasn't jumped
+			 * above nd->root (and so userspace should retry or use
+			 * some fallback).
+			 *
+			 * In future we could do a path_is_under() check here
+			 * instead, but there are O(n*m) performance
+			 * considerations with such a setup.
+			 */
+			if (unlikely(m_retry || r_retry))
+				return -EAGAIN;
+		}
 	}
 	return 0;
 }
@@ -2254,6 +2267,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
 	nd->depth = 0;
+
+	nd->m_seq = read_seqbegin(&mount_lock);
+	nd->r_seq = read_seqbegin(&rename_lock);
+
 	if (flags & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
@@ -2275,7 +2292,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 
-	nd->m_seq = read_seqbegin(&mount_lock);
 
 	/* Absolute pathname -- fetch the root. */
 	if (flags & LOOKUP_IN_ROOT) {
On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:
One other possible alternative (which previous versions of this patch used) would be to check with path_is_under() if there was a racing rename or mount (after re-taking the relevant seqlocks). While this does work, it results in possible O(n*m) behaviour if there are many renames or mounts occurring *anywhere on the system*.
BTW, do you realize that open-by-fhandle (or working nfsd, for that matter) will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias() bringing disconnected subtrees in contact with parent?
On 2019-11-13, Al Viro viro@zeniv.linux.org.uk wrote:
On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:
One other possible alternative (which previous versions of this patch used) would be to check with path_is_under() if there was a racing rename or mount (after re-taking the relevant seqlocks). While this does work, it results in possible O(n*m) behaviour if there are many renames or mounts occurring *anywhere on the system*.
BTW, do you realize that open-by-fhandle (or working nfsd, for that matter) will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias() bringing disconnected subtrees in contact with parent?
I wasn't aware of that -- that makes path_is_under() even less viable. I'll reword it to be clearer that path_is_under() isn't a good idea and why we went with -EAGAIN over an in-kernel retry.
/* Background. */
For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1].
This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2).
/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);
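The "size plus zero-valued extensions" contract can be modelled in userspace as below. This is a sketch of the semantics the patch relies on from copy_struct_from_user(), not the kernel helper itself; demo_copy_struct() is a hypothetical name and it operates on ordinary memory rather than user pointers.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of a size-versioned struct copy:
 *  - usize == ksize: plain copy;
 *  - usize <  ksize: older caller, the unknown tail of dst is zero-filled;
 *  - usize >  ksize: newer caller, accepted only if the extra bytes are
 *                    all zero (otherwise -E2BIG).
 */
static int demo_copy_struct(void *dst, size_t ksize,
			    const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *bytes = src;

	/* Reject non-zero data in fields this "kernel" does not know about. */
	for (size_t i = size; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Zero first so fields the caller did not supply default to no-op. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

Because a zero value always means "no-op", an old binary on a new kernel and a new binary on an old kernel both behave predictably, provided the new binary does not actually request a feature the old kernel lacks.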
/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64 bits wide to allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  __padding
    Must be set to all zeroes.

  resolve
    Restrict path resolution (in contrast to O_* flags, these affect all path components). The current set of flags is as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       => LOOKUP_BENEATH
    RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
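The field rules above amount to a strict validator. A userspace model is sketched below under stated assumptions: the DEMO_RESOLVE_* values are copied from the uapi header in this patch, while DEMO_O_CREAT, DEMO_VALID_FLAGS and demo_check_how() are hypothetical stand-ins (the real code checks against VALID_OPEN_FLAGS and also handles O_TMPFILE and O_PATH combinations, which are omitted here).

```c
#include <errno.h>
#include <stdint.h>

/* RESOLVE_* values as defined in this patch, renamed to avoid clashes. */
#define DEMO_RESOLVE_NO_XDEV		0x01
#define DEMO_RESOLVE_NO_MAGICLINKS	0x02
#define DEMO_RESOLVE_NO_SYMLINKS	0x04
#define DEMO_RESOLVE_BENEATH		0x08
#define DEMO_RESOLVE_IN_ROOT		0x10
#define DEMO_VALID_RESOLVE \
	(DEMO_RESOLVE_NO_XDEV | DEMO_RESOLVE_NO_MAGICLINKS | \
	 DEMO_RESOLVE_NO_SYMLINKS | DEMO_RESOLVE_BENEATH | DEMO_RESOLVE_IN_ROOT)

/* Hypothetical, reduced flag set: access-mode bits plus a create flag. */
#define DEMO_O_CREAT		0x40
#define DEMO_VALID_FLAGS	(0x3 | DEMO_O_CREAT)

/* Same layout as the struct open_how in this patch (24 bytes). */
struct demo_open_how {
	uint64_t flags;
	uint16_t mode;
	uint16_t padding[3];
	uint64_t resolve;
};

/* Returns 0 if every field obeys the rules, -EINVAL otherwise. */
static int demo_check_how(const struct demo_open_how *how)
{
	if (how->flags & ~(uint64_t)DEMO_VALID_FLAGS)
		return -EINVAL;		/* unknown flag bit */
	if (how->resolve & ~(uint64_t)DEMO_VALID_RESOLVE)
		return -EINVAL;		/* unknown resolve bit */
	for (int i = 0; i < 3; i++)
		if (how->padding[i] != 0)
			return -EINVAL;	/* padding must be zeroed */

	if (how->flags & DEMO_O_CREAT) {
		if (how->mode & ~07777)
			return -EINVAL;	/* bits outside S_IALLUGO */
	} else if (how->mode != 0) {
		return -EINVAL;		/* mode without a create-like flag */
	}
	return 0;
}
```

The key difference from openat(2) is that nothing is masked off silently: every unexpected bit turns into an explicit error the caller can detect.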
open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it).
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2).
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios.
In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution).
Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened.
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZ...
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/

Suggested-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 CREDITS                                     |   4 +-
 arch/alpha/kernel/syscalls/syscall.tbl      |   1 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1 +
 arch/s390/kernel/syscalls/syscall.tbl       |   1 +
 arch/sh/kernel/syscalls/syscall.tbl         |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1 +
 fs/open.c                                   | 149 +++++++++++++++-----
 include/linux/fcntl.h                       |  12 +-
 include/linux/syscalls.h                    |   3 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/fcntl.h                  |  41 ++++++
 24 files changed, 196 insertions(+), 38 deletions(-)
diff --git a/CREDITS b/CREDITS index 031605d46b4d..a048e001d726 100644 --- a/CREDITS +++ b/CREDITS @@ -3301,7 +3301,9 @@ S: France N: Aleksa Sarai E: cyphar@cyphar.com W: https://www.cyphar.com/ -D: `pids` cgroup subsystem +D: /sys/fs/cgroup/pids +D: openat2(2) +S: Sydney, Australia
N: Dipankar Sarma E: dipankar@in.ibm.com diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 728fe028c02c..9f374f7d9514 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -475,3 +475,4 @@ 543 common fspick sys_fspick 544 common pidfd_open sys_pidfd_open # 545 reserved for clone3 +547 common openat2 sys_openat2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 6da7dc4d79cc..4ba54bc7e19a 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -449,3 +449,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 +437 common openat2 sys_openat2 diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 2629a68b8724..8aa00ccb0b96 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -38,7 +38,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 436 +#define __NR_compat_syscalls 438 #endif
#define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 94ab29cf4f00..57f6f592d460 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick) __SYSCALL(__NR_pidfd_open, sys_pidfd_open) #define __NR_clone3 435 __SYSCALL(__NR_clone3, sys_clone3) +#define __NR_openat2 437 +__SYSCALL(__NR_openat2, sys_openat2)
/* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 36d5faf4c86c..8d36f2e2dc89 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -356,3 +356,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index a88a285a0e5f..2559925f1924 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -435,3 +435,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 09b0cd7dab0a..c04385e60833 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -441,3 +441,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 +437 common openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index e7c5ab38e403..68c9ec06851f 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -374,3 +374,4 @@ 433 n32 fspick sys_fspick 434 n32 pidfd_open sys_pidfd_open 435 n32 clone3 __sys_clone3 +437 n32 openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 13cd66581f3b..42a72d010050 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -350,3 +350,4 @@ 433 n64 fspick sys_fspick 434 n64 pidfd_open sys_pidfd_open 435 n64 clone3 __sys_clone3 +437 n64 openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl 
b/arch/mips/kernel/syscalls/syscall_o32.tbl index 353539ea4140..f114c4aed0ed 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -423,3 +423,4 @@ 433 o32 fspick sys_fspick 434 o32 pidfd_open sys_pidfd_open 435 o32 clone3 __sys_clone3 +437 o32 openat2 sys_openat2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 285ff516150c..b550ae9a7fea 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -433,3 +433,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3_wrapper +437 common openat2 sys_openat2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 43f736ed47f2..a8b5ecb5b602 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -517,3 +517,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 nospu clone3 ppc_clone3 +437 common openat2 sys_openat2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 3054e9c035a3..16b571c06161 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -438,3 +438,4 @@ 433 common fspick sys_fspick sys_fspick 434 common pidfd_open sys_pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 sys_clone3 +437 common openat2 sys_openat2 sys_openat2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index b5ed26c4c005..a7185cc18626 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -438,3 +438,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 8c8cc7537fb2..b11c19552022 100644 --- 
a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -481,3 +481,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open # 435 reserved for clone3 +437 common openat2 sys_openat2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 3fe02546aed3..e5c022e9a5c4 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -440,3 +440,4 @@ 433 i386 fspick sys_fspick __ia32_sys_fspick 434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open 435 i386 clone3 sys_clone3 __ia32_sys_clone3 +437 i386 openat2 sys_openat2 __ia32_sys_openat2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c29976eca4a8..9035647ef236 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -357,6 +357,7 @@ 433 common fspick __x64_sys_fspick 434 common pidfd_open __x64_sys_pidfd_open 435 common clone3 __x64_sys_clone3/ptregs +437 common openat2 __x64_sys_openat2
# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 25f4de729a6d..f0a68013c038 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -406,3 +406,4 @@ 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open 435 common clone3 sys_clone3 +437 common openat2 sys_openat2 diff --git a/fs/open.c b/fs/open.c index b62f5c0923a8..50a46501bcc9 100644 --- a/fs/open.c +++ b/fs/open.c @@ -955,48 +955,86 @@ struct file *open_with_fake_path(const struct path *path, int flags, } EXPORT_SYMBOL(open_with_fake_path);
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op) +#define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE)) +#define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC) + +static inline struct open_how build_open_how(int flags, umode_t mode) +{ + struct open_how how = { + .flags = flags & VALID_OPEN_FLAGS, + .mode = mode & S_IALLUGO, + }; + + /* O_PATH beats everything else. */ + if (how.flags & O_PATH) + how.flags &= O_PATH_FLAGS; + /* Modes should only be set for create-like flags. */ + if (!WILL_CREATE(how.flags)) + how.mode = 0; + return how; +} + +static inline int build_open_flags(const struct open_how *how, + struct open_flags *op) { + int flags = how->flags; int lookup_flags = 0; int acc_mode = ACC_MODE(flags);
+ /* Must never be set by userspace */ + flags &= ~(FMODE_NONOTIFY | O_CLOEXEC); + /* - * Clear out all open flags we don't know about so that we don't report - * them in fcntl(F_GETFD) or similar interfaces. + * Older syscalls implicitly clear all of the invalid flags or argument + * values before calling build_open_flags(), but openat2(2) checks all + * of its arguments. */ - flags &= VALID_OPEN_FLAGS; + if (flags & ~VALID_OPEN_FLAGS) + return -EINVAL; + if (how->resolve & ~VALID_RESOLVE_FLAGS) + return -EINVAL; + if (memchr_inv(how->__padding, 0, sizeof(how->__padding))) + return -EINVAL;
- if (flags & (O_CREAT | __O_TMPFILE)) - op->mode = (mode & S_IALLUGO) | S_IFREG; - else + /* Deal with the mode. */ + if (WILL_CREATE(flags)) { + if (how->mode & ~S_IALLUGO) + return -EINVAL; + op->mode = how->mode | S_IFREG; + } else { + if (how->mode != 0) + return -EINVAL; op->mode = 0; - - /* Must never be set by userspace */ - flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC; + }
/* - * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only - * check for O_DSYNC if the need any syncing at all we enforce it's - * always set instead of having to deal with possibly weird behaviour - * for malicious applications setting only __O_SYNC. + * In order to ensure programs get explicit errors when trying to use + * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it + * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we + * have to require userspace to explicitly set it. */ - if (flags & __O_SYNC) - flags |= O_DSYNC; - if (flags & __O_TMPFILE) { if ((flags & O_TMPFILE_MASK) != O_TMPFILE) return -EINVAL; if (!(acc_mode & MAY_WRITE)) return -EINVAL; - } else if (flags & O_PATH) { - /* - * If we have O_PATH in the open flag. Then we - * cannot have anything other than the below set of flags - */ - flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH; + } + if (flags & O_PATH) { + /* O_PATH only permits certain other flags to be set. */ + if (flags & ~O_PATH_FLAGS) + return -EINVAL; acc_mode = 0; }
+ /* + * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only + * check for O_DSYNC if the need any syncing at all we enforce it's + * always set instead of having to deal with possibly weird behaviour + * for malicious applications setting only __O_SYNC. + */ + if (flags & __O_SYNC) + flags |= O_DSYNC; + op->open_flag = flags;
/* O_TRUNC implies we need access checks for write permissions */ @@ -1022,6 +1060,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_DIRECTORY; if (!(flags & O_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; + + if (how->resolve & RESOLVE_NO_XDEV) + lookup_flags |= LOOKUP_NO_XDEV; + if (how->resolve & RESOLVE_NO_MAGICLINKS) + lookup_flags |= LOOKUP_NO_MAGICLINKS; + if (how->resolve & RESOLVE_NO_SYMLINKS) + lookup_flags |= LOOKUP_NO_SYMLINKS; + if (how->resolve & RESOLVE_BENEATH) + lookup_flags |= LOOKUP_BENEATH; + if (how->resolve & RESOLVE_IN_ROOT) + lookup_flags |= LOOKUP_IN_ROOT; + op->lookup_flags = lookup_flags; return 0; } @@ -1040,8 +1090,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o struct file *file_open_name(struct filename *name, int flags, umode_t mode) { struct open_flags op; - int err = build_open_flags(flags, mode, &op); - return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op); + struct open_how how = build_open_how(flags, mode); + int err = build_open_flags(&how, &op); + if (err) + return ERR_PTR(err); + return do_filp_open(AT_FDCWD, name, &op); }
/** @@ -1072,17 +1125,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt, const char *filename, int flags, umode_t mode) { struct open_flags op; - int err = build_open_flags(flags, mode, &op); + struct open_how how = build_open_how(flags, mode); + int err = build_open_flags(&how, &op); if (err) return ERR_PTR(err); return do_file_open_root(dentry, mnt, filename, &op); } EXPORT_SYMBOL(file_open_root);
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) +static long do_sys_openat2(int dfd, const char __user *filename, + struct open_how *how) { struct open_flags op; - int fd = build_open_flags(flags, mode, &op); + int fd = build_open_flags(how, &op); struct filename *tmp;
if (fd) @@ -1092,7 +1147,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) if (IS_ERR(tmp)) return PTR_ERR(tmp);
- fd = get_unused_fd_flags(flags); + fd = get_unused_fd_flags(how->flags); if (fd >= 0) { struct file *f = do_filp_open(dfd, tmp, &op); if (IS_ERR(f)) { @@ -1107,12 +1162,16 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) return fd; }
-SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) +long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) { - if (force_o_largefile()) - flags |= O_LARGEFILE; + struct open_how how = build_open_how(flags, mode); + return do_sys_openat2(dfd, filename, &how); +}
- return do_sys_open(AT_FDCWD, filename, flags, mode); + +SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) +{ + return ksys_open(filename, flags, mode); }
SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, @@ -1120,10 +1179,32 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, { if (force_o_largefile()) flags |= O_LARGEFILE; - return do_sys_open(dfd, filename, flags, mode); }
+SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename, + struct open_how __user *, how, size_t, usize) +{ + int err; + struct open_how tmp; + + BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0); + BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST); + + if (unlikely(usize < OPEN_HOW_SIZE_VER0)) + return -EINVAL; + + err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize); + if (err) + return err; + + /* O_LARGEFILE is only allowed for non-O_PATH. */ + if (!(tmp.flags & O_PATH) && force_o_largefile()) + tmp.flags |= O_LARGEFILE; + + return do_sys_openat2(dfd, filename, &tmp); +} + #ifdef CONFIG_COMPAT /* * Exactly like sys_open(), except that it doesn't set the diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index d019df946cb2..f2eb05bd3af3 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -2,15 +2,25 @@ #ifndef _LINUX_FCNTL_H #define _LINUX_FCNTL_H
+#include <linux/stat.h> #include <uapi/linux/fcntl.h>
-/* list of all valid flags for the open/openat flags argument: */ +/* List of all valid flags for the open/openat flags argument: */ #define VALID_OPEN_FLAGS \ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+/* List of all valid flags for the how->upgrade_mask argument: */ +#define VALID_UPGRADE_FLAGS \ + (UPGRADE_NOWRITE | UPGRADE_NOREAD) + +/* List of all valid flags for the how->resolve argument: */ +#define VALID_RESOLVE_FLAGS \ + (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ + RESOLVE_BENEATH | RESOLVE_IN_ROOT) + #ifndef force_o_largefile #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T)) #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index f7c561c4dcdd..808f103b7a62 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct rseq; union bpf_attr; struct io_uring_params; struct clone_args; +struct open_how;
#include <linux/types.h> #include <linux/aio_abi.h> @@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group); asmlinkage long sys_openat(int dfd, const char __user *filename, int flags, umode_t mode); +asmlinkage long sys_openat2(int dfd, const char __user *filename, + struct open_how *how, size_t size); asmlinkage long sys_close(unsigned int fd); asmlinkage long sys_vhangup(void);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 1fc8faa6e973..d4122c091472 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open) __SYSCALL(__NR_clone3, sys_clone3) #endif
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 438

 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..5de8b0006a95 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,46 @@

 #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */

+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+	__aligned_u64 flags;
+	__u16 mode;
+	__u16 __padding[3]; /* must be zeroed */
+	__aligned_u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
+					(includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
+					"magic-links". */
+#define RESOLVE_NO_SYMLINKS	0x04 /* Block traversal through all symlinks
+					(implies RESOLVE_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH		0x08 /* Block "lexical" trickery like
+					"..", symlinks, and absolute
+					paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
+					be scoped inside the dirfd
+					(similar to chroot(2)). */
+
+/* how->upgrade flags for openat2(2). */
+/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
+#define UPGRADE_NOREAD		0x02 /* Block re-opening with MAY_READ. */
+#define UPGRADE_NOWRITE		0x04 /* Block re-opening with MAY_WRITE. */

 #endif /* _UAPI_LINUX_FCNTL_H */
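For readers following along, here is a rough sketch of how userspace might call the new syscall with the struct described in the hunk above. It is not part of the patch. The names open_how_sketch, SKETCH_RESOLVE_IN_ROOT and open_in_root() are made up for illustration, and the fallback syscall number 437 is simply the value this patch assigns (it is only used if the installed headers do not define __NR_openat2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: number taken from the patch; used only if headers lack it. */
#ifndef __NR_openat2
#define __NR_openat2 437
#endif

/*
 * Local mirror of the proposed struct open_how. The name is hypothetical so
 * that it cannot clash with a system header that already ships the real one.
 */
struct open_how_sketch {
	uint64_t flags;
	uint16_t mode;
	uint16_t padding[3];	/* must be zeroed */
	uint64_t resolve;
};

/* Same value as RESOLVE_IN_ROOT in the patch (hypothetical local name). */
#define SKETCH_RESOLVE_IN_ROOT 0x10

/*
 * Open @path as an O_PATH fd with every "/" and ".." jump scoped to @dirfd.
 * Returns a new fd on success or -errno on failure (for example -ENOSYS on
 * a kernel without openat2(2)).
 */
static int open_in_root(int dirfd, const char *path)
{
	struct open_how_sketch how;
	long ret;

	/* Zero everything first: padding and unknown fields must be zero. */
	memset(&how, 0, sizeof(how));
	how.flags = O_PATH | O_CLOEXEC;
	how.resolve = SKETCH_RESOLVE_IN_ROOT;

	ret = syscall(__NR_openat2, dirfd, path, &how, sizeof(how));
	return ret >= 0 ? (int)ret : -errno;
}
```

A caller would treat a negative return as an errno value, e.g. fall back to openat(2) on -ENOSYS only if it can live without the RESOLVE_* guarantees.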
On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).

???

> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> +	__aligned_u64 flags;
> +	__u16 mode;
> +	__u16 __padding[3]; /* must be zeroed */
> +	__aligned_u64 resolve;
> +};
On 2019-11-13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
> > +/*
> > + * Arguments for how openat2(2) should open the target path. If @resolve is
> > + * zero, then openat2(2) operates very similarly to openat(2).
> > + *
> > + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> > + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> > + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> > + *
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
>
> ???

Sorry, that was left over from a previous revision (where the magic-link re-opening restrictions were part of this series).

> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > +	__aligned_u64 flags;
> > +	__u16 mode;
> > +	__u16 __padding[3]; /* must be zeroed */
> > +	__aligned_u64 resolve;
> > +};
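Since @upgrade_mask is a leftover and not a field, the struct as quoted carries three fields plus explicit padding. A compile-time check of the layout that OPEN_HOW_SIZE_VER0 relies on could be sketched as below. This is only an illustration: open_how_layout and LAYOUT_SIZE_VER0 are hypothetical local names, and the aligned attribute stands in for the kernel's __aligned_u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct open_how from the patch (hypothetical name). */
struct open_how_layout {
	uint64_t flags __attribute__((aligned(8)));
	uint16_t mode;
	uint16_t padding[3];	/* must be zeroed */
	uint64_t resolve __attribute__((aligned(8)));
};

/* Same value as OPEN_HOW_SIZE_VER0 in the patch. */
#define LAYOUT_SIZE_VER0 24

/* 8 (flags) + 2 (mode) + 6 (padding) + 8 (resolve) == 24, with no holes. */
_Static_assert(sizeof(struct open_how_layout) == LAYOUT_SIZE_VER0,
	       "first published struct must be 24 bytes");
_Static_assert(offsetof(struct open_how_layout, mode) == 8,
	       "mode follows the 64-bit flags field");
_Static_assert(offsetof(struct open_how_layout, resolve) == 16,
	       "explicit padding keeps resolve 8-byte aligned");
```

Making the padding explicit (rather than leaving it to the compiler) is what lets the kernel insist that those bytes are zero.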
Test all of the various openat2(2) flags. A small stress-test of a symlink-rename attack is included to show that the protections against ".."-based attacks are sufficient.
The main things these self-tests are enforcing are:
* The struct+usize ABI for openat2(2) and copy_struct_from_user() to ensure that upgrades will be handled gracefully (in addition, ensuring that misaligned structures are also handled correctly).
* The -EINVAL checks for openat2(2) are all correctly handled to avoid userspace passing unknown or conflicting flag sets (most importantly, ensuring that invalid flag combinations are checked).
* All of the RESOLVE_* semantics (including errno values) are correctly handled with various combinations of paths and flags.
* RESOLVE_IN_ROOT correctly protects against the symlink rename(2) attack that has been responsible for several CVEs (and likely will be responsible for several more).
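To make the first point concrete, the size handling that the struct tests poke at can be modelled in plain userspace C roughly as follows. This is a simplified sketch of the behaviour the tests expect (-EINVAL for too-small sizes, -E2BIG for non-zero trailing bytes), not the kernel's copy_struct_from_user() itself; model_copy_struct() and SKETCH_SIZE_VER0 are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Same value as OPEN_HOW_SIZE_VER0 in the patch (hypothetical local name). */
#define SKETCH_SIZE_VER0 24

/*
 * Userspace model of the struct+usize handling.
 *   @dst/@ksize: the struct layout the "kernel" knows about.
 *   @src/@usize: the buffer and size the caller passed in.
 * Returns 0 on success or a negative errno.
 */
static int model_copy_struct(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *bytes = src;
	size_t copy = usize < ksize ? usize : ksize;

	/* Smaller than the first published struct: rejected outright. */
	if (usize < SKETCH_SIZE_VER0)
		return -EINVAL;

	/* Newer userspace, older kernel: unknown trailing bytes must be 0. */
	for (size_t i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Older userspace, newer kernel: fields it did not pass read as 0. */
	memset(dst, 0, ksize);
	memcpy(dst, src, copy);
	return 0;
}
```

This mirrors the "bigger struct (zeroed out)" case succeeding while the "non-zero data in 'future field'" cases fail with -E2BIG.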
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/openat2/.gitignore         |   1 +
 tools/testing/selftests/openat2/Makefile           |   8 +
 tools/testing/selftests/openat2/helpers.c          | 109 ++++
 tools/testing/selftests/openat2/helpers.h          | 107 ++++
 .../testing/selftests/openat2/openat2_test.c       | 316 +++++++++++
 .../selftests/openat2/rename_attack_test.c         | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c       | 523 ++++++++++++++++++
 8 files changed, 1225 insertions(+)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 4cdbae6f4e61..28996856ed5e 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -37,6 +37,7 @@ TARGETS += powerpc TARGETS += proc TARGETS += pstore TARGETS += ptrace +TARGETS += openat2 TARGETS += rseq TARGETS += rtc TARGETS += seccomp diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore new file mode 100644 index 000000000000..bd68f6c3fd07 --- /dev/null +++ b/tools/testing/selftests/openat2/.gitignore @@ -0,0 +1 @@ +/*_test diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile new file mode 100644 index 000000000000..4b93b1417b86 --- /dev/null +++ b/tools/testing/selftests/openat2/Makefile @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined +TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test + +include ../lib.mk + +$(TEST_GEN_PROGS): helpers.c diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c new file mode 100644 index 000000000000..e9a6557ab16f --- /dev/null +++ b/tools/testing/selftests/openat2/helpers.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <stdbool.h> +#include <string.h> +#include <syscall.h> +#include <limits.h> + +#include "helpers.h" + +bool needs_openat2(const struct open_how *how) +{ + return how->resolve != 0; +} + +int raw_openat2(int dfd, const char *path, void *how, size_t size) +{ + int ret = syscall(__NR_openat2, dfd, path, how, size); + return ret >= 0 ? 
ret : -errno; +} + +int sys_openat2(int dfd, const char *path, struct open_how *how) +{ + return raw_openat2(dfd, path, how, sizeof(*how)); +} + +int sys_openat(int dfd, const char *path, struct open_how *how) +{ + int ret = openat(dfd, path, how->flags, how->mode); + return ret >= 0 ? ret : -errno; +} + +int sys_renameat2(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, unsigned int flags) +{ + int ret = syscall(__NR_renameat2, olddirfd, oldpath, + newdirfd, newpath, flags); + return ret >= 0 ? ret : -errno; +} + +int touchat(int dfd, const char *path) +{ + int fd = openat(dfd, path, O_CREAT, 0700); + if (fd >= 0) + close(fd); + return fd; +} + +char *fdreadlink(int fd) +{ + char *target, *tmp; + + E_asprintf(&tmp, "/proc/self/fd/%d", fd); + + target = malloc(PATH_MAX); + if (!target) + ksft_exit_fail_msg("fdreadlink: malloc failed\n"); + memset(target, 0, PATH_MAX); + + E_readlink(tmp, target, PATH_MAX); + free(tmp); + return target; +} + +bool fdequal(int fd, int dfd, const char *path) +{ + char *fdpath, *dfdpath, *other; + bool cmp; + + fdpath = fdreadlink(fd); + dfdpath = fdreadlink(dfd); + + if (!path) + E_asprintf(&other, "%s", dfdpath); + else if (*path == '/') + E_asprintf(&other, "%s", path); + else + E_asprintf(&other, "%s/%s", dfdpath, path); + + cmp = !strcmp(fdpath, other); + + free(fdpath); + free(dfdpath); + free(other); + return cmp; +} + +bool openat2_supported = false; + +void __attribute__((constructor)) init(void) +{ + struct open_how how = {}; + int fd; + + BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0); + + /* Check openat2(2) support.
*/ + fd = sys_openat2(AT_FDCWD, ".", &how); + openat2_supported = (fd >= 0); + + if (fd >= 0) + close(fd); +} diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h new file mode 100644 index 000000000000..43ca5ceab6e3 --- /dev/null +++ b/tools/testing/selftests/openat2/helpers.h @@ -0,0 +1,107 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#ifndef __RESOLVEAT_H__ +#define __RESOLVEAT_H__ + +#define _GNU_SOURCE +#include <stdint.h> +#include <errno.h> +#include <linux/types.h> +#include "../kselftest.h" + +#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X))) +#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); }))) + +#ifndef SYS_openat2 +#ifndef __NR_openat2 +#define __NR_openat2 437 +#endif /* __NR_openat2 */ +#define SYS_openat2 __NR_openat2 +#endif /* SYS_openat2 */ + +/* + * Arguments for how openat2(2) should open the target path. If @resolve is + * zero, then openat2(2) operates very similarly to openat(2). + * + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather + * than being silently ignored. @mode must be zero unless one of {O_CREAT, + * O_TMPFILE} are set. + * + * @flags: O_* flags. + * @mode: O_CREAT/O_TMPFILE file mode. + * @resolve: RESOLVE_* flags. + */ +struct open_how { + __aligned_u64 flags; + __u16 mode; + __u16 __padding[3]; /* must be zeroed */ + __aligned_u64 resolve; +}; + +#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */ +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0 + +bool needs_openat2(const struct open_how *how); + +#ifndef RESOLVE_IN_ROOT +/* how->resolve flags for openat2(2). */ +#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings + (includes bind-mounts). */ +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style + "magic-links". 
*/ +#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks + (implies RESOLVE_NO_MAGICLINKS) */ +#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like + "..", symlinks, and absolute + paths which escape the dirfd. */ +#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." + be scoped inside the dirfd + (similar to chroot(2)). */ +#endif /* RESOLVE_IN_ROOT */ + +#define E_func(func, ...) \ + do { \ + if (func(__VA_ARGS__) < 0) \ + ksft_exit_fail_msg("%s:%d %s failed\n", \ + __FILE__, __LINE__, #func);\ + } while (0) + +#define E_asprintf(...) E_func(asprintf, __VA_ARGS__) +#define E_chmod(...) E_func(chmod, __VA_ARGS__) +#define E_dup2(...) E_func(dup2, __VA_ARGS__) +#define E_fchdir(...) E_func(fchdir, __VA_ARGS__) +#define E_fstatat(...) E_func(fstatat, __VA_ARGS__) +#define E_kill(...) E_func(kill, __VA_ARGS__) +#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__) +#define E_mount(...) E_func(mount, __VA_ARGS__) +#define E_prctl(...) E_func(prctl, __VA_ARGS__) +#define E_readlink(...) E_func(readlink, __VA_ARGS__) +#define E_setresuid(...) E_func(setresuid, __VA_ARGS__) +#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__) +#define E_touchat(...) E_func(touchat, __VA_ARGS__) +#define E_unshare(...) E_func(unshare, __VA_ARGS__) + +#define E_assert(expr, msg, ...)
\ + do { \ + if (!(expr)) \ + ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \ + __FILE__, __LINE__, #expr, ##__VA_ARGS__); \ + } while (0) + +int raw_openat2(int dfd, const char *path, void *how, size_t size); +int sys_openat2(int dfd, const char *path, struct open_how *how); +int sys_openat(int dfd, const char *path, struct open_how *how); +int sys_renameat2(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, unsigned int flags); + +int touchat(int dfd, const char *path); +char *fdreadlink(int fd); +bool fdequal(int fd, int dfd, const char *path); + +extern bool openat2_supported; + +#endif /* __RESOLVEAT_H__ */ diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c new file mode 100644 index 000000000000..8a641acb0d6c --- /dev/null +++ b/tools/testing/selftests/openat2/openat2_test.c @@ -0,0 +1,316 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* + * O_LARGEFILE is set to 0 by glibc. + * XXX: This is wrong on {mips, parisc, powerpc, sparc}. + */ +#undef O_LARGEFILE +#define O_LARGEFILE 0x8000 + +struct open_how_ext { + struct open_how inner; + uint32_t extra1; + char pad1[128]; + uint32_t extra2; + char pad2[128]; + uint32_t extra3; +}; + +struct struct_test { + const char *name; + struct open_how_ext arg; + size_t size; + int err; +}; + +#define NUM_OPENAT2_STRUCT_TESTS 9 +#define NUM_OPENAT2_STRUCT_VARIATIONS 13 + +void test_openat2_struct(void) +{ + int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 }; + + struct struct_test tests[] = { + /* Normal struct. 
*/ + { .name = "normal struct", + .arg.inner.flags = O_RDONLY, + .size = sizeof(struct open_how) }, + /* Bigger struct, with zeroed out end. */ + { .name = "bigger struct (zeroed out)", + .arg.inner.flags = O_RDONLY, + .size = sizeof(struct open_how_ext) }, + + /* Normal struct with broken padding. */ + { .name = "normal struct (non-zero padding[0])", + .arg.inner.flags = O_RDONLY, + .arg.inner.__padding = {0xa0, 0x00}, + .size = sizeof(struct open_how_ext), .err = -EINVAL }, + { .name = "normal struct (non-zero padding[1])", + .arg.inner.flags = O_RDONLY, + .arg.inner.__padding = {0x00, 0x1a}, + .size = sizeof(struct open_how_ext), .err = -EINVAL }, + + /* TODO: Once expanded, check zero-padding. */ + + /* Smaller than version-0 struct. */ + { .name = "zero-sized 'struct'", + .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL }, + { .name = "smaller-than-v0 struct", + .arg.inner.flags = O_RDONLY, + .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL }, + + /* Bigger struct, with non-zero trailing bytes. 
*/ + { .name = "bigger struct (non-zero data in first 'future field')", + .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef, + .size = sizeof(struct open_how_ext), .err = -E2BIG }, + { .name = "bigger struct (non-zero data in middle of 'future fields')", + .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe, + .size = sizeof(struct open_how_ext), .err = -E2BIG }, + { .name = "bigger struct (non-zero data at end of 'future fields')", + .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea, + .size = sizeof(struct open_how_ext), .err = -E2BIG }, + }; + + BUILD_BUG_ON(ARRAY_LEN(misalignments) != NUM_OPENAT2_STRUCT_VARIATIONS); + BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_STRUCT_TESTS); + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + struct struct_test *test = &tests[i]; + struct open_how_ext how_ext = test->arg; + + for (int j = 0; j < ARRAY_LEN(misalignments); j++) { + int fd, misalign = misalignments[j]; + char *fdpath = NULL; + bool failed; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + void *copy = NULL, *how_copy = &how_ext; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + if (misalign) { + /* + * Explicitly misalign the structure copying it with the given + * (mis)alignment offset. The other data is set to be non-zero to + * make sure that non-zero bytes outside the struct aren't checked + * + * This is effectively to check that is_zeroed_user() works. 
+ */ + copy = malloc(misalign + sizeof(how_ext)); + how_copy = copy + misalign; + memset(copy, 0xff, misalign); + memcpy(how_copy, &how_ext, sizeof(how_ext)); + } + + fd = raw_openat2(AT_FDCWD, ".", how_copy, test->size); + if (test->err >= 0) + failed = (fd < 0); + else + failed = (fd != test->err); + if (fd >= 0) { + fdpath = fdreadlink(fd); + close(fd); + } + + if (failed) { + resultfn = ksft_test_result_fail; + + ksft_print_msg("openat2 unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s']\n", fd, fdpath); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + +skip: + if (test->err >= 0) + resultfn("openat2 with %s argument [misalign=%d] succeeds\n", + test->name, misalign); + else + resultfn("openat2 with %s argument [misalign=%d] fails with %d (%s)\n", + test->name, misalign, test->err, + strerror(-test->err)); + + free(copy); + free(fdpath); + fflush(stdout); + } + } +} + +struct flag_test { + const char *name; + struct open_how how; + int err; +}; + +#define NUM_OPENAT2_FLAG_TESTS 21 + +void test_openat2_flags(void) +{ + struct flag_test tests[] = { + /* O_TMPFILE is incompatible with O_PATH and O_CREAT. */ + { .name = "incompatible flags (O_TMPFILE | O_PATH)", + .how.flags = O_TMPFILE | O_PATH | O_RDWR, .err = -EINVAL }, + { .name = "incompatible flags (O_TMPFILE | O_CREAT)", + .how.flags = O_TMPFILE | O_CREAT | O_RDWR, .err = -EINVAL }, + + /* O_PATH only permits certain other flags to be set ... */ + { .name = "compatible flags (O_PATH | O_CLOEXEC)", + .how.flags = O_PATH | O_CLOEXEC }, + { .name = "compatible flags (O_PATH | O_DIRECTORY)", + .how.flags = O_PATH | O_DIRECTORY }, + { .name = "compatible flags (O_PATH | O_NOFOLLOW)", + .how.flags = O_PATH | O_NOFOLLOW }, + /* ... and others are absolutely not permitted. 
*/ + { .name = "incompatible flags (O_PATH | O_RDWR)", + .how.flags = O_PATH | O_RDWR, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_CREAT)", + .how.flags = O_PATH | O_CREAT, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_EXCL)", + .how.flags = O_PATH | O_EXCL, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_NOCTTY)", + .how.flags = O_PATH | O_NOCTTY, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_DIRECT)", + .how.flags = O_PATH | O_DIRECT, .err = -EINVAL }, + { .name = "incompatible flags (O_PATH | O_LARGEFILE)", + .how.flags = O_PATH | O_LARGEFILE, .err = -EINVAL }, + + /* ->mode must only be set with O_{CREAT,TMPFILE}. */ + { .name = "non-zero how.mode and O_RDONLY", + .how.flags = O_RDONLY, .how.mode = 0600, .err = -EINVAL }, + { .name = "non-zero how.mode and O_PATH", + .how.flags = O_PATH, .how.mode = 0600, .err = -EINVAL }, + { .name = "valid how.mode and O_CREAT", + .how.flags = O_CREAT, .how.mode = 0600 }, + { .name = "valid how.mode and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, .how.mode = 0600 }, + /* ->mode must only contain 0777 bits. */ + { .name = "invalid how.mode and O_CREAT", + .how.flags = O_CREAT, + .how.mode = 0xFFFF, .err = -EINVAL }, + { .name = "invalid how.mode and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, + .how.mode = 0x1337, .err = -EINVAL }, + + /* ->resolve must only contain RESOLVE_* flags. 
*/ + { .name = "invalid how.resolve and O_RDONLY", + .how.flags = O_RDONLY, + .how.resolve = 0x1337, .err = -EINVAL }, + { .name = "invalid how.resolve and O_CREAT", + .how.flags = O_CREAT, + .how.resolve = 0x1337, .err = -EINVAL }, + { .name = "invalid how.resolve and O_TMPFILE", + .how.flags = O_TMPFILE | O_RDWR, + .how.resolve = 0x1337, .err = -EINVAL }, + { .name = "invalid how.resolve and O_PATH", + .how.flags = O_PATH, + .how.resolve = 0x1337, .err = -EINVAL }, + }; + + BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS); + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + int fd, fdflags = -1; + char *path, *fdpath = NULL; + bool failed = false; + struct flag_test *test = &tests[i]; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + path = (test->how.flags & O_CREAT) ? "/tmp/ksft.openat2_tmpfile" : "."; + unlink(path); + + fd = sys_openat2(AT_FDCWD, path, &test->how); + if (test->err >= 0) + failed = (fd < 0); + else + failed = (fd != test->err); + if (fd >= 0) { + int otherflags; + + fdpath = fdreadlink(fd); + fdflags = fcntl(fd, F_GETFL); + otherflags = fcntl(fd, F_GETFD); + close(fd); + + E_assert(fdflags >= 0, "fcntl F_GETFL of new fd"); + E_assert(otherflags >= 0, "fcntl F_GETFD of new fd"); + + /* O_CLOEXEC isn't shown in F_GETFL. */ + if (otherflags & FD_CLOEXEC) + fdflags |= O_CLOEXEC; + /* O_CREAT is hidden from F_GETFL. 
*/ + if (test->how.flags & O_CREAT) + fdflags |= O_CREAT; + if (!(test->how.flags & O_LARGEFILE)) + fdflags &= ~O_LARGEFILE; + failed |= (fdflags != test->how.flags); + } + + if (failed) { + resultfn = ksft_test_result_fail; + + ksft_print_msg("openat2 unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s'] with %X (!= %X)\n", + fd, fdpath, fdflags, + test->how.flags); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + +skip: + if (test->err >= 0) + resultfn("openat2 with %s succeeds\n", test->name); + else + resultfn("openat2 with %s fails with %d (%s)\n", + test->name, test->err, strerror(-test->err)); + + free(fdpath); + fflush(stdout); + } +} + +#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \ + NUM_OPENAT2_FLAG_TESTS) + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(NUM_TESTS); + + test_openat2_struct(); + test_openat2_flags(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c new file mode 100644 index 000000000000..0a770728b436 --- /dev/null +++ b/tools/testing/selftests/openat2/rename_attack_test.c @@ -0,0 +1,160 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. 
+ */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <sys/mman.h> +#include <sys/prctl.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> +#include <syscall.h> +#include <limits.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* Construct a test directory with the following structure: + * + * root/ + * |-- a/ + * | `-- c/ + * `-- b/ + */ +int setup_testdir(void) +{ + int dfd; + char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX"; + + /* Make the top-level directory. */ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + E_mkdirat(dfd, "a", 0755); + E_mkdirat(dfd, "b", 0755); + E_mkdirat(dfd, "a/c", 0755); + + return dfd; +} + +/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */ +pid_t spawn_attack(int dirfd, char *a, char *b) +{ + pid_t child = fork(); + if (child != 0) + return child; + + /* If the parent (the test process) dies, kill ourselves too. */ + E_prctl(PR_SET_PDEATHSIG, SIGKILL); + + /* Swap @a and @b. */ + for (;;) + renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE); + exit(1); +} + +#define NUM_RENAME_TESTS 2 +#define ROUNDS 400000 + +const char *flagname(int resolve) +{ + switch (resolve) { + case RESOLVE_IN_ROOT: + return "RESOLVE_IN_ROOT"; + case RESOLVE_BENEATH: + return "RESOLVE_BENEATH"; + } + return "(unknown)"; +} + +void test_rename_attack(int resolve) +{ + int dfd, afd; + pid_t child; + void (*resultfn)(const char *msg, ...) 
= ksft_test_result_pass; + int escapes = 0, other_errs = 0, exdevs = 0, eagains = 0, successes = 0; + + struct open_how how = { + .flags = O_PATH, + .resolve = resolve, + }; + + if (!openat2_supported) { + how.resolve = 0; + ksft_print_msg("openat2(2) unsupported -- using openat(2) instead\n"); + } + + dfd = setup_testdir(); + afd = openat(dfd, "a", O_PATH); + if (afd < 0) + ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n"); + + child = spawn_attack(dfd, "a/c", "b"); + + for (int i = 0; i < ROUNDS; i++) { + int fd; + char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../.."; + + if (openat2_supported) + fd = sys_openat2(afd, victim_path, &how); + else + fd = sys_openat(afd, victim_path, &how); + + if (fd < 0) { + if (fd == -EAGAIN) + eagains++; + else if (fd == -EXDEV) + exdevs++; + else if (fd == -ENOENT) + escapes++; /* escaped outside and got ENOENT... */ + else + other_errs++; /* unexpected error */ + } else { + if (fdequal(fd, afd, NULL)) + successes++; + else + escapes++; /* we got an unexpected fd */ + } + close(fd); + } + + if (escapes > 0) + resultfn = ksft_test_result_fail; + ksft_print_msg("non-escapes: EAGAIN=%d EXDEV=%d E<other>=%d success=%d\n", + eagains, exdevs, other_errs, successes); + resultfn("rename attack with %s (%d runs, got %d escapes)\n", + flagname(resolve), ROUNDS, escapes); + + /* Should be killed anyway, but might as well make sure. 
*/ + E_kill(child, SIGKILL); +} + +#define NUM_TESTS NUM_RENAME_TESTS + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(NUM_TESTS); + + test_rename_attack(RESOLVE_BENEATH); + test_rename_attack(RESOLVE_IN_ROOT); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c new file mode 100644 index 000000000000..7a94b1da8e7b --- /dev/null +++ b/tools/testing/selftests/openat2/resolve_test.c @@ -0,0 +1,523 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* + * Construct a test directory with the following structure: + * + * root/ + * |-- procexe -> /proc/self/exe + * |-- procroot -> /proc/self/root + * |-- root/ + * |-- mnt/ [mountpoint] + * | |-- self -> ../mnt/ + * | `-- absself -> /mnt/ + * |-- etc/ + * | `-- passwd + * |-- creatlink -> /newfile3 + * |-- reletc -> etc/ + * |-- relsym -> etc/passwd + * |-- absetc -> /etc/ + * |-- abssym -> /etc/passwd + * |-- abscheeky -> /cheeky + * `-- cheeky/ + * |-- absself -> / + * |-- self -> ../../root/ + * |-- garbageself -> /../../root/ + * |-- passwd -> ../cheeky/../cheeky/../etc/../etc/passwd + * |-- abspasswd -> /../cheeky/../cheeky/../etc/../etc/passwd + * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd + * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd + */ +int setup_testdir(void) +{ + int dfd, tmpfd; + char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX"; + + /* Unshare and make /tmp a new directory. 
*/ + E_unshare(CLONE_NEWNS); + E_mount("", "/tmp", "", MS_PRIVATE, ""); + + /* Make the top-level directory. */ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + /* A sub-directory which is actually used for tests. */ + E_mkdirat(dfd, "root", 0755); + tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY); + if (tmpfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + close(dfd); + dfd = tmpfd; + + E_symlinkat("/proc/self/exe", dfd, "procexe"); + E_symlinkat("/proc/self/root", dfd, "procroot"); + E_mkdirat(dfd, "root", 0755); + + /* There is no mountat(2), so use chdir. */ + E_mkdirat(dfd, "mnt", 0755); + E_fchdir(dfd); + E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, ""); + E_symlinkat("../mnt/", dfd, "mnt/self"); + E_symlinkat("/mnt/", dfd, "mnt/absself"); + + E_mkdirat(dfd, "etc", 0755); + E_touchat(dfd, "etc/passwd"); + + E_symlinkat("/newfile3", dfd, "creatlink"); + E_symlinkat("etc/", dfd, "reletc"); + E_symlinkat("etc/passwd", dfd, "relsym"); + E_symlinkat("/etc/", dfd, "absetc"); + E_symlinkat("/etc/passwd", dfd, "abssym"); + E_symlinkat("/cheeky", dfd, "abscheeky"); + + E_mkdirat(dfd, "cheeky", 0755); + + E_symlinkat("/", dfd, "cheeky/absself"); + E_symlinkat("../../root/", dfd, "cheeky/self"); + E_symlinkat("/../../root/", dfd, "cheeky/garbageself"); + + E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd"); + E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd"); + + E_symlinkat("../../../../../../../../../../../../../../etc/passwd", + dfd, "cheeky/dotdotlink"); + E_symlinkat("/../../../../../../../../../../../../../../etc/passwd", + dfd, "cheeky/garbagelink"); + + return dfd; +} + +struct basic_test { + const char *name; + const char *dir; + const char *path; + struct open_how how; + bool pass; + union { + int err; + const char 
*path; + } out; +}; + +#define NUM_OPENAT2_OPATH_TESTS 88 + +void test_openat2_opath_tests(void) +{ + int rootfd, hardcoded_fd; + char *procselfexe, *hardcoded_fdpath; + + E_asprintf(&procselfexe, "/proc/%d/exe", getpid()); + rootfd = setup_testdir(); + + hardcoded_fd = open("/dev/null", O_RDONLY); + E_assert(hardcoded_fd >= 0, "open fd to hardcode"); + E_asprintf(&hardcoded_fdpath, "self/fd/%d", hardcoded_fd); + + struct basic_test tests[] = { + /** RESOLVE_BENEATH **/ + /* Attempts to cross dirfd should be blocked. */ + { .name = "[beneath] jump to /", + .path = "/", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] absolute link to $root", + .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained absolute links to $root", + .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] jump outside $root", + .path = "..", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] temporary jump outside $root", + .path = "../root/", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] symlink temporary jump outside $root", + .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained symlink temporary jump outside $root", + .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] garbage links to $root", + .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained garbage links to $root", + .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + /* Only relative paths that stay inside dirfd should work. 
*/ + { .name = "[beneath] ordinary path to 'root'", + .path = "root", .how.resolve = RESOLVE_BENEATH, + .out.path = "root", .pass = true }, + { .name = "[beneath] ordinary path to 'etc'", + .path = "etc", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc", .pass = true }, + { .name = "[beneath] ordinary path to 'etc/passwd'", + .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .name = "[beneath] relative symlink inside $root", + .path = "relsym", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .name = "[beneath] chained-'..' relative symlink inside $root", + .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .name = "[beneath] absolute symlink component outside $root", + .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] absolute symlink target outside $root", + .path = "abssym", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] absolute path outside $root", + .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] cheeky absolute path outside $root", + .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] chained cheeky absolute path outside $root", + .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + /* Tricky paths should fail. 
*/ + { .name = "[beneath] tricky '..'-chained symlink outside $root", + .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] tricky absolute + '..'-chained symlink outside $root", + .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] tricky garbage link outside $root", + .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .name = "[beneath] tricky absolute + garbage link outside $root", + .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + + /** RESOLVE_IN_ROOT **/ + /* All attempts to cross the dirfd will be scoped-to-root. */ + { .name = "[in_root] jump to /", + .path = "/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] absolute symlink to /root", + .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] chained absolute symlinks to /root", + .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] '..' at root", + .path = "..", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .name = "[in_root] '../root' at root", + .path = "../root/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] relative symlink containing '..' 
above root", + .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] garbage link to /root", + .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] chainged garbage links to /root", + .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] relative path to 'root'", + .path = "root", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .name = "[in_root] relative path to 'etc'", + .path = "etc", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc", .pass = true }, + { .name = "[in_root] relative path to 'etc/passwd'", + .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] relative symlink to 'etc/passwd'", + .path = "relsym", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] chained-'..' relative symlink to 'etc/passwd'", + .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] chained-'..' 
absolute + relative symlink to 'etc/passwd'", + .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] absolute symlink to 'etc/passwd'", + .path = "abssym", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] absolute path 'etc/passwd'", + .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] cheeky absolute path 'etc/passwd'", + .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] chained cheeky absolute path 'etc/passwd'", + .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky '..'-chained symlink outside $root", + .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute + '..'-chained symlink outside $root", + .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute path + absolute + '..'-chained symlink outside $root", + .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky garbage link outside $root", + .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute + garbage link outside $root", + .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .name = "[in_root] tricky absolute path + absolute + garbage link outside $root", + .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + /* O_CREAT should handle trailing symlinks correctly. 
*/ + { .name = "[in_root] O_CREAT of relative path inside $root", + .path = "newfile1", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile1", .pass = true }, + { .name = "[in_root] O_CREAT of absolute path", + .path = "/newfile2", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile2", .pass = true }, + { .name = "[in_root] O_CREAT of tricky symlink outside root", + .path = "/creatlink", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile3", .pass = true }, + + /** RESOLVE_NO_XDEV **/ + /* Crossing *down* into a mountpoint is disallowed. */ + { .name = "[no_xdev] cross into $mnt", + .path = "mnt", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] cross into $mnt/", + .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] cross into $mnt/.", + .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Crossing *up* out of a mountpoint is disallowed. */ + { .name = "[no_xdev] goto mountpoint root", + .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "mnt", .pass = true }, + { .name = "[no_xdev] cross up through '..'", + .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] temporary cross up through '..'", + .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] temporary relative symlink cross up", + .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] temporary absolute symlink cross up", + .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Jumping to "/" is ok, but later components cannot cross. 
*/ + { .name = "[no_xdev] jump to / directly", + .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/", .pass = true }, + { .name = "[no_xdev] jump to / (from /) directly", + .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/", .pass = true }, + { .name = "[no_xdev] jump to / then proc", + .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] jump to / then tmp", + .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Magic-links are blocked since they can switch vfsmounts. */ + { .name = "[no_xdev] cross through magic-link to self/root", + .dir = "/proc", .path = "self/root", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .name = "[no_xdev] cross through magic-link to self/cwd", + .dir = "/proc", .path = "self/cwd", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Except magic-link jumps inside the same vfsmount. */ + { .name = "[no_xdev] jump through magic-link to same procfs", + .dir = "/proc", .path = hardcoded_fdpath, .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/proc", .pass = true, }, + + /** RESOLVE_NO_MAGICLINKS **/ + /* Regular symlinks should work. */ + { .name = "[no_magiclinks] ordinary relative symlink", + .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.path = "etc/passwd", .pass = true }, + /* Magic-links should not work. 
*/ + { .name = "[no_magiclinks] symlink to magic-link", + .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] normal path to magic-link", + .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] normal path to magic-link with O_NOFOLLOW", + .path = "/proc/self/exe", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.path = procselfexe, .pass = true }, + { .name = "[no_magiclinks] symlink to magic-link path component", + .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] magic-link path component", + .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_magiclinks] magic-link path component with O_NOFOLLOW", + .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + + /** RESOLVE_NO_SYMLINKS **/ + /* Normal paths should work. */ + { .name = "[no_symlinks] ordinary path to '.'", + .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = NULL, .pass = true }, + { .name = "[no_symlinks] ordinary path to 'root'", + .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "root", .pass = true }, + { .name = "[no_symlinks] ordinary path to 'etc'", + .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "etc", .pass = true }, + { .name = "[no_symlinks] ordinary path to 'etc/passwd'", + .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "etc/passwd", .pass = true }, + /* Regular symlinks are blocked. 
*/ + { .name = "[no_symlinks] relative symlink target", + .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] relative symlink component", + .path = "reletc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] absolute symlink target", + .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] absolute symlink component", + .path = "absetc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] cheeky garbage link", + .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] cheeky absolute + garbage link", + .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] cheeky absolute + absolute symlink", + .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + /* Trailing symlinks with NO_FOLLOW. 
*/ + { .name = "[no_symlinks] relative symlink with O_NOFOLLOW", + .path = "relsym", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "relsym", .pass = true }, + { .name = "[no_symlinks] absolute symlink with O_NOFOLLOW", + .path = "abssym", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "abssym", .pass = true }, + { .name = "[no_symlinks] trailing symlink with O_NOFOLLOW", + .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "cheeky/garbagelink", .pass = true }, + { .name = "[no_symlinks] multiple symlink components with O_NOFOLLOW", + .path = "abscheeky/absself", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .name = "[no_symlinks] multiple symlink (and garbage link) components with O_NOFOLLOW", + .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + }; + + BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS); + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + int dfd, fd; + char *fdpath = NULL; + bool failed; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + struct basic_test *test = &tests[i]; + + if (!openat2_supported) { + ksft_print_msg("openat2(2) unsupported\n"); + resultfn = ksft_test_result_skip; + goto skip; + } + + /* Auto-set O_PATH. 
*/ + if (!(test->how.flags & O_CREAT)) + test->how.flags |= O_PATH; + + if (test->dir) + dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY); + else + dfd = dup(rootfd); + E_assert(dfd, "failed to openat root '%s': %m", test->dir); + + E_dup2(dfd, hardcoded_fd); + + fd = sys_openat2(dfd, test->path, &test->how); + if (test->pass) + failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path)); + else + failed = (fd != test->out.err); + if (fd >= 0) { + fdpath = fdreadlink(fd); + close(fd); + } + close(dfd); + + if (failed) { + resultfn = ksft_test_result_fail; + + ksft_print_msg("openat2 unexpectedly returned "); + if (fdpath) + ksft_print_msg("%d['%s']\n", fd, fdpath); + else + ksft_print_msg("%d (%s)\n", fd, strerror(-fd)); + } + +skip: + if (test->pass) + resultfn("%s gives path '%s'\n", test->name, + test->out.path ?: "."); + else + resultfn("%s fails with %d (%s)\n", test->name, + test->out.err, strerror(-test->out.err)); + + fflush(stdout); + free(fdpath); + } + + free(procselfexe); + close(rootfd); + + free(hardcoded_fdpath); + close(hardcoded_fd); +} + +#define NUM_TESTS NUM_OPENAT2_OPATH_TESTS + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_set_plan(NUM_TESTS); + + /* NOTE: We should be checking for CAP_SYS_ADMIN here... */ + if (geteuid() != 0) + ksft_exit_skip("all tests require euid == 0\n"); + + test_openat2_opath_tests(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +}
Now that we have a special flag to signify magic-link jumps, mention it within the path-lookup docs. And now that "magic link" is the correct term for nd_jump_link()-style symlinks, clean up references to this type of "symlink".
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/path-lookup.rst | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..2c32795389bd 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -405,6 +405,10 @@ is requested. Keeping a reference in the ``nameidata`` ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1153,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
      $ ls -l /proc/self/fd/1
      lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1310,12 +1314,14 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason. This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate component. ``LOOKUP_MAGICLINK_JUMPED`` is set alongside
+``LOOKUP_JUMPED`` if a magic-link was traversed.
 
 Final-component flags
 ~~~~~~~~~~~~~~~~~~~~~
On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
This patchset is being developed here: https://github.com/cyphar/linux/tree/openat2/master
Patch changelog: v15:
- Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
- Split out patches for each individual LOOKUP flag.
- Reword commit messages to give more background information about the series, as well as mention the semantics of each flag in more detail.
v14: https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/ https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com v13: https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/ v12: https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/ v11: https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/ https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/ v10: https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/ v09: https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/ v08: https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/ v07: https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/ v06: https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/ v05: https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/ v04: https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/ v03: https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/ v02: https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/ v01: https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/
For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1].
This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum project[5]) with a few additions and changes made based on the previous discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides several other improvements to the openat(2) interface (see the patch description for more details). The following new LOOKUP_* flags are added:
LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. Magic-link traversal which implies a vfsmount jump is also blocked (though magic-link jumps on the same vfsmount are permitted).
LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name.
It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it will *not* fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new LOOKUP_MAGICLINK_JUMPED state flag was required.
LOOKUP_BENEATH disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional checks to protect against various races that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because magic-links can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component.
LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT.
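To make the difference between the two scoping flags concrete, here is a purely lexical userspace sketch. It is not the kernel implementation: the names scope_mode and scoped_depth() are made up for this example, and it deliberately ignores symlinks, mount points, and rename races, all of which the real lookup has to deal with. It returns how many components below the starting directory the walk ends up, or -EXDEV where LOOKUP_BENEATH would refuse.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum scope_mode { SCOPE_BENEATH, SCOPE_IN_ROOT };

/*
 * Lexical model of scoped resolution: walk the components of 'path' and
 * track the depth below the starting directory (depth 0).
 *
 *  - SCOPE_BENEATH: a leading '/' or a ".." at depth 0 is an escape and
 *    yields -EXDEV.
 *  - SCOPE_IN_ROOT: a leading '/' restarts at depth 0 and a ".." at
 *    depth 0 stays at depth 0, like ".." at the real root.
 */
int scoped_depth(const char *path, enum scope_mode mode)
{
	const char *p = path;
	int depth = 0;

	assert(path != NULL);

	if (*p == '/' && mode == SCOPE_BENEATH)
		return -EXDEV;

	while (*p) {
		size_t len;

		while (*p == '/')
			p++;
		len = strcspn(p, "/");
		if (len == 0)
			break;

		if (len == 1 && p[0] == '.') {
			/* "." never changes the position. */
		} else if (len == 2 && p[0] == '.' && p[1] == '.') {
			if (depth > 0)
				depth--;
			else if (mode == SCOPE_BENEATH)
				return -EXDEV;
			/* SCOPE_IN_ROOT: clamp at the scoped root. */
		} else {
			depth++;
		}
		p += len;
	}
	return depth;
}
```

Under this model "/../../../../abscheeky/dotdotlink" with SCOPE_IN_ROOT simply lands two components below the starting directory, while "a/../.." with SCOPE_BENEATH is rejected.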
The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[7] when opening paths in a potentially malicious container. There is a long list of CVEs that could have been mitigated by having RESOLVE_IN_ROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on libpathrs[8] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes through stale NFS handles).
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2)                 Linux Programmer's Manual                 OPENAT2(2)

NAME
       openat2 - open and possibly create a file (extended)

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int openat2(int dirfd, const char *pathname, struct open_how *how,
                   size_t size);

       Note: There is no glibc wrapper for this system call; see NOTES.

DESCRIPTION
       The openat2() system call opens the file specified by pathname. If the
       specified file does not exist, it may optionally (if O_CREAT is
       specified in how.flags) be created by openat2().

       As with openat(2), if pathname is relative, then it is interpreted
       relative to the directory referred to by the file descriptor dirfd (or
       the current working directory of the calling process, if dirfd is the
       special value AT_FDCWD.) If pathname is absolute, then dirfd is ignored
       (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
       resolved relative to dirfd.)

       The openat2() system call is an extension of openat(2) and provides a
       superset of its functionality. Rather than taking a single flag
       argument, an extensible structure (how) is passed instead to allow for
       future extensions. size must be set to sizeof(struct open_how), to
       facilitate future extensions (see the "Extensibility" section of the
       NOTES for more detail on how extensions are handled.)

   The open_how structure
       The following structure indicates how pathname should be opened, and
       acts as a superset of the flag and mode arguments to openat(2).
           struct open_how {
               __aligned_u64 flags;         /* O_* flags. */
               __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
               __u16         __padding[3];  /* Must be zeroed. */
               __aligned_u64 resolve;       /* RESOLVE_* flags. */
           };

       Any future extensions to openat2() will be implemented as new fields
       appended to the above structure (or through reuse of pre-existing
       padding space), with the zero value of the new fields acting as though
       the extension were not present.

       The meaning of each field is as follows:

       flags  The file creation and status flags to use for this operation.
              All of the O_* flags defined for openat(2) are valid openat2()
              flag values.

              Unlike openat(2), it is an error to provide openat2() unknown or
              conflicting flags in flags.

       mode   File mode for the new file, with identical semantics to the mode
              argument to openat(2). However, unlike openat(2), it is an error
              to provide openat2() with a mode which contains bits other than
              0777.

              It is an error to provide openat2() a non-zero mode if flags
              does not contain O_CREAT or O_TMPFILE.

       resolve
              Change how the components of pathname will be resolved (see
              path_resolution(7) for background information.) The primary use
              case for these flags is to allow trusted programs to restrict
              how untrusted paths (or paths inside untrusted directories) are
              resolved. The full list of resolve flags is given below.

              RESOLVE_NO_XDEV
                     Disallow traversal of mount points during path resolution
                     (including all bind mounts).

                     Users of this flag are encouraged to make its use
                     configurable (unless it is used for a specific security
                     purpose), as bind mounts are very widely used by
                     end-users. Setting this flag indiscriminately for all
                     uses of openat2() may result in spurious errors on
                     previously-functional systems.

              RESOLVE_NO_SYMLINKS
                     Disallow resolution of symbolic links during path
                     resolution. This option implies RESOLVE_NO_MAGICLINKS.

                     If the trailing component is a symbolic link, and flags
                     contains both O_PATH and O_NOFOLLOW, then an O_PATH file
                     descriptor referencing the symbolic link will be
                     returned.

                     Users of this flag are encouraged to make its use
                     configurable (unless it is used for a specific security
                     purpose), as symbolic links are very widely used by
                     end-users. Setting this flag indiscriminately for all
                     uses of openat2() may result in spurious errors on
                     previously-functional systems.

              RESOLVE_NO_MAGICLINKS
                     Disallow all magic link resolution during path
                     resolution.

                     If the trailing component is a magic link, and flags
                     contains both O_PATH and O_NOFOLLOW, then an O_PATH file
                     descriptor referencing the magic link will be returned.

                     Magic-links are symbolic link-like objects that are most
                     notably found in proc(5) (examples include
                     /proc/[pid]/exe and /proc/[pid]/fd/*.) Due to the
                     potential danger of unknowingly opening these magic
                     links, it may be preferable for users to disable their
                     resolution entirely (see symlink(7) for more details.)

              RESOLVE_BENEATH
                     Do not permit the path resolution to succeed if any
                     component of the resolution is not a descendant of the
                     directory indicated by dirfd. This results in absolute
                     symbolic links (and absolute values of pathname) being
                     rejected.

                     Currently, this flag also disables magic link resolution.
                     However, this may change in the future. The caller should
                     explicitly specify RESOLVE_NO_MAGICLINKS to ensure that
                     magic links are not resolved.

              RESOLVE_IN_ROOT
                     Treat dirfd as the root directory while resolving
                     pathname (as though the user called chroot(2) with dirfd
                     as the argument.) Absolute symbolic links and ".." path
                     components will be scoped to dirfd. If pathname is an
                     absolute path, it is also treated relative to dirfd.

                     However, unlike chroot(2) (which changes the filesystem
                     root permanently for a process), RESOLVE_IN_ROOT allows a
                     program to efficiently restrict path resolution for only
                     certain operations. It also has several hardening
                     features (such as detecting escape attempts during ".."
                     resolution) which chroot(2) does not.

                     Currently, this flag also disables magic link resolution.
                     However, this may change in the future. The caller should
                     explicitly specify RESOLVE_NO_MAGICLINKS to ensure that
                     magic links are not resolved.

              It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
       On success, a new file descriptor is returned. On error, -1 is
       returned, and errno is set appropriately.

ERRORS
       The set of errors returned by openat2() includes all of the errors
       returned by openat(2), as well as the following additional errors:

       EINVAL An unknown flag or invalid value was specified in how.

       EINVAL mode is non-zero, but flags does not contain O_CREAT or
              O_TMPFILE.

       EINVAL size was smaller than any known version of struct open_how.

       E2BIG  An extension was specified in how, which the current kernel does
              not support (see the "Extensibility" section of the NOTES for
              more detail on how extensions are handled.)

       EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and
              the kernel could not ensure that a ".." component didn't escape
              (due to a race condition or potential attack.) Callers may
              choose to retry the openat2() call.

       EXDEV  resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and
              an escape from the root during path resolution was detected.

       EXDEV  resolve contains RESOLVE_NO_XDEV, and a path component attempted
              to cross a mount point.

       ELOOP  resolve contains RESOLVE_NO_SYMLINKS, and one of the path
              components was a symbolic link (or magic link).

       ELOOP  resolve contains RESOLVE_NO_MAGICLINKS, and one of the path
              components was a magic link.
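The argument checks listed above can be modelled in userspace roughly as follows. This is only a sketch of the documented rules, not the kernel code: struct open_how_sketch and check_open_how() are illustrative names, the set of valid resolve bits is passed in by the caller because the draft does not give their values, and returning -EINVAL for non-zero padding is an assumption based on the "Must be zeroed" comment. Validation of unknown O_* flags is left out, since it needs the kernel's own mask.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the draft structure (illustrative name). */
struct open_how_sketch {
	uint64_t flags;      /* O_* flags. */
	uint16_t mode;       /* Mode for O_{CREAT,TMPFILE}. */
	uint16_t padding[3]; /* Must be zeroed. */
	uint64_t resolve;    /* RESOLVE_* flags. */
};

/*
 * Return 0 if 'how' passes the documented checks, -EINVAL otherwise.
 * 'known_resolve' is the mask of resolve bits the caller treats as valid.
 */
int check_open_how(const struct open_how_sketch *how, uint64_t known_resolve)
{
	int creates;
	size_t i;

	assert(how != NULL);

	creates = (how->flags & O_CREAT) != 0;
#ifdef O_TMPFILE
	/* O_TMPFILE is a multi-bit value, so compare against the full mask. */
	if ((how->flags & O_TMPFILE) == O_TMPFILE)
		creates = 1;
#endif

	/* Only permission bits within 0777 are accepted by the draft. */
	if (how->mode & ~0777)
		return -EINVAL;
	/* A mode only makes sense when a file may be created. */
	if (how->mode != 0 && !creates)
		return -EINVAL;
	/* Assumption: non-zero padding counts as an "invalid value". */
	for (i = 0; i < sizeof(how->padding) / sizeof(how->padding[0]); i++)
		if (how->padding[i] != 0)
			return -EINVAL;
	/* Unknown resolve bits are rejected rather than ignored. */
	if (how->resolve & ~known_resolve)
		return -EINVAL;
	return 0;
}
```

The point of the sketch is the contrast with openat(2): every field is checked, so a caller running on an older kernel gets an error instead of silently weaker behaviour.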
VERSIONS
       openat2() was added to Linux in kernel 5.FOO.

CONFORMING TO
       This system call is Linux-specific.

       The semantics of RESOLVE_BENEATH were modelled after FreeBSD's
       O_BENEATH.

NOTES
       Glibc does not provide a wrapper for this system call; call it using
       syscall(2).
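Since there is no glibc wrapper, a caller would go through syscall(2) directly. The following is a minimal sketch under stated assumptions: openat2_sketch() and struct open_how_sketch are illustrative names, the structure mirrors the draft layout above (a kernel that ships the call is expected to provide its own definition in a UAPI header, which should be preferred), and the wrapper reports ENOSYS when the installed headers do not define SYS_openat2.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Userspace mirror of the draft structure (illustrative name). */
struct open_how_sketch {
	uint64_t flags;      /* O_* flags. */
	uint16_t mode;       /* Mode for O_{CREAT,TMPFILE}. */
	uint16_t padding[3]; /* Must be zeroed. */
	uint64_t resolve;    /* RESOLVE_* flags. */
};

/*
 * Thin wrapper following the draft prototype: the size argument is
 * always sizeof(*how), which is what lets the kernel tell which
 * version of the structure userspace was built against.
 */
int openat2_sketch(int dirfd, const char *pathname,
		   struct open_how_sketch *how)
{
	assert(pathname != NULL && how != NULL);
#ifdef SYS_openat2
	return (int)syscall(SYS_openat2, dirfd, pathname, how, sizeof(*how));
#else
	/* Headers predate the syscall: behave like an old kernel would. */
	(void)dirfd;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would zero-initialise the structure, set flags (and resolve, using whatever RESOLVE_* definitions its kernel headers provide), and fall back to a userspace implementation if the call fails with ENOSYS.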
   Extensibility
       In order to allow for struct open_how to be extended in future kernel
       revisions, openat2() requires userspace to specify the size of the
       struct open_how structure it is passing. By providing this
       information, it is possible for openat2() to provide both forwards-
       and backwards-compatibility -- with size acting as an implicit version
       number (because new extension fields will always be appended, the size
       will always increase.) This extensibility design is very similar to
       other system calls such as sched_setattr(2), perf_event_open(2), and
       clone3(2).

       If we let usize be the size of the structure according to userspace
       and ksize be the size of the structure which the kernel supports, then
       there are only three cases to consider:

       *  If ksize equals usize, then there is no version mismatch and how
          can be used verbatim.

       *  If ksize is larger than usize, then there are some extensions the
          kernel supports which the userspace program is unaware of. Because
          all extensions must have their zero values be a no-op, the kernel
          treats all of the extension fields not set by userspace to have
          zero values. This provides backwards-compatibility.

       *  If ksize is smaller than usize, then there are some extensions
          which the userspace program is aware of but the kernel does not
          support. Because all extensions must have their zero values be a
          no-op, the kernel can safely ignore the unsupported extension
          fields if they are all-zero. If any unsupported extension fields
          are non-zero, then -1 is returned and errno is set to E2BIG. This
          provides forwards-compatibility.

       Therefore, most userspace programs will not need to have any special
       handling of extensions. However, if a userspace program wishes to
       determine what extensions the running kernel supports, it may conduct
       a binary search on size (to find the largest value which doesn't
       produce an error of E2BIG.)
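The three usize/ksize cases can be modelled in plain userspace C as below. This is a sketch of the rules as described, with a made-up name (copy_struct_model()); the in-kernel copy additionally has to deal with user memory access and with the minimum-size check that produces EINVAL, neither of which is shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a usize-byte structure from "userspace" (src) into a ksize-byte
 * "kernel" structure (dst):
 *
 *  - usize == ksize: copied verbatim.
 *  - usize <  ksize: the fields userspace does not know about are
 *    zero-filled.
 *  - usize >  ksize: the bytes the kernel does not know about must all
 *    be zero, otherwise -E2BIG is returned and dst is left untouched.
 */
int copy_struct_model(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	const unsigned char *tail;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Only non-empty when usize > ksize. */
	tail = (const unsigned char *)src + n;
	for (i = 0; i < usize - n; i++)
		if (tail[i] != 0)
			return -E2BIG;

	memcpy(dst, src, n);
	/* Only non-empty when ksize > usize. */
	memset((unsigned char *)dst + n, 0, ksize - n);
	return 0;
}
```

Because unknown trailing bytes are accepted only when zero, a newer program that does not actually use an extension keeps working on an older kernel, while one that does gets a clear E2BIG.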
SEE ALSO
       openat(2), path_resolution(7), symlink(7)

Linux                              2019-11-05                        OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (9):
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  open: introduce openat2(2) syscall
  selftests: add openat2(2) selftests
  Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED

 CREDITS                                       |   4 +-
 Documentation/filesystems/path-lookup.rst     |  18 +-
 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/namei.c                                    | 176 +++++-
 fs/open.c                                     | 149 +++--
 include/linux/fcntl.h                         |  12 +-
 include/linux/namei.h                         |  11 +
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/fcntl.h                    |  41 ++
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/openat2/.gitignore    |   1 +
 tools/testing/selftests/openat2/Makefile      |   8 +
 tools/testing/selftests/openat2/helpers.c     | 109 ++++
 tools/testing/selftests/openat2/helpers.h     | 107 ++++
 .../testing/selftests/openat2/openat2_test.c  | 316 +++++++++++
 .../selftests/openat2/rename_attack_test.c    | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
 35 files changed, 1591 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9
Ping -- this patch hasn't been touched for a week. Thanks.
On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > This patchset is being developed here: https://github.com/cyphar/linux/tree/openat2/master
> >
> > Patch changelog: v15:
> > - Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > - Split out patches for each individual LOOKUP flag.
> > - Reword commit messages to give more background information about the series, as well as mention the semantics of each flag in more detail.
> > [...]
>
> Ping -- this patch hasn't been touched for a week. Thanks.

If I've been following correctly, everyone is happy with this series. (i.e. Linus's comments appear to have been addressed.)

Perhaps the next question is: should this go via a pull request by you to Linus directly during the v5.5 merge window, via akpm, via Christian, or some other path? Besides Linus, it's not been clear who should "claim" this series. :)

On Tue, Nov 12, 2019 at 03:01:26PM -0800, Kees Cook wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patchset is being developed here: https://github.com/cyphar/linux/tree/openat2/master
> > >
> > > Patch changelog: v15:
> > > - Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > > - Split out patches for each individual LOOKUP flag.
> > > - Reword commit messages to give more background information about the series, as well as mention the semantics of each flag in more detail.
> > > [...]
> >
> > Ping -- this patch hasn't been touched for a week. Thanks.
>
> If I've been following correctly, everyone is happy with this series. (i.e. Linus's comments appear to have been addressed.)
>
> Perhaps the next question is: should this go via a pull request by you to Linus directly during the v5.5 merge window, via akpm, via Christian, or some other path? Besides Linus, it's not been clear who should "claim" this series. :)

I like this series and, as with the copy_struct_from_user() part of it that I've already taken, I'm happy to stuff this into a dedicated branch, merge it into my for-next, and send it for v5.5. Though I'd _much_ rather see Al pick this up or have him give his blessing first.

Christian

On 2019-11-12, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > This patchset is being developed here: https://github.com/cyphar/linux/tree/openat2/master
> > >
> > > Patch changelog: v15:
> > > - Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
> > > - Split out patches for each individual LOOKUP flag.
> > > - Reword commit messages to give more background information about the series, as well as mention the semantics of each flag in more detail.
> > > [...]
> >
> > Ping -- this patch hasn't been touched for a week. Thanks.
>
> If I've been following correctly, everyone is happy with this series. (i.e. Linus's comments appear to have been addressed.)
>
> Perhaps the next question is: should this go via a pull request by you to Linus directly during the v5.5 merge window, via akpm, via Christian, or some other path? Besides Linus, it's not been clear who should "claim" this series. :)

Given the namei changes, I wanted to avoid stepping on Al's toes. Though he did review the series a few versions ago, the discussion didn't focus on the openat2(2) semantics (which have also changed since then). I'm not sure whether to interpret the silence to mean he's satisfied with things as they are, or if he hasn't had more time to look at the series.
As for which tree it should be routed through, I don't mind -- Christian is the most straightforward choice (but if Al wants to route it, that's fine with me too).