Patch changelog: v9: * Replace resolveat(2) with openat2(2). [Linus] * Output a warning to dmesg if may_open_magiclink() is violated. * Add an openat2(O_CREAT) testcase. v8: * Default to O_CLOEXEC to match other new fd-creation syscalls (users can always disable O_CLOEXEC afterwards). [Christian] * Implement magic-link restrictions based on their mode. This is done through a series of masks and is designed to avoid breaking users -- most users don't have chained O_PATH fd re-opens. * Add O_EMPTYPATH which allows for fd re-opening without needing procfs. This would help some users of fd re-opening, and with the changes to magic-link permissions we now have the right semantics for such a flag. * Add selftests for resolveat(2), O_EMPTYPATH, and the magic-link mode semantics. v7: * Remove execveat(2) support for these flags since it might result in some pretty hairy security issues with setuid binaries. There are other avenues we can go down to solve the issues with CVE-2019-5736. [Jann] * Reserve an additional bit in resolveat(2) for the eXecute access mode if we end up implementing it. v6: * Drop O_* flags API to the new LOOKUP_ path scoping bits and instead introduce resolveat(2) as an alternative method of obtaining an O_PATH. The justification for this is included in patch 6 (though switching back to O_* flags is trivial). v5: * In response to CVE-2019-5736 (one of the vectors showed that open(2)+fexec(3) cannot be used to scope binfmt_script's implicit open_exec()), AT_* flags have been re-added and are now piped through to binfmt_script (and other binfmt_* that use open_exec) but are only supported for execveat(2) for now. v4: * Remove AT_* flag reservations, as they require more discussion. * Switch to path_is_under() over __d_path() for breakout checking. * Make O_XDEV no longer block openat("/tmp", "/", O_XDEV) -- dirfd is now ignored for absolute paths to match other flags. * Improve the dirfd_path_init() refactor and move it to a separate commit. * Remove reference to Linux-capsicum. * Switch "proclink" name to magic-link. v3: [resend] v2: * Made ".." resolution with AT_THIS_ROOT and AT_BENEATH safe(r) with some semi-aggressive __d_path checking (see patch 3). * Disallowed "proclinks" with AT_THIS_ROOT and AT_BENEATH, in the hopes they can be re-enabled once safe. * Removed the selftests as they will be reimplemented as xfstests. * Removed stat(2) support, since you can already get it through O_PATH and fstatat(2).
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the Capsicum project[4]) with a few additions and changes made based on the previous discussion within [5] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides an alternative way to get an O_PATH file descriptor (the reasoning for doing this is included in patch 6). The following new LOOKUP_ flags are added:
* LOOKUP_XDEV blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this.
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name.
It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it will *not* fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional to protect against various races that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[6] when opening paths in a potentially malicious container. There is a long list of CVEs that could have bene mitigated by having O_THISROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few).
And further, several semantics of file descriptor "re-opening" are now changed to prevent attacks like CVE-2019-5736 by restricting how magic-links can be resolved (based on their mode). This required some other changes to the semantics of the modes of O_PATH file descriptor's associated /proc/self/fd magic-links. openat2(2) has the ability to further restrict re-opening of its own O_PATH fds, so that users can make even better use of this feature.
Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style re-opening without depending on procfs. The new restricted semantics for magic-links are applied here too.
In order to make all of the above more usable, I'm working on libpathrs[7] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready.
Cc: Al Viro viro@zeniv.linux.org.uk Cc: Eric Biederman ebiederm@xmission.com Cc: Andy Lutomirski luto@kernel.org Cc: David Howells dhowells@redhat.com Cc: Jann Horn jannh@google.com Cc: Christian Brauner christian@brauner.io Cc: David Drysdale drysdale@google.com Cc: Tycho Andersen tycho@tycho.ws Cc: Kees Cook keescook@chromium.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: containers@lists.linux-foundation.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org
[1]: https://lwn.net/Articles/721443/ [2]: https://lore.kernel.org/patchwork/patch/784221/ [3]: https://lwn.net/Articles/619151/ [4]: https://lwn.net/Articles/603929/ [5]: https://lwn.net/Articles/723057/ [6]: https://github.com/cyphar/filepath-securejoin [7]: https://github.com/openSUSE/libpathrs
Aleksa Sarai (10): namei: obey trailing magic-link DAC permissions procfs: switch magic-link modes to be more sane open: O_EMPTYPATH: procfs-less file descriptor re-opening namei: split out nd->dfd handling to dirfd_path_init namei: O_BENEATH-style path resolution flags namei: LOOKUP_IN_ROOT: chroot-like path resolution namei: aggressively check for nd->root escape on ".." resolution open: openat2(2) syscall kselftest: save-and-restore errno to allow for %m formatting selftests: add openat2(2) selftests
arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/fcntl.c | 2 +- fs/internal.h | 1 + fs/namei.c | 333 ++++++++++++--- fs/open.c | 140 +++++-- fs/proc/base.c | 20 +- fs/proc/fd.c | 23 +- fs/proc/namespaces.c | 2 +- include/linux/fcntl.h | 17 +- include/linux/fs.h | 8 +- include/linux/namei.h | 8 + include/linux/syscalls.h | 14 +- include/uapi/asm-generic/fcntl.h | 5 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/fcntl.h | 38 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/kselftest.h | 15 + tools/testing/selftests/memfd/memfd_test.c | 7 +- tools/testing/selftests/openat2/.gitignore | 1 + tools/testing/selftests/openat2/Makefile | 12 + tools/testing/selftests/openat2/helpers.c | 162 +++++++ tools/testing/selftests/openat2/helpers.h | 114 +++++ .../testing/selftests/openat2/linkmode_test.c | 325 ++++++++++++++ .../selftests/openat2/rename_attack_test.c | 124 ++++++ .../testing/selftests/openat2/resolve_test.c | 395 ++++++++++++++++++ 42 files changed, 1667 insertions(+), 125 deletions(-) create mode 100644 tools/testing/selftests/openat2/.gitignore create mode 100644 tools/testing/selftests/openat2/Makefile create mode 100644 tools/testing/selftests/openat2/helpers.c create mode 100644 tools/testing/selftests/openat2/helpers.h create mode 100644 tools/testing/selftests/openat2/linkmode_test.c create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c create mode 100644 tools/testing/selftests/openat2/resolve_test.c
The ability for userspace to "re-open" file descriptors through /proc/self/fd has been a very useful tool for all sorts of usecases (container runtimes are one common example). However, the current interface for doing this has resulted in some pretty subtle security holes. Userspace can re-open a file descriptor with more permissions than the original, which can result in cases such as /proc/$pid/exe being re-opened O_RDWR at a later date even though (by definition) /proc/$pid/exe cannot be opened for writing. When combined with O_PATH the results can get even more confusing.
We cannot block this outright. Aside from userspace already depending on it, it's a useful feature which can actually increase the security of userspace. For instance, LXC keeps an O_PATH of the container's /dev/pts/ptmx that gets re-opened to create new ptys and then uses TIOCGPTPEER to get the slave end. This allows for pty allocation without resolving paths inside an (untrusted) container's rootfs. There isn't a trivial way of doing this that is as straight-forward and safe as O_PATH re-opening.
Instead we have to restrict it in such a way that it doesn't break (good) users but does block potential attackers. The solution applied in this patch is to restrict *re-opening* (not resolution through) magic-links by requiring that mode of the link be obeyed. Normal symlinks have modes of a+rwx but magic-links have other modes. These magic-link modes were historically ignored during path resolution, but they've now been re-purposed for more useful ends.
It is also necessary to define semantics for the mode of an O_PATH descriptor, since re-opening a magic-link through an O_PATH needs to be just as restricted as the corresponding magic-link otherwise the above protection can be bypassed. There are two distinct cases:
1. The target is a regular file (not a magic-link). Userspace depends on being able to re-open the O_PATH of a regular file, so we must define the mode to be a+rwx.
2. The target is a magic-link. In this case, we simply copy the mode of the magic-link. This results in an O_PATH of a magic-link effectively acting as a no-op in terms of how much re-opening privileges a process has.
CAP_DAC_OVERRIDE can be used to override all of these restrictions, but we only permit &init_userns's capabilities to affect these semantics. The reason for this is that there isn't a clear way to track what user_ns is the original owner of a given O_PATH chain -- thus an unprivileged user could create a new userns and O_PATH the file descriptor, owning it. All signs would indicate that the user really does have CAP_DAC_OVERRIDE over the new descriptor and the protection would be bypassed. We thus opt for the more conservative approach.
I have run this patch on several machines for several days. So far, the only processes which have hit this case ("loadkeys" and "kbd_mode" from the kbd package[1]) gracefully handle the permission error and do not cause any user-visible problems. In order to give users a heads-up, a warning is given whenever may_open_magiclink() refuses access.
[1]: http://git.altlinux.org/people/legion/packages/kbd.git
Co-developed-by: Andy Lutomirski luto@kernel.org Co-developed-by: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/internal.h | 1 + fs/namei.c | 103 +++++++++++++++++++++++++++++++++++++++++++-- fs/open.c | 3 +- fs/proc/fd.c | 23 +++++++++- include/linux/fs.h | 4 ++ 5 files changed, 127 insertions(+), 7 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h index a48ef81be37d..12847f502f49 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -119,6 +119,7 @@ struct open_flags { int acc_mode; int intent; int lookup_flags; + fmode_t opath_mask; }; extern struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op); diff --git a/fs/namei.c b/fs/namei.c index 20831c2fbb34..4ec6168762db 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -506,6 +506,8 @@ struct nameidata { struct inode *link_inode; unsigned root_seq; int dfd; + fmode_t opath_mask; + int acc_mode; /* op.acc_mode */ } __randomize_layout;
static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) @@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) p->stack = p->internal; p->dfd = dfd; p->name = name; - p->total_link_count = old ? old->total_link_count : 0; + p->total_link_count = 0; + p->acc_mode = 0; + p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE; + if (old) { + p->total_link_count = old->total_link_count; + p->acc_mode = old->acc_mode; + p->opath_mask = old->opath_mask; + } p->saved = old; current->nameidata = p; } @@ -1042,8 +1051,52 @@ static int may_create_in_sticky(struct dentry * const dir, return 0; }
+/** + * may_reopen_magiclink - Check permissions for opening a trailing magic-link + * @opath_mask: the O_PATH mask of the magic-link + * @acc_mode: ACC_MODE which the user is attempting + * + * We block magic-link re-opening if the @opath_mask is more strict than the + * @acc_mode being requested, unless the user is capable(CAP_DAC_OVERRIDE). + * + * Returns 0 if successful, -ve on error. + */ +static int may_open_magiclink(fmode_t opath_mask, int acc_mode) +{ + /* + * We only allow for init_userns to be able to override magic-links. + * This is done to avoid cases where an unprivileged userns could take + * an O_PATH of the fd, resulting in it being very unclear whether + * CAP_DAC_OVERRIDE should work on the new O_PATH fd (given that it + * pipes through to the underlying file). + */ + if (capable(CAP_DAC_OVERRIDE)) + return 0; + + if ((acc_mode & MAY_READ) && + !(opath_mask & (FMODE_READ | FMODE_PATH_READ))) + goto err; + if ((acc_mode & MAY_WRITE) && + !(opath_mask & (FMODE_WRITE | FMODE_PATH_WRITE))) + goto err; + + return 0; + +err: + pr_warn_ratelimited("%s[%d]: magic-link re-open blocked (acc_mode=%s%s%s, opath_mask=%s%s%s%s)", + current->comm, task_pid_nr(current), + (acc_mode & MAY_READ) ? "r": "", + (acc_mode & MAY_WRITE) ? "w": "", + (acc_mode & MAY_EXEC) ? "x": "", + (opath_mask & FMODE_READ) ? "R" : "", + (opath_mask & FMODE_PATH_READ) ? "r" : "", + (opath_mask & FMODE_WRITE) ? "W" : "", + (opath_mask & FMODE_PATH_WRITE) ? "w" : ""); + return -EACCES; +} + static __always_inline -const char *get_link(struct nameidata *nd) +const char *get_link(struct nameidata *nd, bool trailing) { struct saved *last = nd->stack + nd->depth - 1; struct dentry *dentry = last->link.dentry; @@ -1081,6 +1134,44 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); } + /* If we just jumped it was because of a magic-link. */ + if (unlikely(nd->flags & LOOKUP_JUMPED)) { + /* + * For trailing_symlink we check whether the symlink's + * mode allows us to do what we want through acc_mode. + * In addition, we need to stash away what the link + * mode is in case we are about to O_PATH this + * magic-link. + * + * This is only done for magic-links, as a security + * measure to prevent users from being able to re-open + * files with additional permissions or similar tricks + * through procfs. This is not strictly POSIX-friendly, + * but technically neither are magic-links. + */ + if (trailing) { + fmode_t opath_mask = 0; + + /* + * Figure out the O_PATH mask. Rather than + * using acl_permission_check, we check whether + * any of the rw bits are set in the mode. + */ + if (inode->i_mode & S_IRUGO) + opath_mask |= FMODE_PATH_READ; + if (inode->i_mode & S_IWUGO) + opath_mask |= FMODE_PATH_WRITE; + + /* + * Is the new opath_mask more restrictive than + * the acc_mode being requested? + */ + error = may_open_magiclink(opath_mask, nd->acc_mode); + if (error) + return ERR_PTR(error); + nd->opath_mask &= opath_mask; + } + } if (IS_ERR_OR_NULL(res)) return res; } @@ -2142,7 +2233,7 @@ static int link_path_walk(const char *name, struct nameidata *nd) return err;
if (err) { - const char *s = get_link(nd); + const char *s = get_link(nd, false);
if (IS_ERR(s)) return PTR_ERR(s); @@ -2258,7 +2349,7 @@ static const char *trailing_symlink(struct nameidata *nd) return ERR_PTR(error); nd->flags |= LOOKUP_PARENT; nd->stack[0].name = NULL; - s = get_link(nd); + s = get_link(nd, true); return s ? s : ""; }
@@ -3508,6 +3599,7 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file) if (!error) { audit_inode(nd->name, path.dentry, 0); error = vfs_open(&path, file); + file->f_mode |= nd->opath_mask; path_put(&path); } return error; @@ -3519,6 +3611,9 @@ static struct file *path_openat(struct nameidata *nd, struct file *file; int error;
+ nd->acc_mode = op->acc_mode; + nd->opath_mask = op->opath_mask; + file = alloc_empty_file(op->open_flag, current_cred()); if (IS_ERR(file)) return file; diff --git a/fs/open.c b/fs/open.c index b5b80469b93d..ab20eae39df7 100644 --- a/fs/open.c +++ b/fs/open.c @@ -982,8 +982,9 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode; - op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN; + /* For O_PATH backwards-compatibility we default to an all-set mask. */ + op->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
if (flags & O_CREAT) { op->intent |= LOOKUP_CREATE; diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 81882a13212d..9b7d8becb002 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -104,11 +104,30 @@ static void tid_fd_update_inode(struct task_struct *task, struct inode *inode, task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
if (S_ISLNK(inode->i_mode)) { + /* + * Always set +x (depending on the fmode type), since there + * currently aren't FMODE_PATH_EXEC restrictions and there is + * no O_MAYEXEC yet. This might change in the future, in which + * case we will restrict +x. + */ unsigned i_mode = S_IFLNK; + if (f_mode & FMODE_PATH) + i_mode |= S_IXGRP; + else + i_mode |= S_IXUSR; + /* + * Construct the mode bits based on the open-mode. The u+rwx + * bits are for "ordinary" open modes while g+rwx are for + * O_PATH modes. + */ if (f_mode & FMODE_READ) - i_mode |= S_IRUSR | S_IXUSR; + i_mode |= S_IRUSR; if (f_mode & FMODE_WRITE) - i_mode |= S_IWUSR | S_IXUSR; + i_mode |= S_IWUSR; + if (f_mode & FMODE_PATH_READ) + i_mode |= S_IRGRP; + if (f_mode & FMODE_PATH_WRITE) + i_mode |= S_IWGRP; inode->i_mode = i_mode; } security_task_to_inode(task, inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index f7fdfe93e25d..f7df213405ea 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -173,6 +173,10 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File does not contribute to nr_files count */ #define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
+/* File is an O_PATH descriptor which can be upgraded to (read, write). */ +#define FMODE_PATH_READ ((__force fmode_t)0x40000000) +#define FMODE_PATH_WRITE ((__force fmode_t)0x80000000) + /* * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector * that indicates that they should check the contents of the iovec are
On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
@@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) p->stack = p->internal; p->dfd = dfd; p->name = name;
- p->total_link_count = old ? old->total_link_count : 0;
- p->total_link_count = 0;
- p->acc_mode = 0;
- p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
- if (old) {
p->total_link_count = old->total_link_count;
p->acc_mode = old->acc_mode;
p->opath_mask = old->opath_mask;
- }
Huh? Could somebody explain why traversals of NFS4 referrals should inherit ->acc_mode and ->opath_mask?
static __always_inline -const char *get_link(struct nameidata *nd) +const char *get_link(struct nameidata *nd, bool trailing) { struct saved *last = nd->stack + nd->depth - 1; struct dentry *dentry = last->link.dentry; @@ -1081,6 +1134,44 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); }
/* If we just jumped it was because of a magic-link. */
if (unlikely(nd->flags & LOOKUP_JUMPED)) {
That's not quite guaranteed (it is possible to bind a symlink on top of a regular file, and you will get LOOKUP_JUMPED on the entry into trailing_symlink() when looking the result up). Moreover, why bother with LOOKUP_JUMPED here? See that nd->last_type = LAST_BIND; several lines prior? That's precisely to be able to recognize those suckers.
And _that_ would've avoided another piece of ugliness - your LOOKUP_JUMPED kludge forces you to handle that cra^Wsclero^Wvaluable security hardening in get_link(), instead of trailing_symlink() where you apparently want it to be. Simply because nd_jump_root() done later in get_link() will set LOOKUP_JUMPED for absolute symlinks, confusing your test.
Moreover, I'm not sure that trailing_symlink() is the right place for that either - I would be rather tempted to fold do_o_path() into path_openat(), inline path_lookupat() there (as in s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) && ((error = lookup_last(nd)) > 0)) { s = trailing_symlink(nd); } if (!error) error = complete_walk(nd); if (!error && nd->flags & LOOKUP_DIRECTORY) if (!d_can_lookup(nd->path.dentry)) error = -ENOTDIR; if (!error) { audit_inode(nd->name, nd->path.dentry, 0); error = vfs_open(&nd->path, file); } terminate_walk(nd); - we don't need LOOKUP_DOWN there) and then we only care about the two callers of trailing_symlink() that are in path_openat(). Which is where you have your ->acc_mode and ->opath_mask without the need to dump them into nameidata. Or to bring that mess into the things like stat(2) et.al. - it simply doesn't belong there.
In any case, this "bool trailing" is completely wrong; whether that check belongs in trailing_symlink() or (some of) its callers, putting it into get_link() is a mistake, forced by kludgy check for procfs-style symlinks.
On Fri, Jul 12, 2019 at 05:14:54AM +0100, Al Viro wrote:
That's not quite guaranteed (it is possible to bind a symlink on top of a regular file, and you will get LOOKUP_JUMPED on the entry into trailing_symlink() when looking the result up). Moreover, why bother with LOOKUP_JUMPED here? See that nd->last_type = LAST_BIND; several lines prior? That's precisely to be able to recognize those suckers.
... except that this won't work these days (once upon a time it used to, but that had been a long time ago). However, that doesn't change the fact that the test is really wrong. So let's do it right:
* set a new flag along with LOOKUP_JUMPED in nd_jump_link() * clear it in get_link() right before res = READ_ONCE(inode->i_link); * check it in trailing_symlink() (or callers thereof)
The rest of comments stand, AFAICS.
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
@@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) p->stack = p->internal; p->dfd = dfd; p->name = name;
- p->total_link_count = old ? old->total_link_count : 0;
- p->total_link_count = 0;
- p->acc_mode = 0;
- p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
- if (old) {
p->total_link_count = old->total_link_count;
p->acc_mode = old->acc_mode;
p->opath_mask = old->opath_mask;
- }
Huh? Could somebody explain why traversals of NFS4 referrals should inherit ->acc_mode and ->opath_mask?
I'll be honest -- I don't understand what set_nameidata() did so I just did what I thought would be an obvious change (to just copy the contents). I thought it was related to some aspect of the symlink stack handling.
In that case, should they both be set to 0 on set_nameidata()? This will mean that fd re-opening (or magic-link opening) through a set_nameidata() would always fail.
static __always_inline -const char *get_link(struct nameidata *nd) +const char *get_link(struct nameidata *nd, bool trailing) { struct saved *last = nd->stack + nd->depth - 1; struct dentry *dentry = last->link.dentry; @@ -1081,6 +1134,44 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); }
/* If we just jumped it was because of a magic-link. */
if (unlikely(nd->flags & LOOKUP_JUMPED)) {
[...] In any case, this "bool trailing" is completely wrong; whether that check belongs in trailing_symlink() or (some of) its callers, putting it into get_link() is a mistake, forced by kludgy check for procfs-style symlinks.
The error path for LOOKUP_JUMPED comes from the old O_BENEATH patchset, but all of the "bool trailing" logic is definitely my gaff (I was quietly hoping you'd have a much better solution than the whole get_link() thing -- it definitely felt very kludgey to write).
I will work on the suggestion in your follow-up email. Thanks!
On Fri, Jul 12, 2019 at 10:20:17PM +1000, Aleksa Sarai wrote:
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
@@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) p->stack = p->internal; p->dfd = dfd; p->name = name;
- p->total_link_count = old ? old->total_link_count : 0;
- p->total_link_count = 0;
- p->acc_mode = 0;
- p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
- if (old) {
p->total_link_count = old->total_link_count;
p->acc_mode = old->acc_mode;
p->opath_mask = old->opath_mask;
- }
Huh? Could somebody explain why traversals of NFS4 referrals should inherit ->acc_mode and ->opath_mask?
I'll be honest -- I don't understand what set_nameidata() did so I just did what I thought would be an obvious change (to just copy the contents). I thought it was related to some aspect of the symlink stack handling.
No. It's handling of (very rare) nested pathwalk. The only case I can think of is handling of NFS4 referrals - they are triggered by ->d_automount() and include NFS4 mount. Which does internal pathwalk of its own, to get to the root of subtree being automounted.
NFS has its own recursion protection on that path (no deeper nesting than one level of referral traversals), but there some nesting is inevitable; we do get another nameidata instance on stack. And for nd_jump_link() we need to keep track of the innermost one.
For symlinks nothing of that sort happens - they are dealt with on the same struct nameidata. ->total_link_count copying is there for one reason only - we want the total amount of symlinks traversed during the pathwalk (including the referral processing, etc.) to count towards MAXSYMLINKS check. It could've been moved from nameidata to task_struct, but it's cheaper to handle it that way.
Again, nesting is *rare*.
In that case, should they both be set to 0 on set_nameidata()? This will mean that fd re-opening (or magic-link opening) through a set_nameidata() would always fail.
Huh? set_nameidata() is done for *all* instances - it's pretty much the constructor of that object (and restore_nameidata() - a destructor). Everything goes through it.
And again, I'm not sure we want these fields in nameidata - IMO they belong in open_flags. Things like e.g. stat() don't need them at all.
Incidentally, O_PATH opening of symlinks combined with subsequent procfs symlink traversals is worth testing - that's where the things get subtle and that's where it's easy to get in trouble on modifications.
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Fri, Jul 12, 2019 at 10:20:17PM +1000, Aleksa Sarai wrote:
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:28AM +1000, Aleksa Sarai wrote:
@@ -514,7 +516,14 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name) p->stack = p->internal; p->dfd = dfd; p->name = name;
- p->total_link_count = old ? old->total_link_count : 0;
- p->total_link_count = 0;
- p->acc_mode = 0;
- p->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE;
- if (old) {
p->total_link_count = old->total_link_count;
p->acc_mode = old->acc_mode;
p->opath_mask = old->opath_mask;
- }
Huh? Could somebody explain why traversals of NFS4 referrals should inherit ->acc_mode and ->opath_mask?
I'll be honest -- I don't understand what set_nameidata() did so I just did what I thought would be an obvious change (to just copy the contents). I thought it was related to some aspect of the symlink stack handling.
No. It's handling of (very rare) nested pathwalk. The only case I can think of is handling of NFS4 referrals - they are triggered by ->d_automount() and include NFS4 mount. Which does internal pathwalk of its own, to get to the root of subtree being automounted.
NFS has its own recursion protection on that path (no deeper nesting than one level of referral traversals), but there some nesting is inevitable; we do get another nameidata instance on stack. And for nd_jump_link() we need to keep track of the innermost one.
For symlinks nothing of that sort happens - they are dealt with on the same struct nameidata. ->total_link_count copying is there for one reason only - we want the total amount of symlinks traversed during the pathwalk (including the referral processing, etc.) to count towards MAXSYMLINKS check. It could've been moved from nameidata to task_struct, but it's cheaper to handle it that way.
Again, nesting is *rare*.
Thanks for the explanation, much appreciated. I will drop the old->... copying hunk.
In that case, should they both be set to 0 on set_nameidata()? This will mean that fd re-opening (or magic-link opening) through a set_nameidata() would always fail.
Huh? set_nameidata() is done for *all* instances - it's pretty much the constructor of that object (and restore_nameidata() - a destructor). Everything goes through it.
Sorry, I meant to drop the copy-from-old logic -- not set it to zero explicitly in set_nameidata().
And again, I'm not sure we want these fields in nameidata - IMO they belong in open_flags. Things like e.g. stat() don't need them at all.
Yup, I'll work up a version that does the consolidation you mentioned in your other mail.
Incidentally, O_PATH opening of symlinks combined with subsequent procfs symlink traversals is worth testing - that's where the things get subtle and that's where it's easy to get in trouble on modifications.
I have some self-tests of a symlink-to-a-magic-link in the last patch of the series. Did you mean something even more chained like a symlink to a /proc/self/fd/$n of an O_NOFOLLOW|O_PATH of a symlink?
Now that magic-link modes are obeyed for file re-opening purposes, some of the pre-existing magic-link modes need to be adjusted to be more semantically correct.
The most blatant example of this is /proc/self/exe, which had a mode of a+rwx even though tautologically the file could never be opened for writing (because it is the current->mm of a live process).
With the new O_PATH restrictions, changing the default mode of these magic-links allows us to avoid delayed-access attacks such as we saw in CVE-2019-5736.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/proc/base.c | 20 ++++++++++---------- fs/proc/namespaces.c | 2 +- 2 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c index 255f6754c70d..82c06c21e69d 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -133,9 +133,9 @@ struct pid_entry {
#define DIR(NAME, MODE, iops, fops) \ NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} ) -#define LNK(NAME, get_link) \ - NOD(NAME, (S_IFLNK|S_IRWXUGO), \ - &proc_pid_link_inode_operations, NULL, \ +#define LNK(NAME, MODE, get_link) \ + NOD(NAME, (S_IFLNK|(MODE)), \ + &proc_pid_link_inode_operations, NULL, \ { .proc_get_link = get_link } ) #define REG(NAME, MODE, fops) \ NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {}) @@ -2995,9 +2995,9 @@ static const struct pid_entry tgid_base_stuff[] = { REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations), #endif REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), - LNK("cwd", proc_cwd_link), - LNK("root", proc_root_link), - LNK("exe", proc_exe_link), + LNK("cwd", S_IRWXUGO, proc_cwd_link), + LNK("root", S_IRWXUGO, proc_root_link), + LNK("exe", S_IRUGO|S_IXUGO, proc_exe_link), REG("mounts", S_IRUGO, proc_mounts_operations), REG("mountinfo", S_IRUGO, proc_mountinfo_operations), REG("mountstats", S_IRUSR, proc_mountstats_operations), @@ -3393,11 +3393,11 @@ static const struct pid_entry tid_base_stuff[] = { REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations), #endif REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), - LNK("cwd", proc_cwd_link), - LNK("root", proc_root_link), - LNK("exe", proc_exe_link), + LNK("cwd", S_IRWXUGO, proc_cwd_link), + LNK("root", S_IRWXUGO, proc_root_link), + LNK("exe", S_IRUGO|S_IXUGO, proc_exe_link), REG("mounts", S_IRUGO, proc_mounts_operations), - REG("mountinfo", S_IRUGO, proc_mountinfo_operations), + REG("mountinfo", S_IRUGO, proc_mountinfo_operations), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, proc_clear_refs_operations), REG("smaps", S_IRUGO, proc_pid_smaps_operations), diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c index dd2b35f78b09..cd1e130913f7 100644 --- a/fs/proc/namespaces.c +++ b/fs/proc/namespaces.c @@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry *dentry, struct inode *inode; struct proc_inode *ei;
- inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO); + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRUGO); if (!inode) return ERR_PTR(-ENOENT);
Userspace has made use of /proc/self/fd very liberally to allow for descriptors to be re-opened. There are a wide variety of uses for this feature, but it has always required constructing a pathname and could not be done without procfs mounted. The obvious solution for this is to extend openat(2) to have an AT_EMPTY_PATH-equivalent -- O_EMPTYPATH.
Now that descriptor re-opening has been made safe through the new magic-link resolution restrictions, we can replicate these restrictions for O_EMPTYPATH. In particular, we only allow "upgrading" the file descriptor if the corresponding FMODE_PATH_* bit is set (or the FMODE_{READ,WRITE} cases for non-O_PATH file descriptors).
When doing openat(O_EMPTYPATH|O_PATH), O_PATH takes precedence and O_EMPTYPATH is ignored. Very few users ever have a need to O_PATH re-open an existing file descriptor, and so accommodating them at the expense of further complicating O_PATH makes little sense. Ultimately, if users ask for this we can always add RESOLVE_EMPTY_PATH to resolveat(2) in the future.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/fcntl.c | 2 +- fs/namei.c | 27 +++++++++++++++++++++++++++ fs/open.c | 7 ++++++- include/linux/fcntl.h | 2 +- include/uapi/asm-generic/fcntl.h | 5 +++++ 5 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index 3d40771e8e7c..4cf05a2fd162 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index 4ec6168762db..4895717d2760 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3605,6 +3605,31 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file) return error; }
+static int do_o_emptypath(struct nameidata *nd, struct file *newfile) +{ + int error; + struct fd f; + + /* We don't support AT_FDCWD since O_PATH cannot be set here. */ + f = fdget_raw(nd->dfd); + if (!f.file) + return -EBADF; + + /* Update opath_mask as though we went through trailing_symlink(). */ + if (!(f.file->f_mode & (FMODE_READ | FMODE_PATH_READ))) + nd->opath_mask &= ~FMODE_PATH_READ; + if (!(f.file->f_mode & (FMODE_WRITE | FMODE_PATH_WRITE))) + nd->opath_mask &= ~FMODE_PATH_WRITE; + + /* Obey the same restrictions as magic-links. */ + error = may_open_magiclink(f.file->f_mode, nd->acc_mode); + if (!error) + error = vfs_open(&f.file->f_path, newfile); + + fdput(f); + return error; +} + static struct file *path_openat(struct nameidata *nd, const struct open_flags *op, unsigned flags) { @@ -3620,6 +3645,8 @@ static struct file *path_openat(struct nameidata *nd,
if (unlikely(file->f_flags & __O_TMPFILE)) { error = do_tmpfile(nd, flags, op, file); + } else if (unlikely(file->f_flags & O_EMPTYPATH)) { + error = do_o_emptypath(nd, file); } else if (unlikely(file->f_flags & O_PATH)) { error = do_o_path(nd, flags, file); } else { diff --git a/fs/open.c b/fs/open.c index ab20eae39df7..bdca45528524 100644 --- a/fs/open.c +++ b/fs/open.c @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_DIRECTORY; if (!(flags & O_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; + if (flags & O_EMPTYPATH) + lookup_flags |= LOOKUP_EMPTY; op->lookup_flags = lookup_flags; return 0; } @@ -1057,14 +1059,17 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) { struct open_flags op; int fd = build_open_flags(flags, mode, &op); + int empty = 0; struct filename *tmp;
if (fd) return fd;
- tmp = getname(filename); + tmp = getname_flags(filename, op.lookup_flags, &empty); if (IS_ERR(tmp)) return PTR_ERR(tmp); + if (!empty) + op.open_flag &= ~O_EMPTYPATH;
fd = get_unused_fd_flags(flags); if (fd >= 0) { diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index d019df946cb2..2868ae6c8fc1 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -9,7 +9,7 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_EMPTYPATH)
#ifndef force_o_largefile #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T)) diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..307f7c414a51 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -89,6 +89,11 @@ #define __O_TMPFILE 020000000 #endif
+#ifndef O_EMPTYPATH +#define O_EMPTYPATH 040000000 +#endif + + /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
Previously, path_init's handling of *at(dfd, ...) was only done once, but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the initial nd->path at different times (before or after absolute path handling) depending on whether we have been asked to scope resolution within a root.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namei.c | 103 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 59 insertions(+), 44 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 4895717d2760..b490bcf855f8 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2257,9 +2257,59 @@ static int link_path_walk(const char *name, struct nameidata *nd) } }
+/* + * Configure nd->path based on the nd->dfd. This is only used as part of + * path_init(). + */ +static inline int dirfd_path_init(struct nameidata *nd) +{ + if (nd->dfd == AT_FDCWD) { + if (nd->flags & LOOKUP_RCU) { + struct fs_struct *fs = current->fs; + unsigned seq; + + do { + seq = read_seqcount_begin(&fs->seq); + nd->path = fs->pwd; + nd->inode = nd->path.dentry->d_inode; + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); + } while (read_seqcount_retry(&fs->seq, seq)); + } else { + get_fs_pwd(current->fs, &nd->path); + nd->inode = nd->path.dentry->d_inode; + } + } else { + /* Caller must check execute permissions on the starting path component */ + struct fd f = fdget_raw(nd->dfd); + struct dentry *dentry; + + if (!f.file) + return -EBADF; + + dentry = f.file->f_path.dentry; + + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) { + fdput(f); + return -ENOTDIR; + } + + nd->path = f.file->f_path; + if (nd->flags & LOOKUP_RCU) { + nd->inode = nd->path.dentry->d_inode; + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); + } else { + path_get(&nd->path); + nd->inode = nd->path.dentry->d_inode; + } + fdput(f); + } + return 0; +} + /* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) { + int error; const char *s = nd->name->name;
if (!*s) @@ -2293,52 +2343,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock); if (*s == '/') { - set_root(nd); - if (likely(!nd_jump_root(nd))) - return s; - return ERR_PTR(-ECHILD); - } else if (nd->dfd == AT_FDCWD) { - if (flags & LOOKUP_RCU) { - struct fs_struct *fs = current->fs; - unsigned seq; - - do { - seq = read_seqcount_begin(&fs->seq); - nd->path = fs->pwd; - nd->inode = nd->path.dentry->d_inode; - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); - } while (read_seqcount_retry(&fs->seq, seq)); - } else { - get_fs_pwd(current->fs, &nd->path); - nd->inode = nd->path.dentry->d_inode; - } - return s; - } else { - /* Caller must check execute permissions on the starting path component */ - struct fd f = fdget_raw(nd->dfd); - struct dentry *dentry; - - if (!f.file) - return ERR_PTR(-EBADF); - - dentry = f.file->f_path.dentry; - - if (*s && unlikely(!d_can_lookup(dentry))) { - fdput(f); - return ERR_PTR(-ENOTDIR); - } - - nd->path = f.file->f_path; - if (flags & LOOKUP_RCU) { - nd->inode = nd->path.dentry->d_inode; - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); - } else { - path_get(&nd->path); - nd->inode = nd->path.dentry->d_inode; - } - fdput(f); + if (likely(!nd->root.mnt)) + set_root(nd); + error = nd_jump_root(nd); + if (unlikely(error)) + s = ERR_PTR(error); return s; } + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + return s; }
static const char *trailing_symlink(struct nameidata *nd)
On Sun, Jul 07, 2019 at 12:57:31AM +1000, Aleksa Sarai wrote:
Previously, path_init's handling of *at(dfd, ...) was only done once, but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the initial nd->path at different times (before or after absolute path handling) depending on whether we have been asked to scope resolution within a root.
if (*s == '/') {
set_root(nd);
if (likely(!nd_jump_root(nd)))
return s;
return ERR_PTR(-ECHILD);
if (likely(!nd->root.mnt))
set_root(nd);
How can we get there with non-NULL nd->root.mnt, when LOOKUP_ROOT case has been already handled by that point?
error = nd_jump_root(nd);
if (unlikely(error))
s = ERR_PTR(error);
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:31AM +1000, Aleksa Sarai wrote:
Previously, path_init's handling of *at(dfd, ...) was only done once, but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the initial nd->path at different times (before or after absolute path handling) depending on whether we have been asked to scope resolution within a root.
if (*s == '/') {
set_root(nd);
if (likely(!nd_jump_root(nd)))
return s;
return ERR_PTR(-ECHILD);
if (likely(!nd->root.mnt))
set_root(nd);
How can we get there with non-NULL nd->root.mnt, when LOOKUP_ROOT case has been already handled by that point?
Yup, you're completely right. I will remove the if (!nd->root.mnt) in the next version.
On 2019-07-12, Aleksa Sarai cyphar@cyphar.com wrote:
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:31AM +1000, Aleksa Sarai wrote:
Previously, path_init's handling of *at(dfd, ...) was only done once, but with LOOKUP_BENEATH (and LOOKUP_IN_ROOT) we have to parse the initial nd->path at different times (before or after absolute path handling) depending on whether we have been asked to scope resolution within a root.
if (*s == '/') {
set_root(nd);
if (likely(!nd_jump_root(nd)))
return s;
return ERR_PTR(-ECHILD);
if (likely(!nd->root.mnt))
set_root(nd);
How can we get there with non-NULL nd->root.mnt, when LOOKUP_ROOT case has been already handled by that point?
Yup, you're completely right. I will remove the if (!nd->root.mnt) in the next version.
Ah sorry, there is a reason for it -- later in the series the LOOKUP_BENEATH case means that you might end up with a non-NULL nd->root.mnt. If you want, I can move the addition of the conditional to later in the series (it was easier to split the patch by-hunk back when you originally asked me to split out dirfd_path_init()).
Add the following flags to allow various restrictions on path resolution (these affect the *entire* resolution, rather than just the final path component -- as is the case with most other AT_* flags).
The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init). More classical applications (which have their own potentially buggy userspace path sanitisation code) include web servers, archive extraction tools, network file servers, and so on.
These flags are exposed to userspace in a later patchset.
* LOOKUP_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" as well as -EXDEV (though find(1) doesn't walk upwards, the semantics seem obvious).
* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution).
* LOOKUP_NO_SYMLINKS: Disallows symlink jumping *of any kind*. Implies LOOKUP_NO_MAGICLINKS (obviously).
* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the filesystem tree during resolution (you must stay "beneath" the starting point at all times). Currently this is done by disallowing ".." and absolute paths (either in the given path or found during symlink resolution) entirely, as well as all magic-link jumping.
The wholesale banning of ".." is because it is currently not safe to allow ".." resolution (races can cause the path to be moved outside of the root -- this is conceptually similar to historical chroot(2) escape attacks). Future patches in this series will address this, and will re-enable ".." resolution once it is safe. With those patches, ".." resolution will only be allowed if it remains in the root throughout resolution (such as "a/../b" not "a/../../outside/b").
The banning of magic-link jumping is done because it is not clear whether semantically they should be allowed -- while some magic-links are safe there are many that can cause escapes (and once a resolution is outside of the root, O_BENEATH will no longer detect it). Future patches may re-enable magic-link jumping when such jumps would remain inside the root.
The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV.
This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation on David Drysdale's O_BENEATH patchset[2], which in turn was based on the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS thread[4] determined most of the API changes made in this refresh.
[1]: https://lwn.net/Articles/721443/ [2]: https://lwn.net/Articles/619151/ [3]: https://lwn.net/Articles/603929/ [4]: https://lwn.net/Articles/723057/
Cc: Christian Brauner christian@brauner.io Suggested-by: David Drysdale drysdale@google.com Suggested-by: Al Viro viro@zeniv.linux.org.uk Suggested-by: Andy Lutomirski luto@kernel.org Suggested-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namei.c | 74 ++++++++++++++++++++++++++++++++++++------- include/linux/namei.h | 7 ++++ 2 files changed, 70 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index b490bcf855f8..9c3ed597466b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -852,6 +852,13 @@ static inline void path_to_nameidata(const struct path *path,
static int nd_jump_root(struct nameidata *nd) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; + if (unlikely(nd->flags & LOOKUP_XDEV)) { + /* Absolute path arguments to path_init() are allowed. */ + if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt) + return -EXDEV; + } if (nd->flags & LOOKUP_RCU) { struct dentry *d; nd->path = nd->root; @@ -1104,6 +1111,9 @@ const char *get_link(struct nameidata *nd, bool trailing) int error; const char *res;
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) + return ERR_PTR(-ELOOP); + if (!(nd->flags & LOOKUP_RCU)) { touch_atime(&last->link); cond_resched(); @@ -1136,6 +1146,11 @@ const char *get_link(struct nameidata *nd, bool trailing) } /* If we just jumped it was because of a magic-link. */ if (unlikely(nd->flags & LOOKUP_JUMPED)) { + if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS)) + return ERR_PTR(-ELOOP); + /* Not currently safe. */ + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return ERR_PTR(-EXDEV); /* * For trailing_symlink we check whether the symlink's * mode allows us to do what we want through acc_mode. @@ -1178,8 +1193,9 @@ const char *get_link(struct nameidata *nd, bool trailing) if (*res == '/') { if (!nd->root.mnt) set_root(nd); - if (unlikely(nd_jump_root(nd))) - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); while (unlikely(*++res == '/')) ; } @@ -1360,12 +1376,16 @@ static int follow_managed(struct path *path, struct nameidata *nd) break; }
- if (need_mntput && path->mnt == mnt) - mntput(path->mnt); + if (need_mntput) { + if (path->mnt == mnt) + mntput(path->mnt); + if (unlikely(nd->flags & LOOKUP_XDEV)) + ret = -EXDEV; + else + nd->flags |= LOOKUP_JUMPED; + } if (ret == -EISDIR || !ret) ret = 1; - if (need_mntput) - nd->flags |= LOOKUP_JUMPED; if (unlikely(ret < 0)) path_put_conditional(path, nd); return ret; @@ -1422,6 +1442,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, mounted = __lookup_mnt(path->mnt, path->dentry); if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return false; path->mnt = &mounted->mnt; path->dentry = mounted->mnt.mnt_root; nd->flags |= LOOKUP_JUMPED; @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode;
while (1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt; @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; nd->path.mnt = &mounted->mnt; nd->path.dentry = mounted->mnt.mnt_root; inode = nd->path.dentry->d_inode; @@ -1570,8 +1599,11 @@ static int path_parent_directory(struct path *path) static int follow_dotdot(struct nameidata *nd) { while(1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -1580,6 +1612,8 @@ static int follow_dotdot(struct nameidata *nd) } if (!follow_up(&nd->path)) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; } follow_mount(&nd->path); nd->inode = nd->path.dentry->d_inode; @@ -1794,6 +1828,13 @@ static inline int may_lookup(struct nameidata *nd) static inline int handle_dots(struct nameidata *nd, int type) { if (type == LAST_DOTDOT) { + /* + * LOOKUP_BENEATH resolving ".." is not currently safe -- races can + * cause our parent to have moved outside of the root and us to skip + * over it. + */ + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; if (!nd->root.mnt) set_root(nd); if (nd->flags & LOOKUP_RCU) { @@ -2342,6 +2383,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock); + + if (unlikely(nd->flags & LOOKUP_BENEATH)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + nd->root = nd->path; + if (!(nd->flags & LOOKUP_RCU)) + path_get(&nd->root); + } if (*s == '/') { if (likely(!nd->root.mnt)) set_root(nd); @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; } - error = dirfd_path_init(nd); - if (unlikely(error)) - return ERR_PTR(error); + if (likely(!nd->path.mnt)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } return s; }
diff --git a/include/linux/namei.h b/include/linux/namei.h index 9138b4471dbf..7bc819ad0cd3 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -50,6 +50,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_EMPTY 0x4000 #define LOOKUP_DOWN 0x8000
+/* Scoping flags for lookup. */ +#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */ +#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */ +#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ +#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. + Implies LOOKUP_NO_MAGICLINKS. */ + extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote:
@@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode; while (1) {
if (path_equal(&nd->path, &nd->root))
if (path_equal(&nd->path, &nd->root)) {
if (unlikely(nd->flags & LOOKUP_BENEATH))
return -EXDEV;
@@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt;
@@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (!mounted) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
return -EXDEV;
Are you sure these failure exits in follow_dotdot_rcu() won't give suprious hard errors?
- if (unlikely(nd->flags & LOOKUP_BENEATH)) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
nd->root = nd->path;
if (!(nd->flags & LOOKUP_RCU))
path_get(&nd->root);
- } if (*s == '/') { if (likely(!nd->root.mnt)) set_root(nd);
@@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; }
- error = dirfd_path_init(nd);
- if (unlikely(error))
return ERR_PTR(error);
- if (likely(!nd->path.mnt)) {
Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote:
@@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode; while (1) {
if (path_equal(&nd->path, &nd->root))
if (path_equal(&nd->path, &nd->root)) {
if (unlikely(nd->flags & LOOKUP_BENEATH))
return -EXDEV;
@@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt;
@@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (!mounted) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
return -EXDEV;
Are you sure these failure exits in follow_dotdot_rcu() won't give suprious hard errors?
I could switch to -ECHILD for the *_rcu() checks if you'd prefer that. Though, I'd have (probably naively) thought that you'd have already gotten -ECHILD from the seqlock checks if there was a race during ".." handling.
- if (unlikely(nd->flags & LOOKUP_BENEATH)) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
nd->root = nd->path;
if (!(nd->flags & LOOKUP_RCU))
path_get(&nd->root);
- } if (*s == '/') { if (likely(!nd->root.mnt)) set_root(nd);
@@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; }
- error = dirfd_path_init(nd);
- if (unlikely(error))
return ERR_PTR(error);
- if (likely(!nd->path.mnt)) {
Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
Yes. I did it to be more consistent with the other "have we got the root" checks elsewhere. Is there another way you'd prefer I do it?
On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
@@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; }
- error = dirfd_path_init(nd);
- if (unlikely(error))
return ERR_PTR(error);
- if (likely(!nd->path.mnt)) {
Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
Yes. I did it to be more consistent with the other "have we got the root" checks elsewhere. Is there another way you'd prefer I do it?
"Have we got the root" checks are inevitable evil; here you are making the control flow in a single function hard to follow.
I *think* what you are doing is absolute pathname, no LOOKUP_BENEATH: set_root error = nd_jump_root(nd) else error = dirfd_path_init(nd) return unlikely(error) ? ERR_PTR(error) : s; which should be a lot easier to follow (not to mention shorter), but I might be missing something in all of that.
On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
@@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; }
- error = dirfd_path_init(nd);
- if (unlikely(error))
return ERR_PTR(error);
- if (likely(!nd->path.mnt)) {
Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
Yes. I did it to be more consistent with the other "have we got the root" checks elsewhere. Is there another way you'd prefer I do it?
"Have we got the root" checks are inevitable evil; here you are making the control flow in a single function hard to follow.
I *think* what you are doing is absolute pathname, no LOOKUP_BENEATH: set_root error = nd_jump_root(nd) else error = dirfd_path_init(nd) return unlikely(error) ? ERR_PTR(error) : s; which should be a lot easier to follow (not to mention shorter), but I might be missing something in all of that.
PS: if that's what's going on, I would be tempted to turn the entire path_init() part into this: if (flags & LOOKUP_BENEATH) while (*s == '/') s++; in the very beginning (plus the handling of nd_jump_root() prototype change, but that belongs with nd_jump_root() change itself, obviously). Again, I might be missing something here...
On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
@@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; }
- error = dirfd_path_init(nd);
- if (unlikely(error))
return ERR_PTR(error);
- if (likely(!nd->path.mnt)) {
Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
Yes. I did it to be more consistent with the other "have we got the root" checks elsewhere. Is there another way you'd prefer I do it?
"Have we got the root" checks are inevitable evil; here you are making the control flow in a single function hard to follow.
I *think* what you are doing is absolute pathname, no LOOKUP_BENEATH: set_root error = nd_jump_root(nd) else error = dirfd_path_init(nd) return unlikely(error) ? ERR_PTR(error) : s; which should be a lot easier to follow (not to mention shorter), but I might be missing something in all of that.
PS: if that's what's going on, I would be tempted to turn the entire path_init() part into this: if (flags & LOOKUP_BENEATH) while (*s == '/') s++; in the very beginning (plus the handling of nd_jump_root() prototype change, but that belongs with nd_jump_root() change itself, obviously). Again, I might be missing something here...
Argh... I am, at that - you have setting path->root (and grabbing it) in LOOKUP_BENEATH cases and you do it after dirfd_path_init(). So how about if (flags & LOOKUP_BENEATH) while (*s == '/') s++; before the whole thing and if (*s == '/') { /* can happen only without LOOKUP_BENEATH */ set_root(nd); error = nd_jump_root(nd); if (unlikely(error)) return ERR_PTR(error); } else if (nd->dfd == AT_FDCWD) { if (flags & LOOKUP_RCU) { struct fs_struct *fs = current->fs; unsigned seq;
do { seq = read_seqcount_begin(&fs->seq); nd->path = fs->pwd; nd->inode = nd->path.dentry->d_inode; nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); } while (read_seqcount_retry(&fs->seq, seq)); } else { get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); struct dentry *dentry;
if (!f.file) return ERR_PTR(-EBADF);
dentry = f.file->f_path.dentry;
if (*s && unlikely(!d_can_lookup(dentry))) { fdput(f); return ERR_PTR(-ENOTDIR); }
nd->path = f.file->f_path; if (flags & LOOKUP_RCU) { nd->inode = nd->path.dentry->d_inode; nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); } else { path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } fdput(f); } if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq; } return s; replacing the part in the end? Makes for much smaller change; it might very well still make sense to add dirfd_path_init() as a separate cleanup (perhaps with the *s == '/' case included), though.
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq;
BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk().
Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq;
BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk().
Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
I don't understand what's going on with ->r_seq in there - your call of path_is_under() is after having (re-)sampled rename_lock, but if that was the only .. in there, who's going to recheck the value? For that matter, what's to guarantee that the thing won't get moved just as you are returning from handle_dots()?
IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq;
BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk().
Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
I don't understand what's going on with ->r_seq in there - your call of path_is_under() is after having (re-)sampled rename_lock, but if that was the only .. in there, who's going to recheck the value? For that matter, what's to guarantee that the thing won't get moved just as you are returning from handle_dots()?
IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
Sigh... Usual effects of trying to document things:
1) LOOKUP_NO_EVAL looks bogus. It had been introduced by commit 57d4657716ac (audit: ignore fcaps on umount) and AFAICS it's crap. It is set in ksys_umount() and nowhere else. It's ignored by everything except filename_mountpoint(). The thing is, call graph for filename_mountpoint() is filename_mountpoint() <- user_path_mountpoint_at() <- ksys_umount() <- kern_path_mountpoint() <- autofs_dev_ioctl_ismountpoint() <- find_autofs_mount() <- autofs_dev_ioctl_open_mountpoint() <- autofs_dev_ioctl_requester() <- autofs_dev_ioctl_ismountpoint() In other words, that flag is basically "was filename_mountpoint() been called by umount(2) or has it come from an autofs ioctl?". And looking at the rationale in that commit, autofs ioctls need it just as much as umount(2) does. Why is it not set for those as well? And why is it conditional at all?
1b) ... because audit_inode() wants LOOKUP_... as the last argument, only to remap it into AUDIT_..., that's why. So audit needs something guaranteed not to conflict with LOOKUP_PARENT (another flag getting remapped). So why do we bother with remapping those, anyway? Let's look at the callers:
fs/namei.c:933: audit_inode(nd->name, nd->stack[0].link.dentry, 0); fs/namei.c:2353: audit_inode(name, path->dentry, flags & LOOKUP_PARENT); fs/namei.c:2394: audit_inode(name, parent->dentry, LOOKUP_PARENT); fs/namei.c:2721: audit_inode(name, path->dentry, flags & LOOKUP_NO_EVAL); fs/namei.c:3302: audit_inode(nd->name, dir, LOOKUP_PARENT); fs/namei.c:3336: audit_inode(nd->name, file->f_path.dentry, 0); fs/namei.c:3371: audit_inode(nd->name, path.dentry, 0); fs/namei.c:3389: audit_inode(nd->name, nd->path.dentry, 0); fs/namei.c:3490: audit_inode(nd->name, child, 0); fs/namei.c:3509: audit_inode(nd->name, path.dentry, 0); ipc/mqueue.c:788: audit_inode(name, dentry, 0);
In all but two of those we have a nice constant value - 0 or AUDIT_INODE_PARENT. One of two exceptions is in filename_mountpoint(), and there we want unconditional AUDIT_INODE_NOEVAL (see above). What of the other? It's if (likely(!retval)) audit_inode(name, path->dentry, flags & LOOKUP_PARENT); in filename_lookup(). And that is bogus as well. filename_lookupat() would better *NOT* get LOOKUP_PARENT in flags. And it doesn't - not since commit 8bcb77fabd7c (namei: split off filename_lookupat() with LOOKUP_PARENT) back in 2015. In filename_parentat() introduced there we have audit_inode(name, parent->dentry, LOOKUP_PARENT); and at the same point the call in filename_lookupat() should've become audit_inode(name, path->dentry, 0); It hadn't; my fault. And after fixing that everything becomes nice and unconditional - the last argument of audit_inode() is always an AUDIT_... constant or zero. Moving AUDIT_... definitions outside of ifdef on CONFIG_AUDITSYSCALL, getting rid of remapping in audit_inode() and passing the right values in 3 callers that don't pass 0 and LOOKUP_NO_EVAL can go to hell.
Any objections from audit folks?
2) comment in namei.h is seriously out of sync with reality. To quote: * - follow links at the end OK, that's LOOKUP_FOLLOW (1) * - require a directory ... and LOOKUP_DIRECTORY (2) * - ending slashes ok even for nonexistent files ... used to be about LOOKUP_CONTINUE (eight years dead now) * - internal "there are more path components" flag ... LOOKUP_PARENT (16) * - dentry cache is untrusted; force a real lookup ... LOOKUP_REVAL (32) * - suppress terminal automount ... used to be LOOKUP_NO_AUTOMOUNT (128), except that it's been replaced with LOOKUP_AUTOMOUNT (at 4) almost eight years ago. And the meaning of LOOKUP_AUTOMOUNT is opposite to the comment, of course. * - skip revalidation ... LOOKUP_NO_REVAL (128) * - don't fetch xattrs on audit_inode ... and that's about soon-to-be dead LOOKUP_NO_EVAL (256)
Note that LOOKUP_RCU (at 64) is quietly skipped and so's the tail of the list. If not for "suppress terminal automount" bit, I wouldn't really care, but that one makes for a really nasty trap for readers. I'm going to convert that to (accurate) comments next to actual defines...
3) while looking through LOOKUP_AUTOMOUNT users, in aa_bind_mount() we have error = kern_path(dev_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path); matching do_loopback(), while tomoyo_mount_acl() has if (!dev_name || kern_path(dev_name, LOOKUP_FOLLOW, &path)) { And yes, that *is* hit on mount --bind. As well as on new mounts, where apparmor (and bdev_lookup()) has plain LOOKUP_FOLLOW.
->sb_mount() is garbage by design (not the least because of the need to have pathname lookups in the first place, as well as having to duplicate the demultiplexing parts of do_mount() without fucking it up)...
On 2019-07-14, Al Viro viro@zeniv.linux.org.uk wrote:
On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq;
BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk().
Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
I don't understand what's going on with ->r_seq in there - your call of path_is_under() is after having (re-)sampled rename_lock, but if that was the only .. in there, who's going to recheck the value? For that matter, what's to guarantee that the thing won't get moved just as you are returning from handle_dots()?
IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
Sigh... Usual effects of trying to document things:
- LOOKUP_NO_EVAL looks bogus. It had been introduced by commit 57d4657716ac
(audit: ignore fcaps on umount) and AFAICS it's crap. It is set in ksys_umount() and nowhere else. It's ignored by everything except filename_mountpoint(). The thing is, call graph for filename_mountpoint() is filename_mountpoint() <- user_path_mountpoint_at() <- ksys_umount() <- kern_path_mountpoint() <- autofs_dev_ioctl_ismountpoint() <- find_autofs_mount() <- autofs_dev_ioctl_open_mountpoint() <- autofs_dev_ioctl_requester() <- autofs_dev_ioctl_ismountpoint() In other words, that flag is basically "was filename_mountpoint() been called by umount(2) or has it come from an autofs ioctl?". And looking at the rationale in that commit, autofs ioctls need it just as much as umount(2) does. Why is it not set for those as well? And why is it conditional at all?
In addition, LOOKUP_NO_EVAL == LOOKUP_OPEN (0x100). Is that meant to be the case? Also I just saw you have a patch in work.namei that fixes this up -- do you want me to rebase on top of that?
On 2019-07-13, Al Viro viro@zeniv.linux.org.uk wrote:
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq;
BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk().
Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
I don't understand what's going on with ->r_seq in there - your call of path_is_under() is after having (re-)sampled rename_lock, but if that was the only .. in there, who's going to recheck the value? For that matter, what's to guarantee that the thing won't get moved just as you are returning from handle_dots()?
IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
I tried to explain this in the commit message for "namei: aggressively check for nd->root escape on ".." resolution", but I probably could've explained it better.
The basic property being guaranteed by LOOKUP_IN_ROOT is that it will not result in resolution of a path component which was not inside the root of the dirfd tree at some point during resolution (and that all absolute symlink and ".." resolution will be done relative to the dirfd). This may smell slightly of chroot(2), because unfortunately it is a similar concept -- the reason for this is to allow for a more efficient way to safely resolve paths inside a rootfs than spawning a separate process to then pass back the fd to the caller.
We don't want to do a path_is_under() check for every ".." (otherwise lookups have a quadratic slowdown when doing many ".."s), so I instead only do a check after a rename or a mount (which are the only operations which could change what ".." points to). And since we do the path_is_under() check if m_seq or r_seq need a retry, we can re-take them[+].
The main reason why I don't re-check path_is_under() after handle_dots() is that there is no way to be sure that a racing rename didn't happen after your last path_is_under() check. The rename could happen after the syscall returns, after all.
So, the main purpose of the check is to make sure that a ".."s after a rename doesn't result in an escape. If the rename happens after we've traversed through a ".." that means that the ".." was inside the root in the first place (a root ".." is handled by follow_dotdot). If the rename happens after we've gone through handle_dots() and there is no subsequent ".." then to userspace it looks identical to the rename occurring after the syscall has returned. If there is a subsequent ".." after a racing rename then we may have moved into a path that wasn't path_is_under() and so we have to check it.
The only way I could see you could solve the race completely is if you had a way for userspace to lock things from being able to be renamed (or MS_MOVE'd). And that feels like a really bad idea to me.
[+]: You asked why don't I re-take m_seq. The reason is that I don't fully understand all the other m_seq checks being done during resolution, and we aren't definitely doing them all in handle_dots(). So I assumed re-taking it could result in me breaking RCU-walk which obviously would be bad. Since I am the only thing using nd->r_seq, I can re-take it without issue.
On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote:
The basic property being guaranteed by LOOKUP_IN_ROOT is that it will not result in resolution of a path component which was not inside the root of the dirfd tree at some point during resolution (and that all absolute symlink and ".." resolution will be done relative to the dirfd). This may smell slightly of chroot(2), because unfortunately it is a similar concept -- the reason for this is to allow for a more efficient way to safely resolve paths inside a rootfs than spawning a separate process to then pass back the fd to the caller.
IDGI... If attacker can modify your subtree, you have already lost - after all, they can make anything appear inside that tree just before your syscall is made and bring it back out immediately afterwards. And if they can't, what is the race you are trying to protect against? Confused...
On 2019-07-14, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote:
The basic property being guaranteed by LOOKUP_IN_ROOT is that it will not result in resolution of a path component which was not inside the root of the dirfd tree at some point during resolution (and that all absolute symlink and ".." resolution will be done relative to the dirfd). This may smell slightly of chroot(2), because unfortunately it is a similar concept -- the reason for this is to allow for a more efficient way to safely resolve paths inside a rootfs than spawning a separate process to then pass back the fd to the caller.
IDGI... If attacker can modify your subtree, you have already lost - after all, they can make anything appear inside that tree just before your syscall is made and bring it back out immediately afterwards. And if they can't, what is the race you are trying to protect against? Confused...
I'll be honest, this code mostly exists because Jann Horn said that it was necessary in order for this interface to be safe against those kinds of attacks. Though, it's also entirely possible I just am mis-remembering the attack scenario he described when I posted v1 of this series last year.
The use-case I need this functionality for (as do other container runtimes) is one where you are trying to safely interact with a directory tree that is a (malicious) container's root filesystem -- so the container won't be able to move the directory tree root, nor can they move things outside the rootfs into it (or the reverse). Users dealing with FTP, web, or file servers probably have similar requirements.
There is an obvious race condition if you allow the attacker to move the root (I give an example and test-case of it in the last patch in the series), and given that it is fairly trivial to defend against I don't see the downside in including it? But it's obviously your call -- and maybe Jann Horn can explain the reasoning behind this much better than I can.
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote:
@@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; }
- error = dirfd_path_init(nd);
- if (unlikely(error))
return ERR_PTR(error);
- if (likely(!nd->path.mnt)) {
Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
Yes. I did it to be more consistent with the other "have we got the root" checks elsewhere. Is there another way you'd prefer I do it?
"Have we got the root" checks are inevitable evil; here you are making the control flow in a single function hard to follow.
I *think* what you are doing is absolute pathname, no LOOKUP_BENEATH: set_root error = nd_jump_root(nd) else error = dirfd_path_init(nd) return unlikely(error) ? ERR_PTR(error) : s; which should be a lot easier to follow (not to mention shorter), but I might be missing something in all of that.
PS: if that's what's going on, I would be tempted to turn the entire path_init() part into this: if (flags & LOOKUP_BENEATH) while (*s == '/') s++; in the very beginning (plus the handling of nd_jump_root() prototype change, but that belongs with nd_jump_root() change itself, obviously). Again, I might be missing something here...
Argh... I am, at that - you have setting path->root (and grabbing it) in LOOKUP_BENEATH cases and you do it after dirfd_path_init(). So how about if (flags & LOOKUP_BENEATH) while (*s == '/') s++;
I can do this for LOOKUP_IN_ROOT, but currently the semantics for LOOKUP_BENEATH is that absolute paths will return -EXDEV indiscriminately (nd_jump_root() errors out with LOOKUP_BENEATH). To be honest, the check could actually just be:
if (flags & LOOKUP_BENEATH) if (*s == '/') return ERR_PTR(-EXDEV);
(Though we'd still need -EXDEV in nd_jump_root() for obvious reasons.)
The logic being that an absolute path means that the resolution starts out without being "beneath" the starting point -- thus violating the contract of LOOKUP_BENEATH. And since the "handle absolute paths like they're scoped to the root" is only implemented for LOOKUP_IN_ROOT, I'd think it's a bit odd to have LOOKUP_BENEATH do it too for absolute paths.
I'll be honest, this patchset is more confusing to both of us because of LOOKUP_BENEATH -- I've only kept it since it was part of the original patchset (O_BENEATH). Personally I think more people will be far more interested in LOOKUP_IN_ROOT. Does anyone mind if I drop the LOOKUP_BENEATH parts of this series, and only keep LOOKUP_NO_* and LOOKUP_IN_ROOT?
I make a change as you outlined for LOOKUP_IN_ROOT, though.
The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths[*]. The already-existing LOOKUP_XDEV and LOOKUP_NO_MAGICLINKS help defend against other potential attacks in a malicious rootfs scenario.
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec (for some runtimes) which is *very* costly if necessary for every filesystem operation involving a container.
[*] At the moment, ".." and magic-link jumping are disallowed for the same reason it is disabled for LOOKUP_BENEATH -- currently it is not safe to allow it. Future patches may enable it unconditionally once we have resolved the possible races (for "..") and semantics (for magic-link jumping).
The most significant *at(2) semantic change with LOOKUP_IN_ROOT is that absolute pathnames no longer cause dirfd to be ignored completely. The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope symlinks with absolute paths to dirfd, and so doing it for the base path seems to be the most consistent behaviour (and also avoids foot-gunning users who want to scope paths that are absolute).
[1]: https://github.com/cyphar/filepath-securejoin
Co-developed-by: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namei.c | 6 +++--- include/linux/namei.h | 1 + 2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 9c3ed597466b..ff016b9e9082 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1149,7 +1149,7 @@ const char *get_link(struct nameidata *nd, bool trailing) if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS)) return ERR_PTR(-ELOOP); /* Not currently safe. */ - if (unlikely(nd->flags & LOOKUP_BENEATH)) + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) return ERR_PTR(-EXDEV); /* * For trailing_symlink we check whether the symlink's @@ -1833,7 +1833,7 @@ static inline int handle_dots(struct nameidata *nd, int type) * cause our parent to have moved outside of the root and us to skip * over it. */ - if (unlikely(nd->flags & LOOKUP_BENEATH)) + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) return -EXDEV; if (!nd->root.mnt) set_root(nd); @@ -2384,7 +2384,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock);
- if (unlikely(nd->flags & LOOKUP_BENEATH)) { + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) { error = dirfd_path_init(nd); if (unlikely(error)) return ERR_PTR(error); diff --git a/include/linux/namei.h b/include/linux/namei.h index 7bc819ad0cd3..4b1ee717cb14 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -56,6 +56,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. Implies LOOKUP_NO_MAGICLINKS. */ +#define LOOKUP_IN_ROOT 0x100000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution (in the case of LOOKUP_BENEATH the resolution will still fail if ".." resolution would resolve a path outside of the root -- while LOOKUP_IN_ROOT will chroot(2)-style scope it). magic-link jumps are still disallowed entirely because now they could result in inconsistent behaviour if resolution encounters a subsequent "..".
The need for this patch is explained by observing there is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) resolveat(dirb, "b/c/../../etc/shadow", RESOLVE_IN_ROOT);
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar (though somewhat more privileged) attack using MS_MOVE.
With this patch, such cases will be detected *during* ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root -- except through a bind-mount or magic-link). By detecting this at ".." resolution (rather than checking only at the end of the entire resolution) we can both correct escapes by jumping back to the root (in the case of LOOKUP_IN_ROOT), as well as avoid revealing to attackers the structure of the filesystem outside of the root (through timing attacks for instance).
In order to avoid a quadratic lookup with each ".." entry, we only activate the slow path if a write through &rename_lock or &mount_lock has occurred during path resolution (&rename_lock and &mount_lock are re-taken to further optimise the lookup). Since the primary attack being protected against is MS_MOVE or rename(2), not doing additional checks unless a mount or rename have occurred avoids making the common case slow.
The use of path_is_under() here might seem suspect, but on further inspection of the most important race (a path was *inside* the root but is now *outside*), there appears to be no attack potential:
* If path_is_under() occurs before the rename, then the path will be resolved -- however the path was originally inside the root and thus there is no escape (and to userspace it'd look like the rename occurred after the path was resolved). If path_is_under() occurs afterwards, the resolution is blocked.
* Subsequent ".." jumps are guaranteed to check path_is_under() -- by construction, &rename_lock or &mount_lock must have been taken by the attacker after path_is_under() returned in the victim. Thus ".." will not be able to escape from the previously-inside-root path.
* Walking down in the moved path is still safe since the entire subtree was moved (either by rename(2) or MS_MOVE) and because (as discussed above) walking down is safe.
I have run a variant of the above attack in a loop on several machines with this patch, and no instances of a breakout were detected. While this is not concrete proof that this is safe, when combined with the above argument it should lend some trustworthiness to this construction.
Cc: Al Viro viro@zeniv.linux.org.uk Cc: Jann Horn jannh@google.com Cc: Kees Cook keescook@chromium.org Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namei.c | 48 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 33 insertions(+), 15 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index ff016b9e9082..a3f527494791 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -491,7 +491,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags; - unsigned seq, m_seq; + unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count; @@ -1828,19 +1828,35 @@ static inline int may_lookup(struct nameidata *nd) static inline int handle_dots(struct nameidata *nd, int type) { if (type == LAST_DOTDOT) { - /* - * LOOKUP_BENEATH resolving ".." is not currently safe -- races can - * cause our parent to have moved outside of the root and us to skip - * over it. - */ - if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) - return -EXDEV; + int error = 0; + if (!nd->root.mnt) set_root(nd); - if (nd->flags & LOOKUP_RCU) { - return follow_dotdot_rcu(nd); - } else - return follow_dotdot(nd); + if (nd->flags & LOOKUP_RCU) + error = follow_dotdot_rcu(nd); + else + error = follow_dotdot(nd); + if (error) + return error; + + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) { + bool m_retry = read_seqretry(&mount_lock, nd->m_seq); + bool r_retry = read_seqretry(&rename_lock, nd->r_seq); + + /* + * Don't bother checking unless there's a racing + * rename(2) or MS_MOVE. + */ + if (likely(!m_retry && !r_retry)) + return 0; + + if (m_retry && !(nd->flags & LOOKUP_RCU)) + nd->m_seq = read_seqbegin(&mount_lock); + if (r_retry) + nd->r_seq = read_seqbegin(&rename_lock); + if (!path_is_under(&nd->path, &nd->root)) + return -EXDEV; + } } return 0; } @@ -2361,6 +2377,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0; + + nd->m_seq = read_seqbegin(&mount_lock); + if (unlikely(flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) + nd->r_seq = read_seqbegin(&rename_lock); + if (flags & LOOKUP_ROOT) { struct dentry *root = nd->root.dentry; struct inode *inode = root->d_inode; @@ -2371,7 +2392,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) if (flags & LOOKUP_RCU) { nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); nd->root_seq = nd->seq; - nd->m_seq = read_seqbegin(&mount_lock); } else { path_get(&nd->path); } @@ -2382,8 +2402,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.mnt = NULL; nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock); - if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) { error = dirfd_path_init(nd); if (unlikely(error))
The most obvious syscall to add support for the new LOOKUP_* scoping flags would be openat(2). However, there are a few reasons to not do this:
* The new LOOKUP_* flags are intended to be security features, and openat(2) will silently ignore all unknown flags. This means that users would need to avoid foot-gunning themselves constantly when using this interface if it were part of openat(2).
* Resolution scoping feels like a different operation to the existing O_* flags. And since openat(2) has limited flag space, it seems to be quite wasteful to clutter it with 5 flags that are all resolution-related. Arguably O_NOFOLLOW is also a resolution flag but its entire purpose is to error out if you encounter a trailing symlink -- not to scope resolution.
* Other systems would be able to reimplement this syscall allowing for cross-OS standardisation rather than being hidden amongst O_* flags which may result in it not being used by all the parties that might want to use it (file servers, web servers, container runtimes, etc).
* It gives us the opportunity to iterate on the O_PATH interface. In particular, the new @how->upgrade_mask field for fd re-opening is only possible because we have a clean slate without needing to re-use the ACC_MODE flag design nor the existing openat(2) @mode semantics.
To this end, we introduce the openat2(2) syscall. It provides all of the features of openat(2) through the @how->flags argument, but also also provides a new @how->resolve argument which exposes RESOLVE_* flags that map to our new LOOKUP_* flags. It also eliminates the long-standing ugliness of variadic-open(2) by embedding it in a struct.
In order to allow for userspace to lock down their usage of file descriptor re-opening, openat2(2) has the ability for users to disallow certain re-opening modes through @how->upgrade_mask. At the moment, there is no UPGRADE_NOEXEC.
Co-developed-by: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/open.c | 136 ++++++++++++++------ include/linux/fcntl.h | 15 ++- include/linux/fs.h | 4 +- include/linux/syscalls.h | 14 +- include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/fcntl.h | 38 ++++++ 24 files changed, 186 insertions(+), 46 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 9e7704e44f6d..1703d048c141 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -461,6 +461,7 @@ 530 common getegid sys_getegid 531 common geteuid sys_geteuid 532 common getppid sys_getppid +533 common openat2 sys_openat2 # all other architectures have common numbers for new syscall, alpha # is the exception. 534 common pidfd_send_signal sys_pidfd_send_signal diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index aaf479a9e92d..4ad262698396 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -447,3 +447,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index c9f8dd421c5f..0d4aa3e5389e 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -33,7 +33,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 434 +#define __NR_compat_syscalls 435 #endif
#define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index aa995920bd34..b134419c0421 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -875,6 +875,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig) __SYSCALL(__NR_fsmount, sys_fsmount) #define __NR_fspick 433 __SYSCALL(__NR_fspick, sys_fspick) +#define __NR_openat2 434 +__SYSCALL(__NR_openat2, sys_openat2)
/* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index e01df3f2f80d..28d954acf214 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -354,3 +354,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 7e3d0734b2f3..b744b1a1c80e 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -433,3 +433,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 26339e417695..bee07b73a898 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -439,3 +439,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 0e2dd68ade57..a3ec5e27630a 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -372,3 +372,4 @@ 431 n32 fsconfig sys_fsconfig 432 n32 fsmount sys_fsmount 433 n32 fspick sys_fspick +434 n32 openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 5eebfa0d155c..3503ac6ef482 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -348,3 +348,4 @@ 431 n64 fsconfig sys_fsconfig 432 n64 fsmount sys_fsmount 433 n64 fspick sys_fspick +434 n64 openat2 sys_openat2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 3cc1374e02d0..e901367371c4 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -421,3 +421,4 @@ 431 o32 fsconfig sys_fsconfig 432 o32 fsmount sys_fsmount 433 o32 fspick sys_fspick +434 o32 openat2 sys_openat2 sys_openat2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index c9e377d59232..5758b0826e4d 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -430,3 +430,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 103655d84b4b..9e8377d5b2f6 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -515,3 +515,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index e822b2964a83..d6e8eaa20d44 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -436,3 +436,4 @@ 431 common fsconfig sys_fsconfig sys_fsconfig 432 common fsmount sys_fsmount sys_fsmount 433 common fspick sys_fspick sys_fspick +434 common openat2 sys_openat2 sys_openat2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 016a727d4357..dc38733d264b 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -436,3 +436,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index e047480b1605..cfeb24ac5299 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -479,3 +479,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index ad968b7bac72..1d76a0e84f42 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -438,3 +438,4 @@ 431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig 432 i386 fsmount sys_fsmount __ia32_sys_fsmount 433 i386 fspick sys_fspick __ia32_sys_fspick +434 i386 openat2 sys_openat2 __ia32_sys_openat2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index b4e6f9e6204a..828bade2e505 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -355,6 +355,7 @@ 431 common fsconfig __x64_sys_fsconfig 432 common fsmount __x64_sys_fsmount 433 common fspick __x64_sys_fspick +434 common openat2 __x64_sys_openat2
# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 5fa0ee1c8e00..78ed6e97d3ae 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -404,3 +404,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2 diff --git a/fs/open.c b/fs/open.c index bdca45528524..63278294d1d4 100644 --- a/fs/open.c +++ b/fs/open.c @@ -928,24 +928,32 @@ struct file *open_with_fake_path(const struct path *path, int flags, } EXPORT_SYMBOL(open_with_fake_path);
-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op) +static inline int build_open_flags(struct open_how how, struct open_flags *op) { int lookup_flags = 0; - int acc_mode = ACC_MODE(flags); + int opath_mask = 0; + int acc_mode = ACC_MODE(how.flags); + + if (how.resolve & ~VALID_RESOLVE_FLAGS) + return -EINVAL; + if (!(how.flags & (O_PATH | O_CREAT | __O_TMPFILE)) && how.mode != 0) + return -EINVAL; + if (memchr_inv(how.reserved, 0, sizeof(how.reserved))) + return -EINVAL;
/* * Clear out all open flags we don't know about so that we don't report * them in fcntl(F_GETFD) or similar interfaces. */ - flags &= VALID_OPEN_FLAGS; + how.flags &= VALID_OPEN_FLAGS;
- if (flags & (O_CREAT | __O_TMPFILE)) - op->mode = (mode & S_IALLUGO) | S_IFREG; + if (how.flags & (O_CREAT | __O_TMPFILE)) + op->mode = (how.mode & S_IALLUGO) | S_IFREG; else op->mode = 0;
/* Must never be set by userspace */ - flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC; + how.flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
/* * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only @@ -953,51 +961,70 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o * always set instead of having to deal with possibly weird behaviour * for malicious applications setting only __O_SYNC. */ - if (flags & __O_SYNC) - flags |= O_DSYNC; + if (how.flags & __O_SYNC) + how.flags |= O_DSYNC;
- if (flags & __O_TMPFILE) { - if ((flags & O_TMPFILE_MASK) != O_TMPFILE) + if (how.flags & __O_TMPFILE) { + if ((how.flags & O_TMPFILE_MASK) != O_TMPFILE) return -EINVAL; if (!(acc_mode & MAY_WRITE)) return -EINVAL; - } else if (flags & O_PATH) { + } else if (how.flags & O_PATH) { /* * If we have O_PATH in the open flag. Then we * cannot have anything other than the below set of flags */ - flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH; + how.flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH; acc_mode = 0; + + /* Allow userspace to restrict the re-opening of O_PATH fds. */ + if (how.upgrade_mask & ~VALID_UPGRADE_FLAGS) + return -EINVAL; + if (!(how.upgrade_mask & UPGRADE_NOREAD)) + opath_mask |= FMODE_PATH_READ; + if (!(how.upgrade_mask & UPGRADE_NOWRITE)) + opath_mask |= FMODE_PATH_WRITE; }
- op->open_flag = flags; + op->open_flag = how.flags;
/* O_TRUNC implies we need access checks for write permissions */ - if (flags & O_TRUNC) + if (how.flags & O_TRUNC) acc_mode |= MAY_WRITE;
/* Allow the LSM permission hook to distinguish append access from general write access. */ - if (flags & O_APPEND) + if (how.flags & O_APPEND) acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode; - op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN; - /* For O_PATH backwards-compatibility we default to an all-set mask. */ - op->opath_mask = FMODE_PATH_READ | FMODE_PATH_WRITE; + op->intent = how.flags & O_PATH ? 0 : LOOKUP_OPEN; + op->opath_mask = opath_mask;
- if (flags & O_CREAT) { + if (how.flags & O_CREAT) { op->intent |= LOOKUP_CREATE; - if (flags & O_EXCL) + if (how.flags & O_EXCL) op->intent |= LOOKUP_EXCL; }
- if (flags & O_DIRECTORY) + if (how.flags & O_DIRECTORY) lookup_flags |= LOOKUP_DIRECTORY; - if (!(flags & O_NOFOLLOW)) + if (!(how.flags & O_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; - if (flags & O_EMPTYPATH) + if (how.flags & O_EMPTYPATH) lookup_flags |= LOOKUP_EMPTY; + + if (how.resolve & RESOLVE_NO_XDEV) + lookup_flags |= LOOKUP_XDEV; + if (how.resolve & RESOLVE_NO_MAGICLINKS) + lookup_flags |= LOOKUP_NO_MAGICLINKS; + if (how.resolve & RESOLVE_NO_SYMLINKS) + lookup_flags |= LOOKUP_NO_SYMLINKS; + if (how.resolve & RESOLVE_BENEATH) + lookup_flags |= LOOKUP_BENEATH; + if (how.resolve & RESOLVE_IN_ROOT) + lookup_flags |= LOOKUP_IN_ROOT; + op->lookup_flags = lookup_flags; return 0; } @@ -1016,8 +1043,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o struct file *file_open_name(struct filename *name, int flags, umode_t mode) { struct open_flags op; - int err = build_open_flags(flags, mode, &op); - return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op); + struct open_how how = { + .flags = flags, + .mode = OPENHOW_MODE(flags, mode), + }; + int err = build_open_flags(how, &op); + if (err) + return ERR_PTR(err); + return do_filp_open(AT_FDCWD, name, &op); }
/** @@ -1048,17 +1081,22 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt, const char *filename, int flags, umode_t mode) { struct open_flags op; - int err = build_open_flags(flags, mode, &op); + struct open_how how = { + .flags = flags, + .mode = OPENHOW_MODE(flags, mode), + }; + int err = build_open_flags(how, &op); if (err) return ERR_PTR(err); return do_file_open_root(dentry, mnt, filename, &op); } EXPORT_SYMBOL(file_open_root);
-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) +long do_sys_open(int dfd, const char __user *filename, + struct open_how *how) { struct open_flags op; - int fd = build_open_flags(flags, mode, &op); + int fd = build_open_flags(*how, &op); int empty = 0; struct filename *tmp;
@@ -1071,7 +1109,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) if (!empty) op.open_flag &= ~O_EMPTYPATH;
- fd = get_unused_fd_flags(flags); + fd = get_unused_fd_flags(how->flags); if (fd >= 0) { struct file *f = do_filp_open(dfd, tmp, &op); if (IS_ERR(f)) { @@ -1088,19 +1126,35 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) { - if (force_o_largefile()) - flags |= O_LARGEFILE; - - return do_sys_open(AT_FDCWD, filename, flags, mode); + return ksys_open(filename, flags, mode); }
SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, umode_t, mode) { + struct open_how how = { + .flags = flags, + .mode = OPENHOW_MODE(flags, mode), + }; + + if (force_o_largefile()) + how.flags |= O_LARGEFILE; + + return do_sys_open(dfd, filename, &how); +} + +SYSCALL_DEFINE3(openat2, int, dfd, const char __user *, filename, + const struct open_how __user *, how) +{ + struct open_how tmp; + + if (copy_from_user(&tmp, how, sizeof(tmp))) + return -EFAULT; + if (force_o_largefile()) - flags |= O_LARGEFILE; + tmp.flags |= O_LARGEFILE;
- return do_sys_open(dfd, filename, flags, mode); + return do_sys_open(dfd, filename, &tmp); }
#ifdef CONFIG_COMPAT @@ -1110,7 +1164,11 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, */ COMPAT_SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) { - return do_sys_open(AT_FDCWD, filename, flags, mode); + struct open_how how = { + .flags = flags, + .mode = OPENHOW_MODE(flags, mode), + }; + return do_sys_open(AT_FDCWD, filename, &how); }
/* @@ -1119,7 +1177,11 @@ COMPAT_SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, */ COMPAT_SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags, umode_t, mode) { - return do_sys_open(dfd, filename, flags, mode); + struct open_how how = { + .flags = flags, + .mode = OPENHOW_MODE(flags, mode), + }; + return do_sys_open(dfd, filename, &how); } #endif
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 2868ae6c8fc1..e59917292213 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -4,13 +4,26 @@
#include <uapi/linux/fcntl.h>
-/* list of all valid flags for the open/openat flags argument: */ +/* Should open_how.mode be set for older syscalls wrappers? */ +#define OPENHOW_MODE(flags, mode) \ + (((flags) | (O_CREAT | __O_TMPFILE)) ? (mode) : 0) + +/* List of all valid flags for the open/openat flags argument: */ #define VALID_OPEN_FLAGS \ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_EMPTYPATH)
+/* List of all valid flags for the how->upgrade_mask argument: */ +#define VALID_UPGRADE_FLAGS \ + (UPGRADE_NOWRITE | UPGRADE_NOREAD) + +/* List of all valid flags for the how->resolve argument: */ +#define VALID_RESOLVE_FLAGS \ + (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ + RESOLVE_BENEATH | RESOLVE_IN_ROOT) + #ifndef force_o_largefile #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T)) #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index f7df213405ea..a3aede2b3a91 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2515,8 +2515,8 @@ extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs, struct file *filp); extern int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len); -extern long do_sys_open(int dfd, const char __user *filename, int flags, - umode_t mode); +extern long do_sys_open(int dfd, const char __user *filename, + struct open_how *how); extern struct file *file_open_name(struct filename *, int, umode_t); extern struct file *filp_open(const char *, int, umode_t); extern struct file *file_open_root(struct dentry *, struct vfsmount *, diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2bcef4c70183..227303532bb7 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1369,15 +1369,21 @@ static inline int ksys_close(unsigned int fd) return __close_fd(current->files, fd); }
-extern long do_sys_open(int dfd, const char __user *filename, int flags, - umode_t mode); +extern long do_sys_open(int dfd, const char __user *filename, + struct open_how *how);
static inline long ksys_open(const char __user *filename, int flags, umode_t mode) { + struct open_how how = { + .flags = flags, + .mode = OPENHOW_MODE(flags, mode), + }; + if (force_o_largefile()) - flags |= O_LARGEFILE; - return do_sys_open(AT_FDCWD, filename, flags, mode); + how.flags |= O_LARGEFILE; + + return do_sys_open(AT_FDCWD, filename, &how); }
extern long do_sys_truncate(const char __user *pathname, loff_t length); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index a87904daf103..67486188918b 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -845,8 +845,11 @@ __SYSCALL(__NR_fsmount, sys_fsmount) #define __NR_fspick 433 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_openat2 435 +__SYSCALL(__NR_openat2, sys_openat2) + #undef __NR_syscalls -#define __NR_syscalls 434 +#define __NR_syscalls 435
/* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 1d338357df8a..b7c904f0fca9 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -93,5 +93,43 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+/** + * Arguments for how openat2(2) should open the target path. If @extra is zero, + * then openat2(2) is identical to openat(2). + * + * @flags: O_* flags (unknown flags ignored). + * @mode: O_CREAT file mode (ignored otherwise). + * @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise). + * @resolve: RESOLVE_* flags (-EINVAL on unknown flags). + * @reserved: reserved for future extensions, must be zeroed. + */ +struct open_how { + __u32 flags; + union { + __u16 mode; + __u16 upgrade_mask; + }; + __u16 resolve; + __u64 reserved[7]; /* must be zeroed */ +}; + +/* how->resolve flags for openat2(2). */ +#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings + (includes bind-mounts). */ +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style + "magic-links". */ +#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks + (implies OEXT_NO_MAGICLINKS) */ +#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like + "..", symlinks, and absolute + paths which escape the dirfd. */ +#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." + be scoped inside the dirfd + (similar to chroot(2)). */ + +/* how->upgrade flags for openat2(2). */ +/* First bit is reserved for a future UPGRADE_NOEXEC flag. */ +#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */ +#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */
#endif /* _UAPI_LINUX_FCNTL_H */
On 06/07/2019 16.57, Aleksa Sarai wrote:
--- a/fs/open.c +++ b/fs/open.c @@ -928,24 +928,32 @@ struct file *open_with_fake_path(const struct path *path, int flags, } EXPORT_SYMBOL(open_with_fake_path); -static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op) +static inline int build_open_flags(struct open_how how, struct open_flags *op) {
How does passing such a huge struct by value affect code generation? Does gcc actually inline the function (and does it even inline the old one given that it's already non-trivial and has more than one caller).
int lookup_flags = 0;
- int acc_mode = ACC_MODE(flags);
- int opath_mask = 0;
- int acc_mode = ACC_MODE(how.flags);
- if (how.resolve & ~VALID_RESOLVE_FLAGS)
return -EINVAL;
- if (!(how.flags & (O_PATH | O_CREAT | __O_TMPFILE)) && how.mode != 0)
return -EINVAL;
- if (memchr_inv(how.reserved, 0, sizeof(how.reserved)))
return -EINVAL;
How about passing how by const reference, and copy the few fields you need to local variables. That would at least simplify this patch by eliminating a lot of the
- flags &= VALID_OPEN_FLAGS;
- how.flags &= VALID_OPEN_FLAGS;
- if (flags & (O_CREAT | __O_TMPFILE))
op->mode = (mode & S_IALLUGO) | S_IFREG;
- if (how.flags & (O_CREAT | __O_TMPFILE))
op->mode = (how.mode & S_IALLUGO) | S_IFREG;
churn.
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 2868ae6c8fc1..e59917292213 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -4,13 +4,26 @@ #include <uapi/linux/fcntl.h> -/* list of all valid flags for the open/openat flags argument: */ +/* Should open_how.mode be set for older syscalls wrappers? */ +#define OPENHOW_MODE(flags, mode) \
- (((flags) | (O_CREAT | __O_TMPFILE)) ? (mode) : 0)
Typo: (((flags) & (O_CREAT | __O_TMPFILE)) ? (mode) : 0)
+/**
- Arguments for how openat2(2) should open the target path. If @extra is zero,
- then openat2(2) is identical to openat(2).
- @flags: O_* flags (unknown flags ignored).
- @mode: O_CREAT file mode (ignored otherwise).
should probably say "O_CREAT/O_TMPFILE file mode".
- @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
- @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
- @reserved: reserved for future extensions, must be zeroed.
- */
+struct open_how {
- __u32 flags;
- union {
__u16 mode;
__u16 upgrade_mask;
- };
- __u16 resolve;
So mode and upgrade_mask are naturally u16 aka mode_t. And yes, they probably never need to be used together, so the union works. That then makes the next member 2-byte aligned, so using a u16 for the resolve flags brings us to an 8-byte boundary, and 11 unused flag bits should be enough for a while. But it seems a bit artificial to cram all this together and then add 56 bytes of reserved space.
Rasmus
On 2019-07-18, Rasmus Villemoes linux@rasmusvillemoes.dk wrote:
On 06/07/2019 16.57, Aleksa Sarai wrote:
--- a/fs/open.c +++ b/fs/open.c @@ -928,24 +928,32 @@ struct file *open_with_fake_path(const struct path *path, int flags, } EXPORT_SYMBOL(open_with_fake_path); -static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op) +static inline int build_open_flags(struct open_how how, struct open_flags *op) {
How does passing such a huge struct by value affect code generation? Does gcc actually inline the function (and does it even inline the old one given that it's already non-trivial and has more than one caller).
I'm not sure, but I'll just do what you suggested with passing a const reference and just copying the few fields that actually are touched by this function.
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 2868ae6c8fc1..e59917292213 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -4,13 +4,26 @@ #include <uapi/linux/fcntl.h> -/* list of all valid flags for the open/openat flags argument: */ +/* Should open_how.mode be set for older syscalls wrappers? */ +#define OPENHOW_MODE(flags, mode) \
- (((flags) | (O_CREAT | __O_TMPFILE)) ? (mode) : 0)
Typo: (((flags) & (O_CREAT | __O_TMPFILE)) ? (mode) : 0)
Yup, thanks. I'm not sure why my tests passed on v9 with this bug (they didn't pass in my v10-draft until I fixed this bug earlier today).
+/**
- Arguments for how openat2(2) should open the target path. If @extra is zero,
- then openat2(2) is identical to openat(2).
- @flags: O_* flags (unknown flags ignored).
- @mode: O_CREAT file mode (ignored otherwise).
should probably say "O_CREAT/O_TMPFILE file mode".
:+1:
- @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
- @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
- @reserved: reserved for future extensions, must be zeroed.
- */
+struct open_how {
- __u32 flags;
- union {
__u16 mode;
__u16 upgrade_mask;
- };
- __u16 resolve;
So mode and upgrade_mask are naturally u16 aka mode_t. And yes, they probably never need to be used together, so the union works. That then makes the next member 2-byte aligned, so using a u16 for the resolve flags brings us to an 8-byte boundary, and 11 unused flag bits should be enough for a while. But it seems a bit artificial to cram all this together and then add 56 bytes of reserved space.
I will happily admit that padding to 64 bytes is probably _very_ extreme (I picked it purely because it's the size of a cache-line so anything bigger makes even less sense). I was hoping someone would suggest a better size once I posted the patchset, since I couldn't think of a good answer myself.
Do you have any suggestions for a better layout or padding size?
On Sat, Jul 6, 2019 at 5:00 PM Aleksa Sarai cyphar@cyphar.com wrote:
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 9e7704e44f6d..1703d048c141 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -461,6 +461,7 @@ 530 common getegid sys_getegid 531 common geteuid sys_geteuid 532 common getppid sys_getppid +533 common openat2 sys_openat2 # all other architectures have common numbers for new syscall, alpha # is the exception. 534 common pidfd_send_signal sys_pidfd_send_signal
My plan here was to add new syscalls in the same order as everwhere else, just with the number 110 higher. In the long run, I hope we can automate this.
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index aaf479a9e92d..4ad262698396 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -447,3 +447,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2
434 is already used in linux-next, I suggest you use 437 (Palmer just submitted fchmodat4, which could become 436).
+/**
- Arguments for how openat2(2) should open the target path. If @extra is zero,
- then openat2(2) is identical to openat(2).
- @flags: O_* flags (unknown flags ignored).
- @mode: O_CREAT file mode (ignored otherwise).
- @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
- @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
- @reserved: reserved for future extensions, must be zeroed.
- */
+struct open_how {
__u32 flags;
union {
__u16 mode;
__u16 upgrade_mask;
};
__u16 resolve;
__u64 reserved[7]; /* must be zeroed */
+};
We can have system calls with up to six arguments on all architectures, so this could still be done more conventionally without the indirection: like
long openat2(int dfd, const char __user * filename, int flags, mode_t mode_mask, __u16 resolve);
In fact, that seems similar enough to the existing openat() that I think you could also just add the fifth argument to the existing call when a newly defined flag is set, similarly to how we only use the 'mode' argument when O_CREAT or O_TMPFILE are set.
--- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h
This file seems to lack a declaration for the system call, which means it will cause a build failure on some architectures, e.g. arch/arc/kernel/sys.c:
#define __SYSCALL(nr, call) [nr] = (call), void *sys_call_table[NR_syscalls] = { [0 ... NR_syscalls-1] = sys_ni_syscall, #include <asm/unistd.h> };
Arnd
On 2019-07-18, Arnd Bergmann arnd@arndb.de wrote:
On Sat, Jul 6, 2019 at 5:00 PM Aleksa Sarai cyphar@cyphar.com wrote:
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 9e7704e44f6d..1703d048c141 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -461,6 +461,7 @@ 530 common getegid sys_getegid 531 common geteuid sys_geteuid 532 common getppid sys_getppid +533 common openat2 sys_openat2 # all other architectures have common numbers for new syscall, alpha # is the exception. 534 common pidfd_send_signal sys_pidfd_send_signal
My plan here was to add new syscalls in the same order as everwhere else, just with the number 110 higher. In the long run, I hope we can automate this.
Alright, I will adjust this.
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index aaf479a9e92d..4ad262698396 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -447,3 +447,4 @@ 431 common fsconfig sys_fsconfig 432 common fsmount sys_fsmount 433 common fspick sys_fspick +434 common openat2 sys_openat2
434 is already used in linux-next, I suggest you use 437 (Palmer just submitted fchmodat4, which could become 436).
437 sounds good to me.
+/**
- Arguments for how openat2(2) should open the target path. If @extra is zero,
- then openat2(2) is identical to openat(2).
- @flags: O_* flags (unknown flags ignored).
- @mode: O_CREAT file mode (ignored otherwise).
- @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
- @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
- @reserved: reserved for future extensions, must be zeroed.
- */
+struct open_how {
__u32 flags;
union {
__u16 mode;
__u16 upgrade_mask;
};
__u16 resolve;
__u64 reserved[7]; /* must be zeroed */
+};
We can have system calls with up to six arguments on all architectures, so this could still be done more conventionally without the indirection: like
long openat2(int dfd, const char __user * filename, int flags, mode_t mode_mask, __u16 resolve);
In fact, that seems similar enough to the existing openat() that I think you could also just add the fifth argument to the existing call when a newly defined flag is set, similarly to how we only use the 'mode' argument when O_CREAT or O_TMPFILE are set.
I considered doing this (and even had a preliminary version of it), but I discovered that I was not in favour of this idea -- once I started to write tests using it -- for a few reasons:
1. It doesn't really allow for clean extension for a future 6th argument (because you are using up O_* flags to signify "use the next argument", and O_* flags don't give -EINVAL if they're unknown). Now, yes you can do the on-start runtime check that everyone does -- but I've never really liked having to do it.
Having reserved padding for later extensions (that is actually checked and gives -EINVAL) matches more modern syscall designs.
2. I really was hoping that the variadic openat(2) could be done away using this union setup (Linus said he didn't like it, and suggested using something like 'struct stat' as an argument for openat(2) -- though personally I am not sure I would personally like to use an interface like that).
3. In order to avoid wasting a syscall argument for mode/mask you need to either have something like your suggested mode_mask (which makes the syscall arguments less consistent) or have some sort of mode-like argument that is treated specially (which is really awful on multiple levels -- this one I also tried and even wrote my original tests using). And in both cases, the shims for open{,at}(2) are somewhat less clean.
All of that being said, I'd be happy to switch to whatever you think makes the most sense. As long as it's possible to get an O_PATH with RESOLVE_IN_ROOT set, I'm happy.
--- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h
This file seems to lack a declaration for the system call, which means it will cause a build failure on some architectures, e.g. arch/arc/kernel/sys.c:
#define __SYSCALL(nr, call) [nr] = (call), void *sys_call_table[NR_syscalls] = { [0 ... NR_syscalls-1] = sys_ni_syscall, #include <asm/unistd.h> };
Thanks, I will fix this.
On Thu, Jul 18, 2019 at 6:12 PM Aleksa Sarai cyphar@cyphar.com wrote:
On 2019-07-18, Arnd Bergmann arnd@arndb.de wrote:
On Sat, Jul 6, 2019 at 5:00 PM Aleksa Sarai cyphar@cyphar.com wrote:
In fact, that seems similar enough to the existing openat() that I think you could also just add the fifth argument to the existing call when a newly defined flag is set, similarly to how we only use the 'mode' argument when O_CREAT or O_TMPFILE are set.
I considered doing this (and even had a preliminary version of it), but I discovered that I was not in favour of this idea -- once I started to write tests using it -- for a few reasons:
It doesn't really allow for clean extension for a future 6th argument (because you are using up O_* flags to signify "use the next argument", and O_* flags don't give -EINVAL if they're unknown). Now, yes you can do the on-start runtime check that everyone does -- but I've never really liked having to do it.
Having reserved padding for later extensions (that is actually checked and gives -EINVAL) matches more modern syscall designs.
I really was hoping that the variadic openat(2) could be done away using this union setup (Linus said he didn't like it, and suggested using something like 'struct stat' as an argument for openat(2) -- though personally I am not sure I would personally like to use an interface like that).
In order to avoid wasting a syscall argument for mode/mask you need to either have something like your suggested mode_mask (which makes the syscall arguments less consistent) or have some sort of mode-like argument that is treated specially (which is really awful on multiple levels -- this one I also tried and even wrote my original tests using). And in both cases, the shims for open{,at}(2) are somewhat less clean.
These are all good reasons, thanks for providing the background.
All of that being said, I'd be happy to switch to whatever you think makes the most sense. As long as it's possible to get an O_PATH with RESOLVE_IN_ROOT set, I'm happy.
I don't feel I should be in charge of making the decision. I'd still prefer avoiding the indirect argument structure because
4. it's inconsistent with most other syscalls
5. you get the same problem with seccomp and strace that clone3() has -- these and others only track the register arguments by default.
6. copying the structure adds a small overhead compared to passing registers
7. the calling conventions may be inconvenient for a user space library, so you end up with different prototypes for the low-level syscall and the libc abstraction.
I don't see any of the above seven points as a showstopper either way, so I hope someone else has a strong opinion and can make the decision easier for you.
In the meantime just keep what you have, so you don't have to change it multiple times.
Arnd
On Thu, Jul 18, 2019 at 11:29:50PM +0200, Arnd Bergmann wrote: [...]
- you get the same problem with seccomp and strace that clone3() has -- these and others only track the register arguments by default.
Just for the record, this is definitely not the case for strace: it decodes arrays, structures, netlink messages, and so on by default.
On Fri, Jul 19, 2019 at 05:12:18AM +0300, Dmitry V. Levin wrote:
On Thu, Jul 18, 2019 at 11:29:50PM +0200, Arnd Bergmann wrote: [...]
- you get the same problem with seccomp and strace that clone3() has -- these and others only track the register arguments by default.
Just for the record, this is definitely not the case for strace: it decodes arrays, structures, netlink messages, and so on by default.
There sure is value in trying to design syscalls that can be handled nicely by seccomp but that shouldn't become a burden on designing extensible syscalls. I suggested a session for Ksummit where we can discuss if and how we can make seccomp more compatible with pointer-args in syscalls.
Christian
On Sun, Jul 07, 2019 at 12:57:35AM +1000, Aleksa Sarai wrote: [...]
+/**
- Arguments for how openat2(2) should open the target path. If @extra is zero,
- then openat2(2) is identical to openat(2).
- @flags: O_* flags (unknown flags ignored).
What was the rationale for implementing this semantics? Ignoring unknown flags makes potential extension of this new interface problematic. This has bitten us many times already, so ...
- @mode: O_CREAT file mode (ignored otherwise).
- @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise).
- @resolve: RESOLVE_* flags (-EINVAL on unknown flags).
... could you consider implementing this (-EINVAL on unknown flags) semantics for @flags as well, please?
On 2019-07-19, Dmitry V. Levin ldv@altlinux.org wrote:
On Sun, Jul 07, 2019 at 12:57:35AM +1000, Aleksa Sarai wrote: [...]
+/**
- Arguments for how openat2(2) should open the target path. If @extra is zero,
- then openat2(2) is identical to openat(2).
- @flags: O_* flags (unknown flags ignored).
What was the rationale for implementing this semantics? Ignoring unknown flags makes potential extension of this new interface problematic. This has bitten us many times already, so ...
I am mirroring the semantics of open(2) and openat(2).
To be clear, I am in favour of doing it -- and it would definitely be possible to implement it with -EINVAL (you would just mask off ~VALID_OPEN_FLAGS for the older syscalls). But Linus' response to my point about (the lack of) -EINVAL for unknown open(2) flags gave me the impression he would be against this idea (though I might be misunderstanding the point he was making).
Previously, using "%m" in a ksft_* format string can result in strange output because the errno value wasn't saved before calling other libc functions. The solution is to simply save and restore the errno before we format the user-supplied format string.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/kselftest.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/tools/testing/selftests/kselftest.h b/tools/testing/selftests/kselftest.h index ec15c4f6af55..0ac49d91a260 100644 --- a/tools/testing/selftests/kselftest.h +++ b/tools/testing/selftests/kselftest.h @@ -10,6 +10,7 @@ #ifndef __KSELFTEST_H #define __KSELFTEST_H
+#include <errno.h> #include <stdlib.h> #include <unistd.h> #include <stdarg.h> @@ -81,58 +82,68 @@ static inline void ksft_print_cnts(void)
static inline void ksft_print_msg(const char *msg, ...) { + int saved_errno = errno; va_list args;
va_start(args, msg); printf("# "); + errno = saved_errno; vprintf(msg, args); va_end(args); }
static inline void ksft_test_result_pass(const char *msg, ...) { + int saved_errno = errno; va_list args;
ksft_cnt.ksft_pass++;
va_start(args, msg); printf("ok %d ", ksft_test_num()); + errno = saved_errno; vprintf(msg, args); va_end(args); }
static inline void ksft_test_result_fail(const char *msg, ...) { + int saved_errno = errno; va_list args;
ksft_cnt.ksft_fail++;
va_start(args, msg); printf("not ok %d ", ksft_test_num()); + errno = saved_errno; vprintf(msg, args); va_end(args); }
static inline void ksft_test_result_skip(const char *msg, ...) { + int saved_errno = errno; va_list args;
ksft_cnt.ksft_xskip++;
va_start(args, msg); printf("not ok %d # SKIP ", ksft_test_num()); + errno = saved_errno; vprintf(msg, args); va_end(args); }
static inline void ksft_test_result_error(const char *msg, ...) { + int saved_errno = errno; va_list args;
ksft_cnt.ksft_error++;
va_start(args, msg); printf("not ok %d # error ", ksft_test_num()); + errno = saved_errno; vprintf(msg, args); va_end(args); } @@ -152,10 +163,12 @@ static inline int ksft_exit_fail(void)
static inline int ksft_exit_fail_msg(const char *msg, ...) { + int saved_errno = errno; va_list args;
va_start(args, msg); printf("Bail out! "); + errno = saved_errno; vprintf(msg, args); va_end(args);
@@ -178,10 +191,12 @@ static inline int ksft_exit_xpass(void) static inline int ksft_exit_skip(const char *msg, ...) { if (msg) { + int saved_errno = errno; va_list args;
va_start(args, msg); printf("not ok %d # SKIP ", 1 + ksft_test_num()); + errno = saved_errno; vprintf(msg, args); va_end(args); } else {
Test all of the various openat2(2) flags, as well as how file descriptor re-opening works. A small stress-test of a symlink-rename attack is included to show that the protections against ".."-based attacks are sufficient.
In addition, the memfd selftest is fixed to no longer depend on the now-disallowed functionality of upgrading an O_RDONLY descriptor to O_RDWR.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/memfd/memfd_test.c | 7 +- tools/testing/selftests/openat2/.gitignore | 1 + tools/testing/selftests/openat2/Makefile | 12 + tools/testing/selftests/openat2/helpers.c | 162 +++++++ tools/testing/selftests/openat2/helpers.h | 114 +++++ .../testing/selftests/openat2/linkmode_test.c | 325 ++++++++++++++ .../selftests/openat2/rename_attack_test.c | 124 ++++++ .../testing/selftests/openat2/resolve_test.c | 395 ++++++++++++++++++ 9 files changed, 1139 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/openat2/.gitignore create mode 100644 tools/testing/selftests/openat2/Makefile create mode 100644 tools/testing/selftests/openat2/helpers.c create mode 100644 tools/testing/selftests/openat2/helpers.h create mode 100644 tools/testing/selftests/openat2/linkmode_test.c create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c create mode 100644 tools/testing/selftests/openat2/resolve_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 9781ca79794a..42a27d029c10 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -37,6 +37,7 @@ TARGETS += powerpc TARGETS += proc TARGETS += pstore TARGETS += ptrace +TARGETS += openat2 TARGETS += rseq TARGETS += rtc TARGETS += seccomp diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index c67d32eeb668..e71df3d3e55d 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -925,7 +925,7 @@ static void test_share_mmap(char *banner, char *b_suffix) */ static void test_share_open(char *banner, char *b_suffix) { - int fd, fd2; + int procfd, fd, fd2;
printf("%s %s %s\n", memfd_str, banner, b_suffix);
@@ -950,13 +950,16 @@ static void test_share_open(char *banner, char *b_suffix) mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK); mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+ /* We cannot do a MAY_WRITE re-open of an O_RDONLY fd. */ + procfd = mfd_assert_open(fd2, O_PATH, 0); close(fd2); - fd2 = mfd_assert_open(fd, O_RDWR, 0); + fd2 = mfd_assert_open(procfd, O_WRONLY, 0);
mfd_assert_add_seals(fd2, F_SEAL_SEAL); mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL); mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+ close(procfd); close(fd2); close(fd); } diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore new file mode 100644 index 000000000000..bd68f6c3fd07 --- /dev/null +++ b/tools/testing/selftests/openat2/.gitignore @@ -0,0 +1 @@ +/*_test diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile new file mode 100644 index 000000000000..8235a49928f6 --- /dev/null +++ b/tools/testing/selftests/openat2/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0 + +CFLAGS += -Wall -O2 -g +TEST_GEN_PROGS := linkmode_test resolve_test rename_attack_test + +include ../lib.mk + +$(OUTPUT)/linkmode_test: linkmode_test.c helpers.o +$(OUTPUT)/rename_attack_test: rename_attack_test.c helpers.o +$(OUTPUT)/resolve_test: resolve_test.c helpers.o + +EXTRA_CLEAN = helpers.o $(wildcard /tmp/ksft-openat2-*) diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c new file mode 100644 index 000000000000..c16213ff1946 --- /dev/null +++ b/tools/testing/selftests/openat2/helpers.c @@ -0,0 +1,162 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <stdbool.h> +#include <string.h> +#include <syscall.h> +#include <limits.h> + +#include "helpers.h" + +int sys_openat2(int dfd, const char *path, const struct open_how *how) +{ + int ret = syscall(__NR_openat2, dfd, path, how); + return ret >= 0 ? ret : -errno; +} + +int sys_openat(int dfd, const char *path, const struct open_how *how) +{ + int ret = openat(dfd, path, how->flags, how->mode); + return ret >= 0 ? ret : -errno; +} + +int sys_renameat2(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, unsigned int flags) +{ + int ret = syscall(__NR_renameat2, olddirfd, oldpath, + newdirfd, newpath, flags); + return ret >= 0 ? ret : -errno; +} + +char *openat_flags(unsigned int flags) +{ + char *flagset, *accmode = "(none)"; + + switch (flags & 0x03) { + case O_RDWR: + accmode = "O_RDWR"; + break; + case O_RDONLY: + accmode = "O_RDONLY"; + break; + case O_WRONLY: + accmode = "O_WRONLY"; + break; + } + + E_asprintf(&flagset, "%s%s%s", + (flags & O_PATH) ? "O_PATH|" : "", + (flags & O_CREAT) ? "O_CREAT|" : "", + accmode); + + return flagset; +} + +char *openat2_flags(const struct open_how *how) +{ + char *p; + char *flags_set, *resolve_set, *acc_set, *set; + + flags_set = openat_flags(how->flags); + + E_asprintf(&resolve_set, "%s%s%s%s%s0", + (how->resolve & RESOLVE_NO_XDEV) ? "RESOLVE_NO_XDEV|" : "", + (how->resolve & RESOLVE_NO_MAGICLINKS) ? "RESOLVE_NO_MAGICLINKS|" : "", + (how->resolve & RESOLVE_NO_SYMLINKS) ? "RESOLVE_NO_SYMLINKS|" : "", + (how->resolve & RESOLVE_BENEATH) ? "RESOLVE_BENEATH|" : "", + (how->resolve & RESOLVE_IN_ROOT) ? "RESOLVE_IN_ROOT|" : ""); + + /* Remove trailing "|0". */ + p = strstr(resolve_set, "|0"); + if (p) + *p = '\0'; + + if (how->flags & O_PATH) + E_asprintf(&acc_set, ", upgrade_mask=%s%s0", + (how->upgrade_mask & UPGRADE_NOREAD) ? "UPGRADE_NOREAD|" : "", + (how->upgrade_mask & UPGRADE_NOWRITE) ? "UPGRADE_NOWRITE|" : ""); + else if (how->flags & O_CREAT) + E_asprintf(&acc_set, ", mode=0%o", how->mode); + else + acc_set = strdup(""); + + /* Remove trailing "|0". */ + p = strstr(acc_set, "|0"); + if (p) + *p = '\0'; + + /* And now generate our flagset. */ + E_asprintf(&set, "[flags=%s, resolve=%s%s]", + flags_set, resolve_set, acc_set); + + free(flags_set); + free(resolve_set); + free(acc_set); + return set; +} + +int touchat(int dfd, const char *path) +{ + int fd = openat(dfd, path, O_CREAT); + if (fd >= 0) + close(fd); + return fd; +} + +char *fdreadlink(int fd) +{ + char *target, *tmp; + + E_asprintf(&tmp, "/proc/self/fd/%d", fd); + + target = malloc(PATH_MAX); + if (!target) + ksft_exit_fail_msg("fdreadlink: malloc failed\n"); + memset(target, 0, PATH_MAX); + + E_readlink(tmp, target, PATH_MAX); + free(tmp); + return target; +} + +bool fdequal(int fd, int dfd, const char *path) +{ + char *fdpath, *dfdpath, *other; + bool cmp; + + fdpath = fdreadlink(fd); + dfdpath = fdreadlink(dfd); + + if (!path) + E_asprintf(&other, "%s", dfdpath); + else if (*path == '/') + E_asprintf(&other, "%s", path); + else + E_asprintf(&other, "%s/%s", dfdpath, path); + + cmp = !strcmp(fdpath, other); + if (!cmp) + ksft_print_msg("fdequal: expected '%s' but got '%s'\n", other, fdpath); + + free(fdpath); + free(dfdpath); + free(other); + return cmp; +} + +void test_openat2_supported(void) +{ + struct open_how how = {}; + int fd = sys_openat2(AT_FDCWD, ".", &how); + if (fd == -ENOSYS) + ksft_exit_skip("openat2(2) unsupported on this kernel\n"); + if (fd < 0) + ksft_exit_fail_msg("openat2(2) supported check failed: %s\n", strerror(-fd)); + close(fd); +} diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h new file mode 100644 index 000000000000..ccdd1fc874c2 --- /dev/null +++ b/tools/testing/selftests/openat2/helpers.h @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#ifndef __RESOLVEAT_H__ +#define __RESOLVEAT_H__ + +#define _GNU_SOURCE +#include <stdint.h> +#include "../kselftest.h" + +#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X))) + +#ifndef SYS_openat2 +#ifndef __NR_openat2 +#define __NR_openat2 434 +#endif /* __NR_openat2 */ +#define SYS_openat2 __NR_openat2 +#endif /* SYS_openat2 */ + +/** + * Arguments for how openat2(2) should open the target path. If @extra is zero, + * then openat2 is identical to openat(2). Only one of @mode or @upgrade_mask + * may be set at any given time. + * + * @flags: O_* flags (unknown flags ignored). + * @mode: O_CREAT file mode (ignored otherwise). + * @upgrade_mask: restrict how the O_PATH may be re-opened (ignored otherwise). + * @resolve: RESOLVE_* flags (-EINVAL on unknown flags). + * @reserved: reserved for future extensions, must be zeroed. + */ +struct open_how { + uint32_t flags; + union { + uint16_t mode; + uint16_t upgrade_mask; + }; + uint16_t resolve; + uint64_t reserved[7]; /* must be zeroed */ +}; + +#ifndef RESOLVE_INROOT +/* how->resolve flags for openat2(2). */ +#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings + (includes bind-mounts). */ +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style + "magic-links". */ +#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks + (implies OEXT_NO_MAGICLINKS) */ +#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like + "..", symlinks, and absolute + paths which escape the dirfd. */ +#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." + be scoped inside the dirfd + (similar to chroot(2)). */ +#endif /* RESOLVE_IN_ROOT */ + +#ifndef UPGRADE_NOREAD +/* how->upgrade flags for openat2(2). */ +/* First bit is reserved for a future UPGRADE_NOEXEC flag. */ +#define UPGRADE_NOREAD 0x02 /* Block re-opening with MAY_READ. */ +#define UPGRADE_NOWRITE 0x04 /* Block re-opening with MAY_WRITE. */ +#endif /* UPGRADE_NOREAD */ + +#ifndef O_EMPTYPATH +#define O_EMPTYPATH 040000000 +#endif /* O_EMPTYPATH */ + +#define E_func(func, ...) \ + do { \ + if (func(__VA_ARGS__) < 0) \ + ksft_exit_fail_msg("%s:%d %s failed\n", \ + __FILE__, __LINE__, #func);\ + } while (0) + +#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__) +#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__) +#define E_touchat(...) E_func(touchat, __VA_ARGS__) +#define E_readlink(...) E_func(readlink, __VA_ARGS__) +#define E_fstatat(...) E_func(fstatat, __VA_ARGS__) +#define E_asprintf(...) E_func(asprintf, __VA_ARGS__) +#define E_fchdir(...) E_func(fchdir, __VA_ARGS__) +#define E_mount(...) E_func(mount, __VA_ARGS__) +#define E_unshare(...) E_func(unshare, __VA_ARGS__) +#define E_setresuid(...) E_func(setresuid, __VA_ARGS__) +#define E_chmod(...) E_func(chmod, __VA_ARGS__) + +#define E_assert(expr, msg, ...) \ + do { \ + if (!(expr)) \ + ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \ + __FILE__, __LINE__, #expr, ##__VA_ARGS__); \ + } while (0) + +typedef int (*openfunc_t)(int dfd, const char *path, const struct open_how *how); + +int sys_openat2(int dfd, const char *path, const struct open_how *how); +char *openat2_flags(const struct open_how *how); + +int sys_openat(int dfd, const char *path, const struct open_how *how); +char *openat_flags(unsigned int flags); + +int sys_renameat2(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, unsigned int flags); + +int touchat(int dfd, const char *path); +char *fdreadlink(int fd); +bool fdequal(int fd, int dfd, const char *path); + +void test_openat2_supported(void); + +#endif /* __RESOLVEAT_H__ */ diff --git a/tools/testing/selftests/openat2/linkmode_test.c b/tools/testing/selftests/openat2/linkmode_test.c new file mode 100644 index 000000000000..37124bf97fef --- /dev/null +++ b/tools/testing/selftests/openat2/linkmode_test.c @@ -0,0 +1,325 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <stdbool.h> +#include <string.h> + +#include "../kselftest.h" +#include "helpers.h" + +static mode_t fdmode(int fd) +{ + char *fdpath; + struct stat statbuf; + mode_t mode; + + E_asprintf(&fdpath, "/proc/self/fd/%d", fd); + E_fstatat(AT_FDCWD, fdpath, &statbuf, AT_SYMLINK_NOFOLLOW); + mode = (statbuf.st_mode & ~S_IFMT); + free(fdpath); + + return mode; +} + +static int reopen_proc(int fd, unsigned int flags) +{ + int ret, saved_errno; + char *fdpath; + + E_asprintf(&fdpath, "/proc/self/fd/%d", fd); + ret = open(fdpath, flags); + saved_errno = errno; + free(fdpath); + + return ret >= 0 ? ret : -saved_errno; +} + +static int reopen_oemptypath(int fd, unsigned int flags) +{ + int ret = openat(fd, "", O_EMPTYPATH | flags); + return ret >= 0 ? ret : -errno; +} + +struct reopen_test { + openfunc_t open; + mode_t chmod_mode; + struct { + struct open_how how; + mode_t mode; + int err; + } orig, new; +}; + +static bool reopen(int fd, struct reopen_test *test) +{ + int newfd; + mode_t proc_mode; + bool failed = false; + + /* Check that the proc mode is correct. */ + proc_mode = fdmode(fd); + if (proc_mode != test->orig.mode) { + ksft_print_msg("incorrect fdmode (got[%o] != want[%o])\n", + proc_mode, test->orig.mode); + failed = true; + } + + /* Re-open through /proc. */ + newfd = reopen_proc(fd, test->new.how.flags); + if (newfd != test->new.err && (newfd < 0 || test->new.err < 0)) { + ksft_print_msg("/proc failure (%d != %d [%s])\n", + newfd, test->new.err, strerror(-test->new.err)); + failed = true; + } + if (newfd >= 0) { + proc_mode = fdmode(newfd); + if (proc_mode != test->new.mode) { + ksft_print_msg("/proc wrong fdmode (got[%o] != want[%o])\n", + proc_mode, test->new.mode); + failed = true; + } + close(newfd); + } + + /* Re-open with O_EMPTYPATH. */ + newfd = reopen_oemptypath(fd, test->new.how.flags); + if (newfd != test->new.err && (newfd < 0 || test->new.err < 0)) { + ksft_print_msg("O_EMPTYPATH failure (%d != %d [%s])\n", + newfd, test->new.err, strerror(-test->new.err)); + failed = true; + } + if (newfd >= 0) { + proc_mode = fdmode(newfd); + if (proc_mode != test->new.mode) { + ksft_print_msg("O_EMPTYPATH wrong fdmode (got[%o] != want[%o])\n", + proc_mode, test->new.mode); + failed = true; + } + close(newfd); + } + + return failed; +} + +void test_reopen_ordinary(bool privileged) +{ + int fd; + int err_access = privileged ? 0 : -EACCES; + char tmpfile[] = "/tmp/ksft-openat2-reopen-testfile.XXXXXX"; + + fd = mkstemp(tmpfile); + E_assert(fd >= 0, "mkstemp failed: %m\n"); + close(fd); + + struct reopen_test tests[] = { + /* Re-opening with the same mode should succeed. */ + { .open = sys_openat, .chmod_mode = 0400, + .orig.how.flags = O_RDONLY, .orig.mode = 0500, + .new.how.flags = O_RDONLY, .new.mode = 0500 }, + { .open = sys_openat, .chmod_mode = 0200, + .orig.how.flags = O_WRONLY, .orig.mode = 0300, + .new.how.flags = O_WRONLY, .new.mode = 0300 }, + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_RDWR, .orig.mode = 0700, + .new.how.flags = O_RDWR, .new.mode = 0700 }, + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_RDWR, .orig.mode = 0700, + .new.how.flags = O_RDONLY, .new.mode = 0500 }, + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_RDWR, .orig.mode = 0700, + .new.how.flags = O_WRONLY, .new.mode = 0300 }, + + /* + * Re-opening with a different mode will always fail (with an obvious + * carve-out for privileged users). + */ + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_RDONLY, .orig.mode = 0500, + .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access }, + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_WRONLY, .orig.mode = 0300, + .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access }, + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_RDONLY, .orig.mode = 0500, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + { .open = sys_openat, .chmod_mode = 0600, + .orig.how.flags = O_WRONLY, .orig.mode = 0300, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + + /* Doubly so if they didn't even have permissions at open-time. */ + { .open = sys_openat, .chmod_mode = 0400, + .orig.how.flags = O_RDONLY, .orig.mode = 0500, + .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access }, + { .open = sys_openat, .chmod_mode = 0200, + .orig.how.flags = O_WRONLY, .orig.mode = 0300, + .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access }, + { .open = sys_openat, .chmod_mode = 0400, + .orig.how.flags = O_RDONLY, .orig.mode = 0500, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + { .open = sys_openat, .chmod_mode = 0200, + .orig.how.flags = O_WRONLY, .orig.mode = 0300, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + + /* O_PATH re-opens (of ordinary files) will always work. */ + { .open = sys_openat, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0070, + .new.how.flags = O_WRONLY, .new.mode = 0300 }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0070, + .new.how.flags = O_WRONLY, .new.mode = 0300 }, + + { .open = sys_openat, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0070, + .new.how.flags = O_RDONLY, .new.mode = 0500 }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0070, + .new.how.flags = O_RDONLY, .new.mode = 0500 }, + + { .open = sys_openat, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0070, + .new.how.flags = O_RDWR, .new.mode = 0700 }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0070, + .new.how.flags = O_RDWR, .new.mode = 0700 }, + + /* + * openat2(2) UPGRADE_NO* flags. In the privileged case, the re-open + * will work but the mode will still be scoped to the mode (or'd with + * the open acc_mode). + */ + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0010, + .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE, + .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0010, + .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE, + .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0010, + .orig.how.upgrade_mask = UPGRADE_NOREAD | UPGRADE_NOWRITE, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0050, + .orig.how.upgrade_mask = UPGRADE_NOWRITE, + .new.how.flags = O_RDONLY, .new.mode = 0500 }, + + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0030, + .orig.how.upgrade_mask = UPGRADE_NOREAD, + .new.how.flags = O_WRONLY, .new.mode = 0300 }, + + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0030, + .orig.how.upgrade_mask = UPGRADE_NOREAD, + .new.how.flags = O_RDONLY, .new.mode = 0500, .new.err = err_access }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0050, + .orig.how.upgrade_mask = UPGRADE_NOWRITE, + .new.how.flags = O_WRONLY, .new.mode = 0300, .new.err = err_access }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0030, + .orig.how.upgrade_mask = UPGRADE_NOREAD, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + { .open = sys_openat2, .chmod_mode = 0000, + .orig.how.flags = O_PATH, .orig.mode = 0050, + .orig.how.upgrade_mask = UPGRADE_NOWRITE, + .new.how.flags = O_RDWR, .new.mode = 0700, .new.err = err_access }, + }; + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + int fd; + char *orig_flagset, *new_flagset; + struct reopen_test *test = &tests[i]; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + + E_chmod(tmpfile, test->chmod_mode); + + fd = test->open(AT_FDCWD, tmpfile, &test->orig.how); + E_assert(fd >= 0, "open '%s' failed: %m\n", tmpfile); + + /* Make sure that any EACCES we see is not from inode permissions. */ + E_chmod(tmpfile, 0777); + + if (reopen(fd, test)) + resultfn = ksft_test_result_fail; + + close(fd); + + new_flagset = openat_flags(test->new.how.flags); + if (test->open == sys_openat) + orig_flagset = openat_flags(test->orig.how.flags); + else if (test->open == sys_openat2) + orig_flagset = openat2_flags(&test->orig.how); + else + ksft_exit_fail_msg("unknown test->open\n"); + + resultfn("%sordinary reopen of (orig[%s]=%s, new=%s) chmod=%.3o %s\n", + privileged ? "privileged " : "", + test->open == sys_openat ? "openat" : "openat2", + orig_flagset, new_flagset, test->chmod_mode, + test->new.err < 0 ? strerror(-test->new.err) : "works"); + + free(new_flagset); + free(orig_flagset); + } + + unlink(tmpfile); +} + +void test_openat2_cloexec_test(void) +{ + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + struct open_how how = { + .flags = O_CLOEXEC | O_PATH | O_DIRECTORY, + }; + + int fd = sys_openat2(AT_FDCWD, ".", &how); + E_assert(fd >= 0, "open '.' failed: %m\n"); + + int flags = fcntl(fd, F_GETFD); + E_assert(flags >= 0, "F_GETFD failed: %m\n"); + + if (!(flags & FD_CLOEXEC)) + resultfn = ksft_test_result_fail; + + resultfn("openat2(O_CLOEXEC) works as expected\n"); +} + +int main(int argc, char **argv) +{ + bool privileged; + + ksft_print_header(); + test_openat2_supported(); + + /* + * Technically we should be checking CAP_DAC_OVERRIDE, but it's easier to + * just assume that euid=0 has the full capability set. + */ + privileged = (geteuid() == 0); + if (!privileged) + ksft_test_result_skip("privileged tests require euid == 0\n"); + else { + test_reopen_ordinary(privileged); + + E_setresuid(65534, 65534, 65534); + privileged = (geteuid() == 0); + } + + test_reopen_ordinary(privileged); + test_openat2_cloexec_test(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c new file mode 100644 index 000000000000..b5e2a68609f1 --- /dev/null +++ b/tools/testing/selftests/openat2/rename_attack_test.c @@ -0,0 +1,124 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <sys/mman.h> +#include <sys/prctl.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> +#include <syscall.h> +#include <limits.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* Construct a test directory with the following structure: + * + * root/ + * |-- a/ + * | `-- c/ + * `-- b/ + */ +int setup_testdir(void) +{ + int dfd; + char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX"; + + /* Make the top-level directory. */ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + E_mkdirat(dfd, "a", 0755); + E_mkdirat(dfd, "b", 0755); + E_mkdirat(dfd, "a/c", 0755); + + return dfd; +} + +/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */ +pid_t spawn_attack(int dirfd, char *a, char *b) +{ + pid_t child = fork(); + if (child != 0) + return child; + + /* If the parent (the test process) dies, kill ourselves too. */ + prctl(PR_SET_PDEATHSIG, SIGKILL); + + /* Swap @a and @b. */ + for (;;) + renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE); + exit(1); +} + +#define ROUNDS 400000 +void test_rename_attack(void) +{ + int dfd, afd, escaped_count = 0; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + pid_t child; + + dfd = setup_testdir(); + afd = openat(dfd, "a", O_PATH); + if (afd < 0) + ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n"); + + child = spawn_attack(dfd, "a/c", "b"); + + for (int i = 0; i < ROUNDS; i++) { + int fd; + bool failed; + struct open_how how = { + .flags = O_PATH, + .resolve = RESOLVE_IN_ROOT, + }; + char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../.."; + + fd = sys_openat2(afd, victim_path, &how); + if (fd < 0) + failed = (fd != -EXDEV); + else + failed = !fdequal(fd, afd, NULL); + + escaped_count += failed; + close(fd); + } + + if (escaped_count > 0) + resultfn = ksft_test_result_fail; + + resultfn("rename attack fails (expected 0 breakouts in %d runs, got %d)\n", + ROUNDS, escaped_count); + + /* Should be killed anyway, but might as well make sure. */ + kill(child, SIGKILL); +} + +int main(int argc, char **argv) +{ + ksft_print_header(); + test_openat2_supported(); + + test_rename_attack(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +} diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c new file mode 100644 index 000000000000..7f109cf6126a --- /dev/null +++ b/tools/testing/selftests/openat2/resolve_test.c @@ -0,0 +1,395 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018-2019 SUSE LLC. + */ + +#define _GNU_SOURCE +#include <fcntl.h> +#include <sched.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/mount.h> +#include <stdlib.h> +#include <stdbool.h> +#include <string.h> + +#include "../kselftest.h" +#include "helpers.h" + +/* + * Construct a test directory with the following structure: + * + * root/ + * |-- procexe -> /proc/self/exe + * |-- procroot -> /proc/self/root + * |-- root/ + * |-- mnt/ [mountpoint] + * | |-- self -> ../mnt/ + * | `-- absself -> /mnt/ + * |-- etc/ + * | `-- passwd + * |-- creatlink -> /newfile3 + * |-- relsym -> etc/passwd + * |-- abssym -> /etc/passwd + * |-- abscheeky -> /cheeky + * |-- abscheeky -> /cheeky + * `-- cheeky/ + * |-- absself -> / + * |-- self -> ../../root/ + * |-- garbageself -> /../../root/ + * |-- passwd -> ../cheeky/../cheeky/../etc/../etc/passwd + * |-- abspasswd -> /../cheeky/../cheeky/../etc/../etc/passwd + * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd + * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd + */ +int setup_testdir(void) +{ + int dfd, tmpfd; + char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX"; + + /* Unshare and make /tmp a new directory. */ + E_unshare(CLONE_NEWNS); + E_mount("", "/tmp", "", MS_PRIVATE, ""); + + /* Make the top-level directory. */ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + /* A sub-directory which is actually used for tests. */ + E_mkdirat(dfd, "root", 0755); + tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY); + if (tmpfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + close(dfd); + dfd = tmpfd; + + E_symlinkat("/proc/self/exe", dfd, "procexe"); + E_symlinkat("/proc/self/root", dfd, "procroot"); + E_mkdirat(dfd, "root", 0755); + + /* There is no mountat(2), so use chdir. */ + E_mkdirat(dfd, "mnt", 0755); + E_fchdir(dfd); + E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, ""); + E_symlinkat("../mnt/", dfd, "mnt/self"); + E_symlinkat("/mnt/", dfd, "mnt/absself"); + + E_mkdirat(dfd, "etc", 0755); + E_touchat(dfd, "etc/passwd"); + + E_symlinkat("/newfile3", dfd, "creatlink"); + E_symlinkat("etc/passwd", dfd, "relsym"); + E_symlinkat("/etc/passwd", dfd, "abssym"); + E_symlinkat("/cheeky", dfd, "abscheeky"); + + E_mkdirat(dfd, "cheeky", 0755); + + E_symlinkat("/", dfd, "cheeky/absself"); + E_symlinkat("../../root/", dfd, "cheeky/self"); + E_symlinkat("/../../root/", dfd, "cheeky/garbageself"); + + E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd"); + E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd"); + + E_symlinkat("../../../../../../../../../../../../../../etc/passwd", + dfd, "cheeky/dotdotlink"); + E_symlinkat("/../../../../../../../../../../../../../../etc/passwd", + dfd, "cheeky/garbagelink"); + + return dfd; +} + +struct basic_test { + const char *dir; + const char *path; + struct open_how how; + bool pass; + union { + int err; + const char *path; + } out; +}; + +void test_openat2_opath_tests(void) +{ + int rootfd; + char *procselfexe; + + E_asprintf(&procselfexe, "/proc/%d/exe", getpid()); + rootfd = setup_testdir(); + + struct basic_test tests[] = { + /** RESOLVE_BENEATH **/ + /* Attempts to cross dirfd should be blocked. */ + { .path = "/", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "..", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "../root/", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + /* Only relative paths that stay inside dirfd should work. */ + { .path = "root", .how.resolve = RESOLVE_BENEATH, + .out.path = "root", .pass = true }, + { .path = "etc", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc", .pass = true }, + { .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .path = "relsym", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH, + .out.path = "etc/passwd", .pass = true }, + { .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abssym", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + /* Tricky paths should fail. */ + { .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH, + .out.err = -EXDEV, .pass = false }, + + /** RESOLVE_IN_ROOT **/ + /* All attempts to cross the dirfd will be scoped-to-root. */ + { .path = "/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .path = "..", .how.resolve = RESOLVE_IN_ROOT, + .out.path = NULL, .pass = true }, + { .path = "../root/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .path = "../root/", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .path = "root", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "root", .pass = true }, + { .path = "etc", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc", .pass = true }, + { .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "relsym", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "abssym", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "abscheeky/abspasswd",.how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + { .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT, + .out.path = "etc/passwd", .pass = true }, + /* O_CREAT should handle trailing symlinks correctly. */ + { .path = "newfile1", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile1", .pass = true }, + { .path = "/newfile2", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile2", .pass = true }, + { .path = "/creatlink", .how.flags = O_CREAT, + .how.mode = 0700, + .how.resolve = RESOLVE_IN_ROOT, + .out.path = "newfile3", .pass = true }, + + /** RESOLVE_NO_XDEV **/ + /* Crossing *down* into a mountpoint is disallowed. */ + { .path = "mnt", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Crossing *up* out of a mountpoint is disallowed. */ + { .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "mnt", .pass = true }, + { .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + /* Jumping to "/" is ok, but later components cannot cross. */ + { .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/", .pass = true }, + { .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV, + .out.path = "/", .pass = true }, + { .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + { .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV, + .out.err = -EXDEV, .pass = false }, + + /** RESOLVE_NO_MAGICLINKS **/ + /* Regular symlinks should work. */ + { .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.path = "etc/passwd", .pass = true }, + /* Magic-links should not work. */ + { .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "/proc/self/exe", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_MAGICLINKS, + .out.path = procselfexe, .pass = true }, + + /** RESOLVE_NO_SYMLINKS **/ + /* Normal paths should work. */ + { .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = NULL, .pass = true }, + { .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "root", .pass = true }, + { .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "etc", .pass = true }, + { .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "etc/passwd", .pass = true }, + /* Regular symlinks are blocked. */ + { .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + /* Trailing symlinks with NO_FOLLOW. */ + { .path = "relsym", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "relsym", .pass = true }, + { .path = "abssym", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "abssym", .pass = true }, + { .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.path = "cheeky/garbagelink", .pass = true }, + { .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + { .path = "abscheeky/absself", .how.flags = O_NOFOLLOW, + .how.resolve = RESOLVE_NO_SYMLINKS, + .out.err = -ELOOP, .pass = false }, + }; + + for (int i = 0; i < ARRAY_LEN(tests); i++) { + int dfd, fd; + bool failed; + void (*resultfn)(const char *msg, ...) = ksft_test_result_pass; + struct basic_test *test = &tests[i]; + char *flagstr; + + /* Auto-set O_PATH. */ + if (!(test->how.flags & O_CREAT)) + test->how.flags |= O_PATH; + flagstr = openat2_flags(&test->how); + + if (test->dir) + dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY); + else + dfd = dup(rootfd); + if (dfd < 0) { + resultfn = ksft_test_result_error; + goto next; + } + + fd = sys_openat2(dfd, test->path, &test->how); + if (test->pass) + failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path)); + else + failed = (fd != test->out.err); + if (fd >= 0) + close(fd); + close(dfd); + + if (failed) + resultfn = ksft_test_result_fail; + +next: + if (test->pass) + resultfn("openat2(root[%s], %s, %s) ==> %s\n", + test->dir ?: ".", test->path, flagstr, + test->out.path ?: "."); + else + resultfn("openat2(root[%s], %s, %s) ==> %d (%s)\n", + test->dir ?: ".", test->path, flagstr, + test->out.err, strerror(-test->out.err)); + free(flagstr); + } + + free(procselfexe); + close(rootfd); +} + +int main(int argc, char **argv) +{ + ksft_print_header(); + test_openat2_supported(); + + /* NOTE: We should be checking for CAP_SYS_ADMIN here... */ + if (geteuid() != 0) + ksft_exit_skip("openat2(2) tests require euid == 0\n"); + + test_openat2_opath_tests(); + + if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0) + ksft_exit_fail(); + else + ksft_exit_pass(); +}
Hi Aleksa,
A few minor comments below.
Aleksa Sarai cyphar@cyphar.com writes:
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile new file mode 100644 index 000000000000..8235a49928f6 --- /dev/null +++ b/tools/testing/selftests/openat2/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wall -O2 -g +TEST_GEN_PROGS := linkmode_test resolve_test rename_attack_test
+include ../lib.mk
+$(OUTPUT)/linkmode_test: linkmode_test.c helpers.o +$(OUTPUT)/rename_attack_test: rename_attack_test.c helpers.o +$(OUTPUT)/resolve_test: resolve_test.c helpers.o
You don't need to tell make that foo depends on foo.c.
Also if you make the dependency be on helpers.c then you won't get an intermediate helpers.o, and then you don't need to clean it.
So the above three lines could just be:
$(TEST_GEN_PROGS): helpers.c
+EXTRA_CLEAN = helpers.o $(wildcard /tmp/ksft-openat2-*)
If you follow my advice above you don't need helpers.o in there.
Deleting things from /tmp is also a bit fishy on shared machines, ie. it will error if those files happen to be owned by another user.
cheers
On 2019-07-08, Michael Ellerman mpe@ellerman.id.au wrote:
Aleksa Sarai cyphar@cyphar.com writes:
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile new file mode 100644 index 000000000000..8235a49928f6 --- /dev/null +++ b/tools/testing/selftests/openat2/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wall -O2 -g +TEST_GEN_PROGS := linkmode_test resolve_test rename_attack_test
+include ../lib.mk
+$(OUTPUT)/linkmode_test: linkmode_test.c helpers.o +$(OUTPUT)/rename_attack_test: rename_attack_test.c helpers.o +$(OUTPUT)/resolve_test: resolve_test.c helpers.o
You don't need to tell make that foo depends on foo.c.
Also if you make the dependency be on helpers.c then you won't get an intermediate helpers.o, and then you don't need to clean it.
So the above three lines could just be:
$(TEST_GEN_PROGS): helpers.c
I had some trouble getting this to work (hence why I went with the version in the patch), but it looks like this works. I'll include it in the next set.
+EXTRA_CLEAN = helpers.o $(wildcard /tmp/ksft-openat2-*)
If you follow my advice above you don't need helpers.o in there.
Deleting things from /tmp is also a bit fishy on shared machines, ie. it will error if those files happen to be owned by another user.
Good point. I'll drop that hunk in the next set.
Thanks!
On Sun, Jul 07, 2019 at 12:57:27AM +1000, Aleksa Sarai wrote:
Patch changelog: v9: * Replace resolveat(2) with openat2(2). [Linus] * Output a warning to dmesg if may_open_magiclink() is violated. * Add an openat2(O_CREAT) testcase.
One general note for the future, BTW: for such series it's generally a good idea to put it into a public git tree somewhere and mention that in the announcement...
On 2019-07-12, Al Viro viro@zeniv.linux.org.uk wrote:
On Sun, Jul 07, 2019 at 12:57:27AM +1000, Aleksa Sarai wrote:
Patch changelog: v9: * Replace resolveat(2) with openat2(2). [Linus] * Output a warning to dmesg if may_open_magiclink() is violated. * Add an openat2(O_CREAT) testcase.
One general note for the future, BTW: for such series it's generally a good idea to put it into a public git tree somewhere and mention that in the announcement...
Sure, I'll mention it next time. For the record the tree is https://github.com/cyphar/linux/tree/resolveat/master
linux-kselftest-mirror@lists.linaro.org