The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
* Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH. * Mountpoint crossings are blocked by AT_XDEV. * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
[1]: https://lore.kernel.org/patchwork/patch/784221/ [2]: https://github.com/cyphar/filepath-securejoin
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
Add the following flags for path resolution. The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init).
* AT_BENEATH: Disallow ".." or absolute paths (either in the path or found during symlink resolution) to escape the starting point of name resolution, though ".." is permitted in cases like "foo/../bar". Relative symlinks are still allowed (as long as they don't escape the starting point).
* AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" (though find(1) doesn't walk upwards, the semantics seem obvious).
* AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution).
* AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies AT_NO_PROCLINK (obviously).
The AT_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV. Currently these are only enabled for the stat(2) family and the openat(2) family (the latter has its own brand of O_* flags with the same semantics). Ideally these flags would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
Cc: Andy Lutomirski luto@kernel.org Cc: Eric Biederman ebiederm@xmission.com Cc: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/fcntl.c | 2 +- fs/namei.c | 61 ++++++++++++++++++++++++++------ fs/open.c | 8 +++++ fs/stat.c | 13 +++++-- include/linux/fcntl.h | 3 +- include/linux/namei.h | 7 ++++ include/uapi/asm-generic/fcntl.h | 17 +++++++++ include/uapi/linux/fcntl.h | 8 +++++ 8 files changed, 104 insertions(+), 15 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index 4137d96534a6..e343618736f7 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index fb913148d4d1..757dd783771c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -859,6 +859,8 @@ static int nd_jump_root(struct nameidata *nd) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; nd->flags |= LOOKUP_JUMPED; return 0; } @@ -1083,14 +1085,19 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); } + /* If we just jumped it was because of a procfs-style link. */ + if (unlikely(nd->flags & LOOKUP_JUMPED) && + unlikely(nd->flags & LOOKUP_NO_PROCLINKS)) + return ERR_PTR(-ELOOP); if (IS_ERR_OR_NULL(res)) return res; } if (*res == '/') { if (!nd->root.mnt) set_root(nd); - if (unlikely(nd_jump_root(nd))) - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); while (unlikely(*++res == '/')) ; } @@ -1271,12 +1278,16 @@ static int follow_managed(struct path *path, struct nameidata *nd) break; }
- if (need_mntput && path->mnt == mnt) - mntput(path->mnt); + if (need_mntput) { + if (path->mnt == mnt) + mntput(path->mnt); + if (unlikely(nd->flags & LOOKUP_XDEV)) + ret = -EXDEV; + else + nd->flags |= LOOKUP_JUMPED; + } if (ret == -EISDIR || !ret) ret = 1; - if (need_mntput) - nd->flags |= LOOKUP_JUMPED; if (unlikely(ret < 0)) path_put_conditional(path, nd); return ret; @@ -1333,6 +1344,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, mounted = __lookup_mnt(path->mnt, path->dentry); if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return false; path->mnt = &mounted->mnt; path->dentry = mounted->mnt.mnt_root; nd->flags |= LOOKUP_JUMPED; @@ -1353,8 +1366,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode;
while (1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1379,6 +1395,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt; @@ -1481,8 +1499,11 @@ static int path_parent_directory(struct path *path) static int follow_dotdot(struct nameidata *nd) { while(1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -1491,6 +1512,8 @@ static int follow_dotdot(struct nameidata *nd) } if (!follow_up(&nd->path)) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; } follow_mount(&nd->path); nd->inode = nd->path.dentry->d_inode; @@ -1720,6 +1743,8 @@ static int pick_link(struct nameidata *nd, struct path *link, { int error; struct saved *last; + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) + return -ELOOP; if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) { path_to_nameidata(link, nd); return -ELOOP; @@ -2175,6 +2200,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
if (!*s) flags &= ~LOOKUP_RCU; + if (flags & LOOKUP_NO_SYMLINKS) + flags |= LOOKUP_NO_PROCLINKS; if (flags & LOOKUP_RCU) rcu_read_lock();
@@ -2204,10 +2231,12 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->m_seq = read_seqbegin(&mount_lock); if (*s == '/') { + int error; set_root(nd); - if (likely(!nd_jump_root(nd))) - return s; - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + s = ERR_PTR(error); + return s; } else if (nd->dfd == AT_FDCWD) { if (flags & LOOKUP_RCU) { struct fs_struct *fs = current->fs; @@ -2223,6 +2252,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } + if (unlikely(flags & LOOKUP_BENEATH)) { + nd->root = nd->path; + if (!(flags & LOOKUP_RCU)) + path_get(&nd->root); + } return s; } else { /* Caller must check execute permissions on the starting path component */ @@ -2247,6 +2281,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } + if (unlikely(flags & LOOKUP_BENEATH)) { + nd->root = nd->path; + if (!(flags & LOOKUP_RCU)) + path_get(&nd->root); + } fdput(f); return s; } diff --git a/fs/open.c b/fs/open.c index 0285ce7dbd51..80f5f566a5ff 100644 --- a/fs/open.c +++ b/fs/open.c @@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_DIRECTORY; if (!(flags & O_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW; + if (flags & O_BENEATH) + lookup_flags |= LOOKUP_BENEATH; + if (flags & O_XDEV) + lookup_flags |= LOOKUP_XDEV; + if (flags & O_NOPROCLINKS) + lookup_flags |= LOOKUP_NO_PROCLINKS; + if (flags & O_NOSYMLINKS) + lookup_flags |= LOOKUP_NO_SYMLINKS; op->lookup_flags = lookup_flags; return 0; } diff --git a/fs/stat.c b/fs/stat.c index f8e6fb2c3657..791e61b916ae 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -170,8 +170,9 @@ int vfs_statx(int dfd, const char __user *filename, int flags, int error = -EINVAL; unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
- if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | - AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0) + if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | + KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV | + AT_NO_PROCLINKS | AT_NO_SYMLINKS)) return -EINVAL;
if (flags & AT_SYMLINK_NOFOLLOW) @@ -180,6 +181,14 @@ int vfs_statx(int dfd, const char __user *filename, int flags, lookup_flags &= ~LOOKUP_AUTOMOUNT; if (flags & AT_EMPTY_PATH) lookup_flags |= LOOKUP_EMPTY; + if (flags & AT_BENEATH) + lookup_flags |= LOOKUP_BENEATH; + if (flags & AT_XDEV) + lookup_flags |= LOOKUP_XDEV; + if (flags & AT_NO_PROCLINKS) + lookup_flags |= LOOKUP_NO_PROCLINKS; + if (flags & AT_NO_SYMLINKS) + lookup_flags |= LOOKUP_NO_SYMLINKS;
retry: error = user_path_at(dfd, filename, lookup_flags, &path); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 27dc7a60693e..ad5bba4b5b12 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -9,7 +9,8 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \ + O_NOPROCLINKS | O_NOSYMLINKS)
#ifndef force_o_largefile #define force_o_largefile() (BITS_PER_LONG != 32) diff --git a/include/linux/namei.h b/include/linux/namei.h index a78606e8e3df..5ff7f3362d1b 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_EMPTY 0x4000 #define LOOKUP_DOWN 0x8000
+/* Scoping flags for lookup. */ +#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */ +#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */ +#define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ +#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. + Implies LOOKUP_NO_PROCLINKS. */ + extern int path_pts(struct path *path);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty); diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..c2bf5983e46a 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -97,6 +97,23 @@ #define O_NDELAY O_NONBLOCK #endif
+/* + * These are identical to their AT_* counterparts (which affect the entireity + * of path resolution). + */ +#ifndef O_BENEATH +#define O_BENEATH 00040000000 /* *Not* the same as capsicum's O_BENEATH! */ +#endif +#ifndef O_XDEV +#define O_XDEV 00100000000 +#endif +#ifndef O_NOPROCLINKS +#define O_NOPROCLINKS 00200000000 +#endif +#ifndef O_NOSYMLINKS +#define O_NOSYMLINKS 01000000000 +#endif + #define F_DUPFD 0 /* dup */ #define F_GETFD 1 /* get close_on_exec */ #define F_SETFD 2 /* set/clear close_on_exec */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 594b85f7cb86..551a9e2166a8 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -92,5 +92,13 @@
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+/* Flags which affect path *resolution*, not just last-component handling. */ +#define AT_BENEATH 0x10000 /* No absolute paths or ".." escaping + (in-path or through symlinks) */ +#define AT_XDEV 0x20000 /* No mountpoint crossing. */ +#define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ +#define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. + Implies AT_NO_PROCLINKS. */ +
#endif /* _UAPI_LINUX_FCNTL_H */
On Sat, Sep 29, 2018 at 08:34:51PM +1000, Aleksa Sarai wrote:
Add the following flags for path resolution. The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init).
AT_BENEATH: Disallow ".." or absolute paths (either in the path or found during symlink resolution) to escape the starting point of name resolution, though ".." is permitted in cases like "foo/../bar". Relative symlinks are still allowed (as long as they don't escape the starting point).
AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" (though find(1) doesn't walk upwards, the semantics seem obvious).
AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution).
AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies AT_NO_PROCLINK (obviously).
The AT_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV. Currently these are only enabled for the stat(2) family and the openat(2) family (the latter has its own brand of O_* flags with the same semantics). Ideally these flags would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
Cc: Andy Lutomirski luto@kernel.org Cc: Eric Biederman ebiederm@xmission.com Cc: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com
Not to be a stickler about protocol but given that this is based heavily on ideas from prior patchsets and suggestions as you mentioned it might make sense to cite them as links and also maybe add some Suggested-by lines for some of the authors for the sake of posterity. :)
[1]: https://lwn.net/Articles/721443/ [2]: https://lwn.net/Articles/723057/
fs/fcntl.c | 2 +- fs/namei.c | 61 ++++++++++++++++++++++++++------ fs/open.c | 8 +++++ fs/stat.c | 13 +++++-- include/linux/fcntl.h | 3 +- include/linux/namei.h | 7 ++++ include/uapi/asm-generic/fcntl.h | 17 +++++++++ include/uapi/linux/fcntl.h | 8 +++++ 8 files changed, 104 insertions(+), 15 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index 4137d96534a6..e343618736f7 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
- BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c index fb913148d4d1..757dd783771c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -859,6 +859,8 @@ static int nd_jump_root(struct nameidata *nd) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; }
- if (unlikely(nd->flags & LOOKUP_BENEATH))
nd->flags |= LOOKUP_JUMPED; return 0;return -EXDEV;
} @@ -1083,14 +1085,19 @@ const char *get_link(struct nameidata *nd) } else { res = get(dentry, inode, &last->done); }
/* If we just jumped it was because of a procfs-style link. */
if (unlikely(nd->flags & LOOKUP_JUMPED) &&
unlikely(nd->flags & LOOKUP_NO_PROCLINKS))
if (IS_ERR_OR_NULL(res)) return res; } if (*res == '/') { if (!nd->root.mnt) set_root(nd);return ERR_PTR(-ELOOP);
if (unlikely(nd_jump_root(nd)))
return ERR_PTR(-ECHILD);
error = nd_jump_root(nd);
if (unlikely(error))
while (unlikely(*++res == '/')) ; }return ERR_PTR(error);
@@ -1271,12 +1278,16 @@ static int follow_managed(struct path *path, struct nameidata *nd) break; }
- if (need_mntput && path->mnt == mnt)
mntput(path->mnt);
- if (need_mntput) {
if (path->mnt == mnt)
mntput(path->mnt);
if (unlikely(nd->flags & LOOKUP_XDEV))
ret = -EXDEV;
else
nd->flags |= LOOKUP_JUMPED;
- } if (ret == -EISDIR || !ret) ret = 1;
- if (need_mntput)
if (unlikely(ret < 0)) path_put_conditional(path, nd); return ret;nd->flags |= LOOKUP_JUMPED;
@@ -1333,6 +1344,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, mounted = __lookup_mnt(path->mnt, path->dentry); if (!mounted) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
path->mnt = &mounted->mnt; path->dentry = mounted->mnt.mnt_root; nd->flags |= LOOKUP_JUMPED;return false;
@@ -1353,8 +1366,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode; while (1) {
if (path_equal(&nd->path, &nd->root))
if (path_equal(&nd->path, &nd->root)) {
if (unlikely(nd->flags & LOOKUP_BENEATH))
return -EXDEV; break;
if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent;}
@@ -1379,6 +1395,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt;
@@ -1481,8 +1499,11 @@ static int path_parent_directory(struct path *path) static int follow_dotdot(struct nameidata *nd) { while(1) {
if (path_equal(&nd->path, &nd->root))
if (path_equal(&nd->path, &nd->root)) {
if (unlikely(nd->flags & LOOKUP_BENEATH))
return -EXDEV; break;
if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)}
@@ -1491,6 +1512,8 @@ static int follow_dotdot(struct nameidata *nd) } if (!follow_up(&nd->path)) break;
if (unlikely(nd->flags & LOOKUP_XDEV))
} follow_mount(&nd->path); nd->inode = nd->path.dentry->d_inode;return -EXDEV;
@@ -1720,6 +1743,8 @@ static int pick_link(struct nameidata *nd, struct path *link, { int error; struct saved *last;
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) { path_to_nameidata(link, nd); return -ELOOP;return -ELOOP;
@@ -2175,6 +2200,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags) if (!*s) flags &= ~LOOKUP_RCU;
- if (flags & LOOKUP_NO_SYMLINKS)
if (flags & LOOKUP_RCU) rcu_read_lock();flags |= LOOKUP_NO_PROCLINKS;
@@ -2204,10 +2231,12 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->m_seq = read_seqbegin(&mount_lock); if (*s == '/') {
set_root(nd);int error;
if (likely(!nd_jump_root(nd)))
return s;
return ERR_PTR(-ECHILD);
error = nd_jump_root(nd);
if (unlikely(error))
s = ERR_PTR(error);
} else if (nd->dfd == AT_FDCWD) { if (flags & LOOKUP_RCU) { struct fs_struct *fs = current->fs;return s;
@@ -2223,6 +2252,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; }
if (unlikely(flags & LOOKUP_BENEATH)) {
nd->root = nd->path;
if (!(flags & LOOKUP_RCU))
path_get(&nd->root);
return s; } else { /* Caller must check execute permissions on the starting path component */}
@@ -2247,6 +2281,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; }
if (unlikely(flags & LOOKUP_BENEATH)) {
nd->root = nd->path;
if (!(flags & LOOKUP_RCU))
path_get(&nd->root);
fdput(f); return s; }}
diff --git a/fs/open.c b/fs/open.c index 0285ce7dbd51..80f5f566a5ff 100644 --- a/fs/open.c +++ b/fs/open.c @@ -988,6 +988,14 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_DIRECTORY; if (!(flags & O_NOFOLLOW)) lookup_flags |= LOOKUP_FOLLOW;
- if (flags & O_BENEATH)
lookup_flags |= LOOKUP_BENEATH;
- if (flags & O_XDEV)
lookup_flags |= LOOKUP_XDEV;
- if (flags & O_NOPROCLINKS)
lookup_flags |= LOOKUP_NO_PROCLINKS;
- if (flags & O_NOSYMLINKS)
op->lookup_flags = lookup_flags; return 0;lookup_flags |= LOOKUP_NO_SYMLINKS;
} diff --git a/fs/stat.c b/fs/stat.c index f8e6fb2c3657..791e61b916ae 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -170,8 +170,9 @@ int vfs_statx(int dfd, const char __user *filename, int flags, int error = -EINVAL; unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
- if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0)
- if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH |
KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
return -EINVAL;AT_NO_PROCLINKS | AT_NO_SYMLINKS))
if (flags & AT_SYMLINK_NOFOLLOW) @@ -180,6 +181,14 @@ int vfs_statx(int dfd, const char __user *filename, int flags, lookup_flags &= ~LOOKUP_AUTOMOUNT; if (flags & AT_EMPTY_PATH) lookup_flags |= LOOKUP_EMPTY;
- if (flags & AT_BENEATH)
lookup_flags |= LOOKUP_BENEATH;
- if (flags & AT_XDEV)
lookup_flags |= LOOKUP_XDEV;
- if (flags & AT_NO_PROCLINKS)
lookup_flags |= LOOKUP_NO_PROCLINKS;
- if (flags & AT_NO_SYMLINKS)
lookup_flags |= LOOKUP_NO_SYMLINKS;
retry: error = user_path_at(dfd, filename, lookup_flags, &path); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 27dc7a60693e..ad5bba4b5b12 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -9,7 +9,8 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
O_NOPROCLINKS | O_NOSYMLINKS)
#ifndef force_o_largefile #define force_o_largefile() (BITS_PER_LONG != 32) diff --git a/include/linux/namei.h b/include/linux/namei.h index a78606e8e3df..5ff7f3362d1b 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -47,6 +47,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_EMPTY 0x4000 #define LOOKUP_DOWN 0x8000 +/* Scoping flags for lookup. */ +#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */ +#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */ +#define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ +#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*.
Implies LOOKUP_NO_PROCLINKS. */
extern int path_pts(struct path *path); extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty); diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..c2bf5983e46a 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -97,6 +97,23 @@ #define O_NDELAY O_NONBLOCK #endif +/*
- These are identical to their AT_* counterparts (which affect the entireity
- of path resolution).
- */
+#ifndef O_BENEATH +#define O_BENEATH 00040000000 /* *Not* the same as capsicum's O_BENEATH! */ +#endif +#ifndef O_XDEV +#define O_XDEV 00100000000 +#endif +#ifndef O_NOPROCLINKS +#define O_NOPROCLINKS 00200000000 +#endif +#ifndef O_NOSYMLINKS +#define O_NOSYMLINKS 01000000000 +#endif
#define F_DUPFD 0 /* dup */ #define F_GETFD 1 /* get close_on_exec */ #define F_SETFD 2 /* set/clear close_on_exec */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 594b85f7cb86..551a9e2166a8 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -92,5 +92,13 @@ #define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ +/* Flags which affect path *resolution*, not just last-component handling. */ +#define AT_BENEATH 0x10000 /* No absolute paths or ".." escaping
(in-path or through symlinks) */
+#define AT_XDEV 0x20000 /* No mountpoint crossing. */ +#define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ +#define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*.
Implies AT_NO_PROCLINKS. */
#endif /* _UAPI_LINUX_FCNTL_H */
2.19.0
On 2018-09-29, Christian Brauner christian@brauner.io wrote:
Cc: Andy Lutomirski luto@kernel.org Cc: Eric Biederman ebiederm@xmission.com Cc: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com
Not to be a stickler about protocol but given that this is based heavily on ideas from prior patchsets and suggestions as you mentioned it might make sense to cite them as links and also maybe add some Suggested-by lines for some of the authors for the sake of posterity. :)
Oops -- yup, I will fix that up.
On 2018-09-29, Aleksa Sarai cyphar@cyphar.com wrote:
- AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" (though find(1) doesn't walk upwards, the semantics seem obvious).
I've just noticed that the mountpoint-crossing code for AT_XDEV doesn't detect things like:
% ln -s / /tmp/jumpup % vfs_helper -o open -F xdev -d /tmp jumpup /
I will fix that in v2.
On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai cyphar@cyphar.com wrote:
Add the following flags for path resolution. The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init).
- AT_BENEATH: Disallow ".." or absolute paths (either in the path or found during symlink resolution) to escape the starting point of name resolution, though ".." is permitted in cases like "foo/../bar". Relative symlinks are still allowed (as long as they don't escape the starting point).
As I said on the other thread, I would strongly prefer an API that behaves along the lines of David Drysdale's old patch https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl... : Forbid any use of "..". This would also be more straightforward to implement safely. If that doesn't work for you, I would like it if you could at least make that an option. I would like it if this API could mitigate straightforward directory traversal bugs such as https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where a confused deputy attempts to access a path like "/mnt/media_rw/../../data" while intending to access a directory under "/mnt/media_rw".
AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" (though find(1) doesn't walk upwards, the semantics seem obvious).
AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution).
AT_BENEATH has to imply AT_NO_PROCLINK, right? Especially with the semantics you picked for AT_BENEATH. With the original O_BENEATH_ONLY semantics, it might be okay to not imply AT_NO_PROCLINK...
- AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies AT_NO_PROCLINK (obviously).
The AT_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV. Currently these are only enabled for the stat(2) family and the openat(2) family (the latter has its own brand of O_* flags with the same semantics). Ideally these flags would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
On Mon, Oct 01, 2018 at 02:28:03PM +0200, Jann Horn wrote:
On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai cyphar@cyphar.com wrote:
Add the following flags for path resolution. The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans).
This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init).
- AT_BENEATH: Disallow ".." or absolute paths (either in the path or found during symlink resolution) to escape the starting point of name resolution, though ".." is permitted in cases like "foo/../bar". Relative symlinks are still allowed (as long as they don't escape the starting point).
As I said on the other thread, I would strongly prefer an API that behaves along the lines of David Drysdale's old patch https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl... : Forbid any use of "..". This would also be more straightforward to implement safely. If that doesn't work for you, I would like it if you could at least make that an option. I would like it if this API could mitigate straightforward directory traversal bugs such as https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where a confused deputy attempts to access a path like "/mnt/media_rw/../../data" while intending to access a directory under "/mnt/media_rw".
Oh, the semantics for this changed in this patchset, hah. I was still on vacation so didn't get to look at it before it was sent out. From prior discussion I remember that the original intention actual was what you argue for. And the patchset should be as tight as possible. Having special cases where ".." is allowed just sounds like an invitation for userspace to get it wrong. Aleksa, did you have a specific use-case in mind that made you change this or was it already present in an earlier iteration of the patchset by someone else?
AT_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" (though find(1) doesn't walk upwards, the semantics seem obvious).
AT_NO_PROCLINK: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution).
AT_BENEATH has to imply AT_NO_PROCLINK, right? Especially with the semantics you picked for AT_BENEATH. With the original O_BENEATH_ONLY semantics, it might be okay to not imply AT_NO_PROCLINK...
- AT_NO_SYMLINK: Disallows symlink jumping *of any kind*. Implies AT_NO_PROCLINK (obviously).
The AT_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV. Currently these are only enabled for the stat(2) family and the openat(2) family (the latter has its own brand of O_* flags with the same semantics). Ideally these flags would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
On 2018-10-01, Christian Brauner christian@brauner.io wrote:
On Mon, Oct 01, 2018 at 02:28:03PM +0200, Jann Horn wrote:
On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai cyphar@cyphar.com wrote:
- AT_BENEATH: Disallow ".." or absolute paths (either in the path or found during symlink resolution) to escape the starting point of name resolution, though ".." is permitted in cases like "foo/../bar". Relative symlinks are still allowed (as long as they don't escape the starting point).
As I said on the other thread, I would strongly prefer an API that behaves along the lines of David Drysdale's old patch https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl... : Forbid any use of "..". This would also be more straightforward to implement safely. If that doesn't work for you, I would like it if you could at least make that an option. I would like it if this API could mitigate straightforward directory traversal bugs such as https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where a confused deputy attempts to access a path like "/mnt/media_rw/../../data" while intending to access a directory under "/mnt/media_rw".
Oh, the semantics for this changed in this patchset, hah. I was still on vacation so didn't get to look at it before it was sent out. From prior discussion I remember that the original intention actual was what you argue for. And the patchset should be as tight as possible. Having special cases where ".." is allowed just sounds like an invitation for userspace to get it wrong. Aleksa, did you have a specific use-case in mind that made you change this or was it already present in an earlier iteration of the patchset by someone else?
Al's original patchset allowed "..". A quick survey of my machine shows that there are 100k symlinks that contain ".." (~37% of all symlinks on my machine). This indicates to me that you would be restricting a large amount of reasonable resolutions because of this restriction.
I posted a proposed way to protect against ".." shenanigans. If it's turns out this is not possible, I'm okay with disallowing ".." (assuming Al is also okay with that).
On Tue, Oct 02, 2018 at 02:04:31AM +1000, Aleksa Sarai wrote:
On 2018-10-01, Christian Brauner christian@brauner.io wrote:
On Mon, Oct 01, 2018 at 02:28:03PM +0200, Jann Horn wrote:
On Sat, Sep 29, 2018 at 4:28 PM Aleksa Sarai cyphar@cyphar.com wrote:
- AT_BENEATH: Disallow ".." or absolute paths (either in the path or found during symlink resolution) to escape the starting point of name resolution, though ".." is permitted in cases like "foo/../bar". Relative symlinks are still allowed (as long as they don't escape the starting point).
As I said on the other thread, I would strongly prefer an API that behaves along the lines of David Drysdale's old patch https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl... : Forbid any use of "..". This would also be more straightforward to implement safely. If that doesn't work for you, I would like it if you could at least make that an option. I would like it if this API could mitigate straightforward directory traversal bugs such as https://bugs.chromium.org/p/project-zero/issues/detail?id=1583, where a confused deputy attempts to access a path like "/mnt/media_rw/../../data" while intending to access a directory under "/mnt/media_rw".
Oh, the semantics for this changed in this patchset, hah. I was still on vacation so didn't get to look at it before it was sent out. From prior discussion I remember that the original intention actual was what you argue for. And the patchset should be as tight as possible. Having special cases where ".." is allowed just sounds like an invitation for userspace to get it wrong. Aleksa, did you have a specific use-case in mind that made you change this or was it already present in an earlier iteration of the patchset by someone else?
Al's original patchset allowed "..". A quick survey of my machine shows that there are 100k symlinks that contain ".." (~37% of all symlinks on my machine). This indicates to me that you would be restricting a large amount of reasonable resolutions because of this restriction.
I posted a proposed way to protect against ".." shenanigans. If it's turns out this is not possible, I'm okay with disallowing ".." (assuming Al is also okay with that).
Sounds acceptable to me.
The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths. The already-existing AT_XDEV and AT_NO_PROCLINKS help defend against other potential attacks in a malicious rootfs scenario.
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
The most significant change in semantics with AT_THIS_ROOT is that *at(2) syscalls now no longer have the property that an absolute pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is specified). The reasoning behind this is that AT_THIS_ROOT necessarily has to chroot-scope symlinks with absolute paths to dirfd, and so doing it for the base path seems to be the most consistent behaviour (and also avoids foot-gunning users who want to chroot-scope paths that might be absolute).
Currently this is only enabled for the stat(2) and openat(2) family (the latter has its own flag O_THISROOT with the same semantics). Ideally this flag would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
[1]: https://github.com/cyphar/filepath-securejoin
Cc: Eric Biederman ebiederm@xmission.com Cc: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/fcntl.c | 2 +- fs/namei.c | 121 +++++++++++++++++-------------- fs/open.c | 2 + fs/stat.c | 4 +- include/linux/fcntl.h | 2 +- include/linux/namei.h | 1 + include/uapi/asm-generic/fcntl.h | 3 + include/uapi/linux/fcntl.h | 2 + 8 files changed, 81 insertions(+), 56 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index e343618736f7..4c36c5b9fdb9 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index 757dd783771c..1b984f0dbbb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd) } }
+/* + * Configure nd->path based on the nd->dfd. This is only used as part of + * path_init(). + */ +static inline int dirfd_path_init(struct nameidata *nd) +{ + if (nd->dfd == AT_FDCWD) { + if (nd->flags & LOOKUP_RCU) { + struct fs_struct *fs = current->fs; + unsigned seq; + + do { + seq = read_seqcount_begin(&fs->seq); + nd->path = fs->pwd; + nd->inode = nd->path.dentry->d_inode; + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); + } while (read_seqcount_retry(&fs->seq, seq)); + } else { + get_fs_pwd(current->fs, &nd->path); + nd->inode = nd->path.dentry->d_inode; + } + } else { + /* Caller must check execute permissions on the starting path component */ + struct fd f = fdget_raw(nd->dfd); + struct dentry *dentry; + + if (!f.file) + return -EBADF; + + dentry = f.file->f_path.dentry; + + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) { + fdput(f); + return -ENOTDIR; + } + + nd->path = f.file->f_path; + if (nd->flags & LOOKUP_RCU) { + nd->inode = nd->path.dentry->d_inode; + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); + } else { + path_get(&nd->path); + nd->inode = nd->path.dentry->d_inode; + } + fdput(f); + } + if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) { + nd->root = nd->path; + if (!(nd->flags & LOOKUP_RCU)) + path_get(&nd->root); + } + return 0; +} + /* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) { + int error; const char *s = nd->name->name;
if (!*s) @@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock); + if (unlikely(flags & LOOKUP_CHROOT)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } if (*s == '/') { - int error; - set_root(nd); + if (likely(!nd->root.mnt)) + set_root(nd); error = nd_jump_root(nd); if (unlikely(error)) s = ERR_PTR(error); return s; - } else if (nd->dfd == AT_FDCWD) { - if (flags & LOOKUP_RCU) { - struct fs_struct *fs = current->fs; - unsigned seq; - - do { - seq = read_seqcount_begin(&fs->seq); - nd->path = fs->pwd; - nd->inode = nd->path.dentry->d_inode; - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); - } while (read_seqcount_retry(&fs->seq, seq)); - } else { - get_fs_pwd(current->fs, &nd->path); - nd->inode = nd->path.dentry->d_inode; - } - if (unlikely(flags & LOOKUP_BENEATH)) { - nd->root = nd->path; - if (!(flags & LOOKUP_RCU)) - path_get(&nd->root); - } - return s; - } else { - /* Caller must check execute permissions on the starting path component */ - struct fd f = fdget_raw(nd->dfd); - struct dentry *dentry; - - if (!f.file) - return ERR_PTR(-EBADF); - - dentry = f.file->f_path.dentry; - - if (*s && unlikely(!d_can_lookup(dentry))) { - fdput(f); - return ERR_PTR(-ENOTDIR); - } - - nd->path = f.file->f_path; - if (flags & LOOKUP_RCU) { - nd->inode = nd->path.dentry->d_inode; - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); - } else { - path_get(&nd->path); - nd->inode = nd->path.dentry->d_inode; - } - if (unlikely(flags & LOOKUP_BENEATH)) { - nd->root = nd->path; - if (!(flags & LOOKUP_RCU)) - path_get(&nd->root); - } - fdput(f); - return s; } + if (likely(!nd->path.mnt)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } + return s; }
static const char *trailing_symlink(struct nameidata *nd) diff --git a/fs/open.c b/fs/open.c index 80f5f566a5ff..81d148f626cd 100644 --- a/fs/open.c +++ b/fs/open.c @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & O_NOSYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & O_THISROOT) + lookup_flags |= LOOKUP_CHROOT; op->lookup_flags = lookup_flags; return 0; } diff --git a/fs/stat.c b/fs/stat.c index 791e61b916ae..e8366e4812c3 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV | - AT_NO_PROCLINKS | AT_NO_SYMLINKS)) + AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT)) return -EINVAL;
if (flags & AT_SYMLINK_NOFOLLOW) @@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags, lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & AT_NO_SYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & AT_THIS_ROOT) + lookup_flags |= LOOKUP_CHROOT;
retry: error = user_path_at(dfd, filename, lookup_flags, &path); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index ad5bba4b5b12..95480cd4c09d 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \ - O_NOPROCLINKS | O_NOSYMLINKS) + O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
#ifndef force_o_largefile #define force_o_largefile() (BITS_PER_LONG != 32) diff --git a/include/linux/namei.h b/include/linux/namei.h index 5ff7f3362d1b..7ec9e2d84649 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. Implies LOOKUP_NO_PROCLINKS. */ +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index c2bf5983e46a..11206b0e927c 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -113,6 +113,9 @@ #ifndef O_NOSYMLINKS #define O_NOSYMLINKS 01000000000 #endif +#ifndef O_THISROOT +#define O_THISROOT 02000000000 +#endif
#define F_DUPFD 0 /* dup */ #define F_GETFD 1 /* get close_on_exec */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 551a9e2166a8..ea978457b68f 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -99,6 +99,8 @@ #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. Implies AT_NO_PROCLINKS. */ +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though + it is chroot-ed into dirfd. */
#endif /* _UAPI_LINUX_FCNTL_H */
With the addition of so many new scoping flags, it's necessary to have some sort of validation that they really work. There were no vfs self-tests in the past, so this also includes a basic framework that future VFS tests can use.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 +++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 ++++++++++++++++++ 11 files changed, 537 insertions(+) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index f1fe492c8e17..6f814e49071f 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -43,6 +43,7 @@ ifneq (1, $(quicktest)) TARGETS += timers endif TARGETS += user +TARGETS += vfs TARGETS += vm TARGETS += x86 TARGETS += zram diff --git a/tools/testing/selftests/vfs/.gitignore b/tools/testing/selftests/vfs/.gitignore new file mode 100644 index 000000000000..c57ebcba14c0 --- /dev/null +++ b/tools/testing/selftests/vfs/.gitignore @@ -0,0 +1 @@ +/vfs_helper diff --git a/tools/testing/selftests/vfs/Makefile b/tools/testing/selftests/vfs/Makefile new file mode 100644 index 000000000000..8ca3cef43dc3 --- /dev/null +++ b/tools/testing/selftests/vfs/Makefile @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +# Makefile for mount selftests. +CFLAGS = -Wall \ + -O2 \ + -I../../../../usr/include/ + +TEST_PROGS := $(wildcard tests/*.sh) +TEST_GEN_FILES := vfs_helper + +include ../lib.mk diff --git a/tools/testing/selftests/vfs/at_flags.h b/tools/testing/selftests/vfs/at_flags.h new file mode 100644 index 000000000000..a8ca8f689753 --- /dev/null +++ b/tools/testing/selftests/vfs/at_flags.h @@ -0,0 +1,40 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018 SUSE LLC. + */ + +#ifndef __AT_FLAGS_H__ +#define __AT_FLAGS_H__ + +/* These come from <uapi/asm-generic/linux/fcntl.h> */ +#ifndef O_BENEATH +# define O_BENEATH 00040000000 +# define O_XDEV 00100000000 +# define O_NOPROCLINKS 00200000000 +# define O_NOSYMLINKS 01000000000 +# define O_THISROOT 02000000000 +#endif +#ifndef AT_BENEATH +# define AT_BENEATH 0x8000 +# define AT_XDEV 0x10000 +# define AT_NO_PROCLINKS 0x20000 +# define AT_NO_SYMLINKS 0x40000 +# define AT_THIS_ROOT 0x80000 +#endif + +struct flag { + const char *name; + unsigned int at_flag, open_flag; +}; + +struct flag AT_FLAGS[] = { + { .name = "beneath", .at_flag = AT_BENEATH, .open_flag = O_BENEATH }, + { .name = "xdev", .at_flag = AT_XDEV, .open_flag = O_XDEV }, + { .name = "no_proclinks", .at_flag = AT_NO_PROCLINKS, .open_flag = O_NOPROCLINKS }, + { .name = "no_symlinks", .at_flag = AT_NO_SYMLINKS, .open_flag = O_NOSYMLINKS }, + { .name = "this_root", .at_flag = AT_THIS_ROOT, .open_flag = O_THISROOT }, + { 0 }, /* terminate */ +}; + +#endif /* !defined(__AT_FLAGS_H__) */ diff --git a/tools/testing/selftests/vfs/common.sh b/tools/testing/selftests/vfs/common.sh new file mode 100644 index 000000000000..82ac8ad2a5a5 --- /dev/null +++ b/tools/testing/selftests/vfs/common.sh @@ -0,0 +1,37 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +set -e -o pipefail + +tmpdir="$(mktemp -d --tmpdir vfs_test.XXXXXX)" +trap "rm -rf $tmpdir" EXIT + +root="$tmpdir/root" +mkdir -p "$root" + +function fail() { + echo "# not ok" "$@" + exit 1 +} + +ksft_skip=4 +function skip() { + echo "# skip" "$@" + exit "$ksft_skip" +} + +function run() { + local old_flags="$-" + set +eET + output="$("$@" 2>&1)" + status="$?" + set "-$old_flags" +} + +testrootdir="$(readlink -f "$(dirname "$BASH_SOURCE")")" +function vfs_helper() { + run "$testrootdir/vfs_helper" "$@" +} +vfs_ops=( "open" "stat" "lstat" ) diff --git a/tools/testing/selftests/vfs/tests/0001_at_beneath.sh b/tools/testing/selftests/vfs/tests/0001_at_beneath.sh new file mode 100755 index 000000000000..9a03b0953032 --- /dev/null +++ b/tools/testing/selftests/vfs/tests/0001_at_beneath.sh @@ -0,0 +1,72 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")" +source "$sourcedir/../common.sh" + +touch "$root/inside" +ln -s / "$root/rootlink" +ln -s .. "$root/dotdot" +ln -s "/../../../../../../$root" "$root/badlink" + +mkdir -p "$root/subdir" +ln -s ../inside "$root/subdir/dotdotinside" +ln -s ../subdir "$root/subdir/dotdotsubdir" +ln -s subdir "$root/subdirlink" +ln -s ../subdirlink/../../inside "$root/subdir/complexlink" + +for op in "${vfs_ops[@]}" +do + vfs_helper -o "$op" -F beneath -d "$root" .. + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/].." + + vfs_helper -o "$op" -F beneath -d "$root" ../root + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]../root" + + vfs_helper -o "$op" -F beneath -d "$root" dotdot/root + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]dotdot(=..)/root" + + vfs_helper -o "$op" -F beneath -d "$root" "$root" + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]/root" + + vfs_helper -o "$op" -F beneath -d "$root" rootlink + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op beneath [/root/]rootlink(=/)" + else + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]rootlink(=/)" + fi + + vfs_helper -o "$op" -F beneath -d "$root" rootlink/ + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]rootlink(=/)/" + + vfs_helper -o "$op" -F beneath -d "$root" "rootlink/$root" + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]rootlink(=/)/root" + + vfs_helper -o "$op" -F beneath -d "$root" badlink + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op beneath [/root/]badlink(=/../.../root)" + else + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]badlink(=/../.../root)" + fi + + vfs_helper -o "$op" -F beneath -d "$root" subdir/../inside + [[ "$status" -eq 0 ]] || fail "$op beneath [/root/]subdir/../inside" + + vfs_helper -o "$op" -F beneath -d "$root" subdir/dotdotinside + [[ "$status" -eq 0 ]] || fail "$op beneath [/root/]subdir/dotdotinside(=../inside)" + + vfs_helper -o "$op" -F beneath -d "$root" subdir/dotdotsubdir/ + [[ "$status" -eq 0 ]] || fail "$op beneath [/root/]subdir/dotdotsubdir(=../subdir)/" + + vfs_helper -o "$op" -F beneath -d "$root" subdir/complexlink + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op beneath [/root/]complexlink(=../subdirlink/../../inside)" + else + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op beneath [/root/]complexlink(=../subdirlink/../../inside)" + fi +done diff --git a/tools/testing/selftests/vfs/tests/0002_at_xdev.sh b/tools/testing/selftests/vfs/tests/0002_at_xdev.sh new file mode 100755 index 000000000000..06be58a8ffe7 --- /dev/null +++ b/tools/testing/selftests/vfs/tests/0002_at_xdev.sh @@ -0,0 +1,54 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")" +source "$sourcedir/../common.sh" + +( mountpoint -q "/tmp" ) || skip "/tmp is not a mountpoint" + +touch /tmp/foo + +ln -s /tmp "$root/link_tmp" + +for op in "${vfs_ops[@]}" +do + vfs_helper -o "$op" -F xdev -d / tmp/ + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/]tmp/" + + vfs_helper -o "$op" -F xdev -d / tmp/foo + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/]tmp/foo" + + vfs_helper -o "$op" -F xdev -d "$root" /tmp + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]/tmp" + + vfs_helper -o "$op" -F xdev -d "$root" /tmp/ + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]/tmp/" + + vfs_helper -o "$op" -F xdev -d "$root" /tmp/foo + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]/tmp/foo" + + vfs_helper -o "$op" -F xdev -d /tmp foo + [[ "$status" = 0 ]] || fail "$op xdev [/tmp/]foo" + + vfs_helper -o "$op" -F xdev -d /tmp .. + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/tmp/].." + + vfs_helper -o "$op" -F xdev -d /tmp ../ + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/tmp/]../" + + vfs_helper -o "$op" -F xdev -d /tmp ../tmp + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/tmp/]../tmp" + + vfs_helper -o "$op" -F xdev -d "$root" link_tmp + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op xdev [/root/]link_tmp(=/tmp)" + else + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]link_tmp(=/tmp)" + fi + + vfs_helper -o "$op" -F xdev -d "$root" link_tmp/ + [[ "$(errno "$status")" =~ "EXDEV "* ]] || fail "$op xdev [/root/]link_tmp(=/tmp)/" +done diff --git a/tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh b/tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh new file mode 100755 index 000000000000..41d9655a1e46 --- /dev/null +++ b/tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh @@ -0,0 +1,50 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")" +source "$sourcedir/../common.sh" + +[ -e "/proc/$$/cwd" ] || skip "/proc/$$/cwd doesn't exist" + +ln -s / "$root/testlink" + +for op in "${vfs_ops[@]}" +do + for flags in {no_proclinks,no_symlinks,"no_proclinks,no_symlinks"} + do + vfs_helper -o "$op" -F "$flags" "/proc/$$/stat" + [[ "$status" = 0 ]] || fail "$op $flags /proc/$$/stat" + + vfs_helper -o "$op" -F "$flags" "/proc/$$/cwd" + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op $flags /proc/$$/cwd" + else + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags /proc/$$/cwd" + fi + + vfs_helper -o "$op" -F "$flags" -d "$root" "testlink/" + if [[ "$flags" == "no_proclinks" ]] + then + [[ "$status" = 0 ]] || fail "$op $flags [/root/]testlink/" + else + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags [/root/]testlink/" + fi + + vfs_helper -o "$op" -F "$flags" "/proc/$$/cwd/" + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags /proc/$$/cwd/" + + vfs_helper -o "$op" -F "$flags" "/proc/$$/cwd/$BASH_SOURCE" + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags /proc/$$/cwd/$BASH_SOURCE" + + vfs_helper -o "$op" -F "$flags" -d "/proc/self" cwd + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op $flags [/proc/self/]cwd" + else + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op $flags [/proc/self]/cwd" + fi + done +done diff --git a/tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh b/tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh new file mode 100755 index 000000000000..f7ec7e37f06a --- /dev/null +++ b/tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh @@ -0,0 +1,49 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")" +source "$sourcedir/../common.sh" + +mkdir -p "$root/dir" +touch "$root/foo" + +ln -s . "$root/link_dot" +ln -s .. "$root/link_dotdot" +ln -s foo "$root/link_foo" + +for op in "${vfs_ops[@]}" +do + vfs_helper -o "$op" -F no_symlinks -d "$root" foo + [[ "$status" = 0 ]] || fail "$op no_symlinks [/root/]foo" + + vfs_helper -o "$op" -F no_symlinks -d "$root" ../root/foo + [[ "$status" = 0 ]] || fail "$op no_symlinks [/root/]../root/foo" + + vfs_helper -o "$op" -F no_symlinks -d "$root" link_foo + if [[ "$op" == "lstat" ]] + then + [[ "$status" = 0 ]] || fail "$op no_symlinks [/root/]link_foo(=foo)" + else + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]link_foo(=foo)" + fi + + vfs_helper -o "$op" -F no_proclinks -d "$root" link_foo + [[ "$status" = 0 ]] || fail "$op no_proclinks [/root/]link_foo(=foo)" + + vfs_helper -o "$op" -F no_symlinks -d "$root" link_dotdot/ + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]link_dotdot(=..)/" + + vfs_helper -o "$op" -F no_proclinks -d "$root" link_dotdot/ + [[ "$status" = 0 ]] || fail "$op no_proclinks [/root/]link_dotdot(=..)/" + + vfs_helper -o "$op" -F no_symlinks -d "$root" link_dot/dir + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]link_dot(=.)/dir" + + vfs_helper -o "$op" -F no_proclinks -d "$root" link_dot/dir + [[ "$status" = 0 ]] || fail "$op no_proclinks [/root/]link_dot(=.)/dir" + + vfs_helper -o "$op" -F no_symlinks -d "$root" ../root/link_dot/link_dotdot/root/dir + [[ "$(errno "$status")" =~ "ELOOP "* ]] || fail "$op no_symlinks [/root/]../root/link_dot(=.)/link_dotdot(=..)/root/dir" +done diff --git a/tools/testing/selftests/vfs/tests/0005_at_this_root.sh b/tools/testing/selftests/vfs/tests/0005_at_this_root.sh new file mode 100755 index 000000000000..aba23c28a7b7 --- /dev/null +++ b/tools/testing/selftests/vfs/tests/0005_at_this_root.sh @@ -0,0 +1,66 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# Author: Aleksa Sarai cyphar@cyphar.com +# Copyright (C) 2018 SUSE LLC. + +sourcedir="$(readlink -f "$(dirname "$BASH_SOURCE")")" +source "$sourcedir/../common.sh" + +mkdir -p "$root/var" "$root/etc" "$root/usr/bin" "$root/usr/local/bin" +ln -s bash "$root/usr/bin/sh" +ln -s ../../bin/bash "$root/usr/local/bin/bash" +ln -s /bin/sh "$root/usr/local/bin/sh" +ln -s ../bin3 "$root/var/bin" +ln -s /usr/bin "$root/bin" +ln -s /usr/local/bin "$root/bin4" +ln -s ../../../../../../../../../bin "$root/bin2" +ln -s /../../../../../../../../../bin "$root/bin3" +touch "$root/etc/passwd" "$root/usr/bin/bash" + +# How should each path be mapped to a host path, in the form +# 'path:hostpath[:hostpath_trailing]'. Everything is assumed to be ${root} +# prefixed. +host_mappings=( + # Basic paths. + "..:." + "/:." + "/../../../../../../:." + "../var/../../../../../etc/passwd:etc/passwd" + "/var/../../../../../etc/passwd:etc/passwd" + "/../../../../../../var/../../../../../etc/passwd:etc/passwd" + "etc/passwd:etc/passwd" + "/etc/passwd:etc/passwd" + + # Basic symlink paths. + "/bin/bash:usr/bin/bash" + "/bin/sh:usr/bin/bash:usr/bin/sh" + "/bin2/bash:usr/bin/bash" + "/bin2/sh:usr/bin/bash:usr/bin/sh" + "/bin3/sh:usr/bin/bash:usr/bin/sh" + "/bin3/bash:usr/bin/bash" + + # More complicated symlink paths. + "/bin4/../../local/bin/bash:usr/bin/bash:usr/local/bin/bash" + "/bin4/../../local/bin/sh:usr/bin/bash:usr/local/bin/sh" + "/bin4/../../../../../../../../../../usr/local/bin/bash:usr/bin/bash:usr/local/bin/bash" + "/bin4/../../../../../../../../../../usr/local/bin/sh:usr/bin/bash:usr/local/bin/sh" + "/bin/../../bin4/../../local/bin/bash:usr/bin/bash:usr/local/bin/bash" + "/bin/../../bin4/../../local/bin/sh:usr/bin/bash:usr/local/bin/sh" +) + +for op in "${vfs_ops[@]}" +do + for mapping in "${host_mappings[@]}" + do + IFS=":" read path hostpath hostpath_trailing <<< "$mapping" + [[ "$hostpath_trailing" ]] || export hostpath_trailing="$hostpath" + [[ "$op" == "lstat" ]] && export hostpath="$hostpath_trailing" + + # Compare with and without this_root... + vfs_helper -o "$op" -d "$root" "$hostpath" + old_status="$status" old_output="$output" + vfs_helper -o "$op" -F this_root -d "$root" "$path" + [[ "$status" = "$old_status" ]] || fail "$op this_root $path=$status neq $old_status" + [[ "$output" == "$old_output" ]] || fail "$op this_root $path=$output neq $old_output" + done +done diff --git a/tools/testing/selftests/vfs/vfs_helper.c b/tools/testing/selftests/vfs/vfs_helper.c new file mode 100644 index 000000000000..d67ec74a3fca --- /dev/null +++ b/tools/testing/selftests/vfs/vfs_helper.c @@ -0,0 +1,154 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Author: Aleksa Sarai cyphar@cyphar.com + * Copyright (C) 2018 SUSE LLC. + */ + +#define _GNU_SOURCE + +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <limits.h> + +#include "at_flags.h" +#include "../kselftest.h" + +#define bail(...) \ + do { \ + fprintf(stderr, __VA_ARGS__); \ + fputs("\n", stderr); \ + exit(1); \ + } while (0) + +extern char *__progname; +#define usage() \ + bail("usage: %s -o {open|stat|lstat} [-d <dirfd-path>] " \ + "[-F [<flag1>,...<flagN>] <path>", __progname) + +static unsigned int parse_at_flags(char *opts) +{ + char *opt, *saveptr = NULL; + unsigned int flags = 0; + + opt = strtok_r(opts, ",", &saveptr); + do { + unsigned int found = 0; + + if (!*opt) + continue; + for (struct flag *flag = AT_FLAGS; flag->name != NULL; flag++) { + if (!strcmp(opt, flag->name)) + found |= flag->at_flag; + } + if (!found) + bail("unknown openat(2) flag: %s", opt); + flags |= found; + } while ((opt = strtok_r(NULL, ",", &saveptr)) != NULL); + + return flags; +} + +int stat_wrapper(int dirfd, const char *pathname, unsigned int flags) +{ + struct stat st = {0}; + int err; + + err = fstatat(dirfd, pathname, &st, flags); + if (err < 0) + return err; + + printf("%lu:%lu\n", st.st_dev, st.st_ino); + return 0; +} + +int lstat_wrapper(int dirfd, const char *pathname, unsigned int flags) +{ + return stat_wrapper(dirfd, pathname, flags | AT_SYMLINK_NOFOLLOW); +} + +int openat_wrapper(int dirfd, const char *pathname, unsigned int flags) +{ + int fd; + char *fdpath = NULL, fullpath[PATH_MAX] = {0}; + + fd = openat(dirfd, pathname, flags); + if (fd < 0) + return fd; + + /* Print the fully-qualified path using /proc/pid/fd/... */ + if (asprintf(&fdpath, "/proc/self/fd/%d", fd) < 0) + bail("asprintf /proc/self/fd/%d: %m", fd); + if (readlink(fdpath, fullpath, PATH_MAX) < 0) + bail("readlink %s: %m", fdpath); + puts(fullpath); + return fd; +} + +int main(int argc, char **argv) +{ + int opt, ret, dirfd; + unsigned int flags = 0; + char *opstr = NULL, *dir_path = NULL, *path = NULL; + int (*opfunc)(int dirfd, const char *pathname, unsigned int flags); + + while ((opt = getopt(argc, argv, "o:d:F:")) != -1) { + switch (opt) { + case 'o': + opstr = optarg; + break; + case 'd': + dir_path = optarg; + break; + case 'F': + flags |= parse_at_flags(optarg); + break; + default: + usage(); + } + } + + argc -= optind; + argv += optind; + + if (argc != 1) + usage(); + path = argv[0]; + + if (!opstr) + usage(); + else if (!strcmp(opstr, "stat")) + opfunc = stat_wrapper; + else if (!strcmp(opstr, "lstat")) + opfunc = lstat_wrapper; + else if (!strcmp(opstr, "open")) + opfunc = openat_wrapper; + else + usage(); + + if (opfunc == openat_wrapper) { + unsigned int open_flags = 0; + + for (struct flag *flag = AT_FLAGS; flag->name != NULL; flag++) { + if (flags & flag->at_flag) + open_flags |= flag->open_flag; + } + flags = open_flags; + } + + dirfd = AT_FDCWD; + if (dir_path) { + dirfd = open(dir_path, O_PATH|O_DIRECTORY); + if (dirfd < 0) + bail("cannot open dir_path: %m"); + } + + ret = opfunc(dirfd, path, flags); + if (ret < 0) + ret = -errno; + return (ret < 0) ? -ret : 0; +}
+cc linux-api; please keep them in CC for future versions of the patch
On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai cyphar@cyphar.com wrote:
The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths. The already-existing AT_XDEV and AT_NO_PROCLINKS help defend against other potential attacks in a malicious rootfs scenario.
So, I really like the concept for patch 1 of this series (but haven't read the code yet); but I dislike this patch because of its footgun potential.
If this patch landed as-is, the manpage would need some big warning labels. chroot() basically provides no security guarantees at all; and yes, that includes that if you do `chroot(...); chdir("/"); open(attacker_controlled_path, ...);`, you can potentially end up opening a file outside the chroot. See https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-014-2015.txt for an example, where Qubes OS did pretty much that, and ended up with a potentially exploitable security bug because of that, where one VM, while performing a file transfer into another VM, could write outside of the transfer target directory. The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
1. You start the path walk and reach /A/B/C. 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. 3. Your path walk follows the first ".." up into /A. This is outside the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
(Yes, this means that if you run an SFTP server with OpenSSH's ChrootDirectory directive, you have to be very careful about these things.)
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()? I think something like this should work (except that you should add some error handling - omitted here because I'm lazy), assuming that the container runtime does NOT have CAP_SYS_ADMIN in the init namespace (otherwise it's easier). Of course, this is entirely untested, and probably won't compile because I screwed something up. :P But you should get the idea...
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix): // https://lore.kernel.org/lkml/8760z7fope.fsf@x220.int.ebiederm.org/ // Note that dumpability is per-mm, not per-process, // so this hack has the unfortunate side effect of preventing // unprivileged debugging of the container runtime. // Oh well. prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE); // Inform gcc that this particular syscall will effectively // return twice, just like vfork() or setjmp(). __attribute__((returns_twice)) long syscall_(long sysno, ...) = (void*)syscall; int result_fd = -1; // CLONE_FILES means we don't need to do fd passing, // we share the file descriptor table. // CLONE_VM means we don't have the cost of duplicating // our VMAs and page tables, and we don't have to mark // all our pagetable entries as readonly for copy-on-write. // CLONE_VFORK is a dirty hack to avoid having to // allocate a child stack. // Lack of SIGCHLD means we don't want to have to wait() // for the child. int child_pid = syscall_(__NR_clone, CLONE_FILES|CLONE_VM|CLONE_VFORK, 0, 0, 0, 0); if (child_pid == 0) { // Enter the container's user namespace; this allows us // to afterwards join its mount namespace even if we're // not capable in the init namespace. // (I believe that it should be possible to change the kernel // such that this is not required if you have set the // no-new-privs flag.) setns(container_user_ns_fd, CLONE_NEWUSER); // Entering the filesystem namespace automatically // moves us to that namespace's filesystem root. setns(container_fs_ns_fd, CLONE_NEWNS); result_fd = open(untrusted_container_path, ...); syscall(__NR_exit, 0); }
The most significant change in semantics with AT_THIS_ROOT is that *at(2) syscalls now no longer have the property that an absolute pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is specified). The reasoning behind this is that AT_THIS_ROOT necessarily has to chroot-scope symlinks with absolute paths to dirfd, and so doing it for the base path seems to be the most consistent behaviour (and also avoids foot-gunning users who want to chroot-scope paths that might be absolute).
Currently this is only enabled for the stat(2) and openat(2) family (the latter has its own flag O_THISROOT with the same semantics). Ideally this flag would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
Cc: Eric Biederman ebiederm@xmission.com Cc: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/fcntl.c | 2 +- fs/namei.c | 121 +++++++++++++++++-------------- fs/open.c | 2 + fs/stat.c | 4 +- include/linux/fcntl.h | 2 +- include/linux/namei.h | 1 + include/uapi/asm-generic/fcntl.h | 3 + include/uapi/linux/fcntl.h | 2 + 8 files changed, 81 insertions(+), 56 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index e343618736f7..4c36c5b9fdb9 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */
BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c index 757dd783771c..1b984f0dbbb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd) } }
+/*
- Configure nd->path based on the nd->dfd. This is only used as part of
- path_init().
- */
+static inline int dirfd_path_init(struct nameidata *nd) +{
if (nd->dfd == AT_FDCWD) {
if (nd->flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
do {
seq = read_seqcount_begin(&fs->seq);
nd->path = fs->pwd;
nd->inode = nd->path.dentry->d_inode;
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqcount_retry(&fs->seq, seq));
} else {
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
struct dentry *dentry;
if (!f.file)
return -EBADF;
dentry = f.file->f_path.dentry;
if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return -ENOTDIR;
}
nd->path = f.file->f_path;
if (nd->flags & LOOKUP_RCU) {
nd->inode = nd->path.dentry->d_inode;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
} else {
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
fdput(f);
}
if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) {
nd->root = nd->path;
if (!(nd->flags & LOOKUP_RCU))
path_get(&nd->root);
}
return 0;
+}
/* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) {
int error; const char *s = nd->name->name; if (!*s)
@@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
if (unlikely(flags & LOOKUP_CHROOT)) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
} if (*s == '/') {
int error;
set_root(nd);
if (likely(!nd->root.mnt))
set_root(nd); error = nd_jump_root(nd); if (unlikely(error)) s = ERR_PTR(error); return s;
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
do {
seq = read_seqcount_begin(&fs->seq);
nd->path = fs->pwd;
nd->inode = nd->path.dentry->d_inode;
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqcount_retry(&fs->seq, seq));
} else {
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
if (unlikely(flags & LOOKUP_BENEATH)) {
nd->root = nd->path;
if (!(flags & LOOKUP_RCU))
path_get(&nd->root);
}
return s;
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
struct dentry *dentry;
if (!f.file)
return ERR_PTR(-EBADF);
dentry = f.file->f_path.dentry;
if (*s && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return ERR_PTR(-ENOTDIR);
}
nd->path = f.file->f_path;
if (flags & LOOKUP_RCU) {
nd->inode = nd->path.dentry->d_inode;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
} else {
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
if (unlikely(flags & LOOKUP_BENEATH)) {
nd->root = nd->path;
if (!(flags & LOOKUP_RCU))
path_get(&nd->root);
}
fdput(f);
return s; }
if (likely(!nd->path.mnt)) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
}
return s;
}
static const char *trailing_symlink(struct nameidata *nd) diff --git a/fs/open.c b/fs/open.c index 80f5f566a5ff..81d148f626cd 100644 --- a/fs/open.c +++ b/fs/open.c @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & O_NOSYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS;
if (flags & O_THISROOT)
lookup_flags |= LOOKUP_CHROOT; op->lookup_flags = lookup_flags; return 0;
} diff --git a/fs/stat.c b/fs/stat.c index 791e61b916ae..e8366e4812c3 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
AT_NO_PROCLINKS | AT_NO_SYMLINKS))
AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT)) return -EINVAL; if (flags & AT_SYMLINK_NOFOLLOW)
@@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags, lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & AT_NO_SYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS;
if (flags & AT_THIS_ROOT)
lookup_flags |= LOOKUP_CHROOT;
retry: error = user_path_at(dfd, filename, lookup_flags, &path); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index ad5bba4b5b12..95480cd4c09d 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
O_NOPROCLINKS | O_NOSYMLINKS)
O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
#ifndef force_o_largefile #define force_o_largefile() (BITS_PER_LONG != 32) diff --git a/include/linux/namei.h b/include/linux/namei.h index 5ff7f3362d1b..7ec9e2d84649 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. Implies LOOKUP_NO_PROCLINKS. */ +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index c2bf5983e46a..11206b0e927c 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -113,6 +113,9 @@ #ifndef O_NOSYMLINKS #define O_NOSYMLINKS 01000000000 #endif +#ifndef O_THISROOT +#define O_THISROOT 02000000000 +#endif
#define F_DUPFD 0 /* dup */ #define F_GETFD 1 /* get close_on_exec */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 551a9e2166a8..ea978457b68f 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -99,6 +99,8 @@ #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. Implies AT_NO_PROCLINKS. */ +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though
it is chroot-ed into dirfd. */
#endif /* _UAPI_LINUX_FCNTL_H */
2.19.0
On Sep 29, 2018, at 9:35 AM, Jann Horn jannh@google.com wrote:
+cc linux-api; please keep them in CC for future versions of the patch
On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai cyphar@cyphar.com wrote: The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths. The already-existing AT_XDEV and AT_NO_PROCLINKS help defend against other potential attacks in a malicious rootfs scenario.
So, I really like the concept for patch 1 of this series (but haven't read the code yet); but I dislike this patch because of its footgun potential.
The code could do it differently: do the path walk and then, before accepting the result, walk back up and make sure the result is under the starting point.
This is *not* a full solution, though, since a walk above the root gas side effects on timing, various caches, and possibly network traffic, so it’s open to Spectre-like attacks in which a malicious container could use a runtime-initiated AT_THIS_ROOT to infer the existence of directories outside the container.
But what’s the container usecase? Any sane container is based on pivot_root or similar, so the runtime can just do the walk in the container context. IOW I’m a bit confused as to the exact intended use of the whole series. Can you elaborate?
On 2018-09-29, Andy Lutomirski luto@amacapital.net wrote:
On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai cyphar@cyphar.com wrote: The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths. The already-existing AT_XDEV and AT_NO_PROCLINKS help defend against other potential attacks in a malicious rootfs scenario.
So, I really like the concept for patch 1 of this series (but haven't read the code yet); but I dislike this patch because of its footgun potential.
The code could do it differently: do the path walk and then, before accepting the result, walk back up and make sure the result is under the starting point.
This is *not* a full solution, though, since a walk above the root gas side effects on timing, various caches, and possibly network traffic, so it’s open to Spectre-like attacks in which a malicious container could use a runtime-initiated AT_THIS_ROOT to infer the existence of directories outside the container.
I think that one way to solve this problem might be to have more strict checks on nd->root in follow_dotdot(). The problem here (as far as I can tell) is that ".." could end up skipping past the root because of a rename, however walking *down* into a path shouldn't be a problem (even absolute symlinks shouldn't be a problem because they will nd_jump_root and will land back in the root).
However, I'm not entirely sure what happens to nd->root if it gets renamed -- can you still safely do checks against it (we'd need to do some sort of is_descendant() check on the current path before we handle ".." in follow_dotdot).
That way, we wouldn't shouldn't have the spectre-like attack problem (since the attack would be halted at the ".." stage -- before the path walk can proceed into host paths). Would this be sufficient or is there a more serious issue I'm missing?
But what’s the container usecase? Any sane container is based on pivot_root or similar, so the runtime can just do the walk in the container context. IOW I’m a bit confused as to the exact intended use of the whole series. Can you elaborate?
I went into this in my response to Jann.
On 2018-09-29, Jann Horn jannh@google.com wrote:
The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this -- or does that only handle if a particular path component changes *while* it's being walked through? Is it possible for a path walk to succeed after a path component was unmounted (obviously you can't delete a directory path component since you'd get -ENOTEMPTY)?
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
Regarding chroot attacks, I was aware of the trivial chroot-open-chroot-fchdir attack but I was not aware that there was a rename attack for chroot. Thanks for bringing this up!
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Unless I'm missing something, would this not also affect using a mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through path component) -- or does MS_MOVE cause a rewalk in a way that rename does not?
I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I thought that bind-mounts would be an issue but you also get -EXDEV when trying to rename across bind-mounts even if they are on the same underlying filesystem). But AT_BENEATH might be a more bitter pill to swallow. I'm not sure.
In the usecase of container runtimes, we wouldn't generally be doing resolution of attacker-controlled paths but it still definitely doesn't hurt to consider this part of the threat model -- to avoid foot-gunning as you've said. (There also might be some nested-container cases where you might want to do that.)
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
(It should be noted that multi-threaded C runtimes have somewhat similar issues -- AFAIK you can technically only use AS-Safe glibc functions after a fork() but that's more of a theoretical concern here. If you just use raw syscalls there isn't an issue.)
As for why use setns() rather than pivot_root(), there are cases where you're operating on a container's image without a running container (think image extraction or snapshotting tools). In those cases, you would need to set up a dummy container process in order to setns() into its namespaces. You are right that setns() would be a better option if you want the truthful state of what mounts the container sees.
[I also don't like the idea of joining the user namespace of a malicious container unless it's necessary but that's probably just needless paranoia more than anything -- since you're not joining the pidns you aren't trivially addressable by a malicious container.]
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix):
Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks"). If you join a user namespace then processes within that user namespace won't have ptrace_may_access() permissions because your mm is owned by an ancestor user namespace -- only after exec() will you be traceable.
We still use PR_SET_DUMPABLE in runc but that's because we support older kernels (and people don't use user namespaces under Docker) but with user namespaces this should not be required anymore.
On Mon, Oct 1, 2018 at 7:44 AM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this -- or does that only handle if a particular path component changes *while* it's being walked through?
Eric Biederman should be able to talk about all this much better than me, but as far as I know, the LOOKUP_REVAL path is only for dealing with some special filesystems like procfs.
Is it possible for a path walk to succeed after a path component was unmounted (obviously you can't delete a directory path component since you'd get -ENOTEMPTY)?
I don't think so, but I'm not exactly a VFS expert.
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Oh, wait, what? I think I didn't notice that the semantics of AT_BENEATH changed like that since the original posting of David Drysdale's O_BENEATH_ONLY patch (https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl...). David's patch had nice, straightforward semantics, blocking any form of upwards traversal. Why was that changed? Does anyone actually want to use paths that contain ".." with AT_BENEATH? I would strongly prefer something that blocks any use of "..".
@Al: It looks like this already changed back when you posted https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/ ?
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
It seems to me like doing that would basically require looking at each node in the path walk twice? And it'd be difficult to guarantee forward progress unless you're willing to do some fairly heavy locking.
Regarding chroot attacks, I was aware of the trivial chroot-open-chroot-fchdir attack but I was not aware that there was a rename attack for chroot. Thanks for bringing this up!
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Unless I'm missing something, would this not also affect using a mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through path component) -- or does MS_MOVE cause a rewalk in a way that rename does not?
Hmm. Good point.
It looks to me like you probably wouldn't be able to walk up through a mountpoint in RCU mode after the mount hierarchy has changed (see follow_dotdot_rcu()), but it might work in refwalk mode.
Eric?
I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I thought that bind-mounts would be an issue but you also get -EXDEV when trying to rename across bind-mounts even if they are on the same underlying filesystem). But AT_BENEATH might be a more bitter pill to swallow. I'm not sure.
Which is part of why I strongly prefer the semantics from David Drysdale's O_BENEATH_ONLY.
In the usecase of container runtimes, we wouldn't generally be doing resolution of attacker-controlled paths but it still definitely doesn't hurt to consider this part of the threat model -- to avoid foot-gunning as you've said. (There also might be some nested-container cases where you might want to do that.)
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
If you insist on implementing every last bit of your code in Go, I guess.
(It should be noted that multi-threaded C runtimes have somewhat similar issues -- AFAIK you can technically only use AS-Safe glibc functions after a fork() but that's more of a theoretical concern here. If you just use raw syscalls there isn't an issue.)
As for why use setns() rather than pivot_root(), there are cases where you're operating on a container's image without a running container (think image extraction or snapshotting tools). In those cases, you would need to set up a dummy container process in order to setns() into its namespaces. You are right that setns() would be a better option if you want the truthful state of what mounts the container sees.
[I also don't like the idea of joining the user namespace of a malicious container unless it's necessary but that's probably just needless paranoia more than anything -- since you're not joining the pidns you aren't trivially addressable by a malicious container.]
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix):
Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks"). If you join a user namespace then processes within that user namespace won't have ptrace_may_access() permissions because your mm is owned by an ancestor user namespace -- only after exec() will you be traceable.
Nope. The code added in bfedb589252c only applies if `get_dumpable(mm) != SUID_DUMP_USER`.
Looking at the current version, you can see that `ptrace_has_cap(tcred->user_ns, mode)` is enough to ptrace a process that is not nondumpable.
We still use PR_SET_DUMPABLE in runc but that's because we support older kernels (and people don't use user namespaces under Docker) but with user namespaces this should not be required anymore.
On 2018-10-01, Jann Horn jannh@google.com wrote:
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Oh, wait, what? I think I didn't notice that the semantics of AT_BENEATH changed like that since the original posting of David Drysdale's O_BENEATH_ONLY patch (https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl...). David's patch had nice, straightforward semantics, blocking any form of upwards traversal. Why was that changed? Does anyone actually want to use paths that contain ".." with AT_BENEATH? I would strongly prefer something that blocks any use of "..".
@Al: It looks like this already changed back when you posted https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/ ?
Yes, I copied the semantics from Al's patchset. I don't know why he felt strongly that this was the best idea, but in my opinion allowing paths like "a/../b/../c" seems like it's quite useful because otherwise you wouldn't be able to operate on most distribution root filesystems (many have symlinks that have ".." components). Looking at my own (openSUSE) machine there are something like 100k such symlinks (~37% of symlinks on my machine).
While I do understand that the easiest way of solving this problem is to disallow ".." entirely with AT_BENEATH, given that support ".." has utility, I would like to know whether it's actually not possible to have this work safely.
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
It seems to me like doing that would basically require looking at each node in the path walk twice? And it'd be difficult to guarantee forward progress unless you're willing to do some fairly heavy locking.
I had another idea since I wrote my previous mail -- since the issue (at least the way I understand it) is that ".." can "skip" over nd->root because of the rename, what if we had some sort of is_descendant() check within follow_dotdot()? (FWIW, ".." already has some pretty interesting "hand-over-hand" locking semantics.) This should be effectively similar to how prepend_path() deals with a path that is not reachable from @root (I'm not sure if the locking is acceptable for the namei path though).
Since ".." with AT_THIS_ROOT (or AT_BENEATH) is not going to be the most common component type (and we only need to do these checks for those flags), I would think that the extra cost would not be _that_ awful.
Would this work?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
If you insist on implementing every last bit of your code in Go, I guess.
Fair enough, though I believe this would affect most multi-threaded programs as well (regardless of the language they're written in).
On Tue, Oct 02, 2018 at 02:18:33AM +1000, Aleksa Sarai wrote:
On 2018-10-01, Jann Horn jannh@google.com wrote:
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Oh, wait, what? I think I didn't notice that the semantics of AT_BENEATH changed like that since the original posting of David Drysdale's O_BENEATH_ONLY patch (https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@googl...). David's patch had nice, straightforward semantics, blocking any form of upwards traversal. Why was that changed? Does anyone actually want to use paths that contain ".." with AT_BENEATH? I would strongly prefer something that blocks any use of "..".
@Al: It looks like this already changed back when you posted https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/ ?
Yes, I copied the semantics from Al's patchset. I don't know why he felt strongly that this was the best idea, but in my opinion allowing paths like "a/../b/../c" seems like it's quite useful because otherwise you wouldn't be able to operate on most distribution root filesystems (many have symlinks that have ".." components). Looking at my own (openSUSE) machine there are something like 100k such symlinks (~37% of symlinks on my machine).
While I do understand that the easiest way of solving this problem is to disallow ".." entirely with AT_BENEATH, given that support ".." has utility, I would like to know whether it's actually not possible to have this work safely.
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
It seems to me like doing that would basically require looking at each node in the path walk twice? And it'd be difficult to guarantee forward progress unless you're willing to do some fairly heavy locking.
I had another idea since I wrote my previous mail -- since the issue (at least the way I understand it) is that ".." can "skip" over nd->root because of the rename, what if we had some sort of is_descendant() check within follow_dotdot()? (FWIW, ".." already has some pretty interesting "hand-over-hand" locking semantics.) This should be effectively similar to how prepend_path() deals with a path that is not reachable from @root (I'm not sure if the locking is acceptable for the namei path though).
Since ".." with AT_THIS_ROOT (or AT_BENEATH) is not going to be the most common component type (and we only need to do these checks for those flags), I would think that the extra cost would not be _that_ awful.
Would this work?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
If you insist on implementing every last bit of your code in Go, I guess.
Fair enough, though I believe this would affect most multi-threaded programs as well (regardless of the language they're written in).
(Depends on whether you do any explicit locking and have atfork handlers for your locks and so on. If you do a clone syscall directly to avoid having libc running any additional atfork handlers (flushing streams etc.) it's doable though not ideal.)
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this -- or does that only handle if a particular path component changes *while* it's being walked through? Is it possible for a path walk to succeed after a path component was unmounted (obviously you can't delete a directory path component since you'd get -ENOTEMPTY)?
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
Regarding chroot attacks, I was aware of the trivial chroot-open-chroot-fchdir attack but I was not aware that there was a rename attack for chroot. Thanks for bringing this up!
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Unless I'm missing something, would this not also affect using a mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through path component) -- or does MS_MOVE cause a rewalk in a way that rename does not?
I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I thought that bind-mounts would be an issue but you also get -EXDEV when trying to rename across bind-mounts even if they are on the same underlying filesystem). But AT_BENEATH might be a more bitter pill to swallow. I'm not sure.
In the usecase of container runtimes, we wouldn't generally be doing resolution of attacker-controlled paths but it still definitely doesn't hurt to consider this part of the threat model -- to avoid foot-gunning as you've said. (There also might be some nested-container cases where you might want to do that.)
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
(It should be noted that multi-threaded C runtimes have somewhat similar issues -- AFAIK you can technically only use AS-Safe glibc functions after a fork() but that's more of a theoretical concern here. If you just use raw syscalls there isn't an issue.)
As for why use setns() rather than pivot_root(), there are cases where you're operating on a container's image without a running container (think image extraction or snapshotting tools). In those cases, you would need to set up a dummy container process in order to setns() into its namespaces. You are right that setns() would be a better option if you want the truthful state of what mounts the container sees.
[I also don't like the idea of joining the user namespace of a malicious container unless it's necessary but that's probably just needless paranoia more than anything -- since you're not joining the pidns you aren't trivially addressable by a malicious container.]
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix):
Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks"). If you join a user namespace then processes within that user namespace won't have ptrace_may_access() permissions because your mm is owned by an ancestor user namespace -- only after exec() will you be traceable.
That is not _completely_ true. Iirc (Please someone do yell at me if I'm wrong!), this is as follows. You will in fact be dumpable as long as you don't set{g,u}id() to an effective uid that is different from the effective uid of the process that created the task. For example, if you clone(CLONE_NEWUSER) as an unprivileged user with uid and euid 1000 you are in fact dumpable and thus traceable *but* if you do a setuid(0) in the new task then you will end up with old->euid = 1000 and new->euid = 0 at which point the kernel will remove the dumpable flag and the creating process cannot trace you anymore (which has funny consequences for lsm isolation and sending fds around). Iiuc, The same logic applies when you do a setns() to another user namespace.
/* dumpability changes */ if (!uid_eq(old->euid, new->euid) || !gid_eq(old->egid, new->egid) || !uid_eq(old->fsuid, new->fsuid) || !gid_eq(old->fsgid, new->fsgid) || !cred_cap_issubset(old, new)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); <<< suid_dumpable == 0 at this point task->pdeath_signal = 0; smp_wmb(); }
We still use PR_SET_DUMPABLE in runc but that's because we support older kernels (and people don't use user namespaces under Docker) but with user namespaces this should not be required anymore.
-- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
On Mon, Oct 1, 2018 at 12:42 PM Christian Brauner christian@brauner.io wrote:
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this -- or does that only handle if a particular path component changes *while* it's being walked through? Is it possible for a path walk to succeed after a path component was unmounted (obviously you can't delete a directory path component since you'd get -ENOTEMPTY)?
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
Regarding chroot attacks, I was aware of the trivial chroot-open-chroot-fchdir attack but I was not aware that there was a rename attack for chroot. Thanks for bringing this up!
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Unless I'm missing something, would this not also affect using a mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through path component) -- or does MS_MOVE cause a rewalk in a way that rename does not?
I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I thought that bind-mounts would be an issue but you also get -EXDEV when trying to rename across bind-mounts even if they are on the same underlying filesystem). But AT_BENEATH might be a more bitter pill to swallow. I'm not sure.
In the usecase of container runtimes, we wouldn't generally be doing resolution of attacker-controlled paths but it still definitely doesn't hurt to consider this part of the threat model -- to avoid foot-gunning as you've said. (There also might be some nested-container cases where you might want to do that.)
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
(It should be noted that multi-threaded C runtimes have somewhat similar issues -- AFAIK you can technically only use AS-Safe glibc functions after a fork() but that's more of a theoretical concern here. If you just use raw syscalls there isn't an issue.)
As for why use setns() rather than pivot_root(), there are cases where you're operating on a container's image without a running container (think image extraction or snapshotting tools). In those cases, you would need to set up a dummy container process in order to setns() into its namespaces. You are right that setns() would be a better option if you want the truthful state of what mounts the container sees.
[I also don't like the idea of joining the user namespace of a malicious container unless it's necessary but that's probably just needless paranoia more than anything -- since you're not joining the pidns you aren't trivially addressable by a malicious container.]
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix):
Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks"). If you join a user namespace then processes within that user namespace won't have ptrace_may_access() permissions because your mm is owned by an ancestor user namespace -- only after exec() will you be traceable.
That is not _completely_ true. Iirc (Please someone do yell at me if I'm wrong!), this is as follows. You will in fact be dumpable as long as you don't set{g,u}id() to an effective uid that is different from the effective uid of the process that created the task. For example, if you clone(CLONE_NEWUSER) as an unprivileged user with uid and euid 1000 you are in fact dumpable and thus traceable *but* if you do a setuid(0) in the new task then you will end up with old->euid = 1000 and new->euid = 0 at which point the kernel will remove the dumpable flag and the creating process cannot trace you anymore (which has funny consequences for lsm isolation and sending fds around). Iiuc, The same logic applies when you do a setns() to another user namespace.
(Note that this is only true if your un-namespaced UID actually changes. If you create a user namespace and then write to its uid_map such that your namespaced UID is zero, that won't trigger this logic.)
On Mon, Oct 01, 2018 at 01:29:16PM +0200, Jann Horn wrote:
On Mon, Oct 1, 2018 at 12:42 PM Christian Brauner christian@brauner.io wrote:
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this -- or does that only handle if a particular path component changes *while* it's being walked through? Is it possible for a path walk to succeed after a path component was unmounted (obviously you can't delete a directory path component since you'd get -ENOTEMPTY)?
If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe.
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)?
Regarding chroot attacks, I was aware of the trivial chroot-open-chroot-fchdir attack but I was not aware that there was a rename attack for chroot. Thanks for bringing this up!
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
Unless I'm missing something, would this not also affect using a mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through path component) -- or does MS_MOVE cause a rewalk in a way that rename does not?
I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I thought that bind-mounts would be an issue but you also get -EXDEV when trying to rename across bind-mounts even if they are on the same underlying filesystem). But AT_BENEATH might be a more bitter pill to swallow. I'm not sure.
In the usecase of container runtimes, we wouldn't generally be doing resolution of attacker-controlled paths but it still definitely doesn't hurt to consider this part of the threat model -- to avoid foot-gunning as you've said. (There also might be some nested-container cases where you might want to do that.)
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
(It should be noted that multi-threaded C runtimes have somewhat similar issues -- AFAIK you can technically only use AS-Safe glibc functions after a fork() but that's more of a theoretical concern here. If you just use raw syscalls there isn't an issue.)
As for why use setns() rather than pivot_root(), there are cases where you're operating on a container's image without a running container (think image extraction or snapshotting tools). In those cases, you would need to set up a dummy container process in order to setns() into its namespaces. You are right that setns() would be a better option if you want the truthful state of what mounts the container sees.
[I also don't like the idea of joining the user namespace of a malicious container unless it's necessary but that's probably just needless paranoia more than anything -- since you're not joining the pidns you aren't trivially addressable by a malicious container.]
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix):
Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks"). If you join a user namespace then processes within that user namespace won't have ptrace_may_access() permissions because your mm is owned by an ancestor user namespace -- only after exec() will you be traceable.
That is not _completely_ true. Iirc (Please someone do yell at me if I'm wrong!), this is as follows. You will in fact be dumpable as long as you don't set{g,u}id() to an effective uid that is different from the effective uid of the process that created the task. For example, if you clone(CLONE_NEWUSER) as an unprivileged user with uid and euid 1000 you are in fact dumpable and thus traceable *but* if you do a setuid(0) in the new task then you will end up with old->euid = 1000 and new->euid = 0 at which point the kernel will remove the dumpable flag and the creating process cannot trace you anymore (which has funny consequences for lsm isolation and sending fds around). Iiuc, The same logic applies when you do a setns() to another user namespace.
(Note that this is only true if your un-namespaced UID actually changes. If you create a user namespace and then write to its uid_map such that your namespaced UID is zero, that won't trigger this logic.)
The way I figured this works is that it actually only applies to the case where you're creating a user namespace as an unprivileged user. And even in that case you will retain dumpability in two cases: 1. mapping userns id 0 to <my-unpriv-host-uid> and any identity mapping, i.e. 2. mapping <my-unpriv-host-uid> to <my-unpriv-host-uid>
If you're creating a user namespace as root and set up mappings you should always retain the dumpable flag (I might not be remembering all corner-cases atm.) if you don't muck with capability sets and so on.
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this
No.
...
Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit?
Lookups and renames can definitely proceed in parallel, and yes I suspect it would be difficult to get good performance and guaranteed forward progress if you required lookup of the full path to be atomic with respect to renames.
--b.
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.”
Anyway, it occurs to me that the real problem is that setns() and chroot() are both overkill for this use case. What’s needed is to start your walk from /proc/pid-in-container/root, with two twists:
1. Do the walk as though rooted at a directory. This is basically just your AT_THIS_ROOT, but the footgun is avoided because the dirfd you use is from a foreign namespace, and, except for symlinks to absolute paths, no amount of .. racing is going to escape the *namespace*.
2. Avoid /proc. It’s not just the *links* — you really don’t want to walk into /proc/self. *Maybe* procfs is already careful enough when mounted in a namespace?
On 2018-10-01, Andy Lutomirski luto@amacapital.net wrote:
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.”
Multi-threaded programs have a similar issue (though with Go it's much worse). If you fork a multi-threaded C program then you can only safely use AS-Safe glibc functions (those that are safe within a signal handler). But if you're just doing three syscalls this shouldn't be as big of a problem as Go where you can't even do said syscalls.
Anyway, it occurs to me that the real problem is that setns() and chroot() are both overkill for this use case.
I agree. My diversion to Go was to explain why it was particularly bad for cri-o/rkt/runc/Docker/etc.
What’s needed is to start your walk from /proc/pid-in-container/root, with two twists:
- Do the walk as though rooted at a directory. This is basically just
your AT_THIS_ROOT, but the footgun is avoided because the dirfd you use is from a foreign namespace, and, except for symlinks to absolute paths, no amount of .. racing is going to escape the *namespace*.
This is quite clever and I'll admit I hadn't thought of this. This definitely fixes the ".." issue, but as you've said it won't handle absolute symlinks (which means userspace has the same races that we currently have even if you assume that you have a container process already running -- CVE-2018-15664 is one of many examples of this).
(AT_THIS_ROOT using /proc/$container/root would in principle fix all of the mentioned issues -- but as I said before I'd like to see whether hardening ".." would be enough to solve the escape problem.)
- Avoid /proc. It’s not just the *links* — you really don’t want to
walk into /proc/self. *Maybe* procfs is already careful enough when mounted in a namespace?
I just tried it and /proc/self gives you -ENOENT. I believe this is because it does a check against the pid namespace that the procfs mount has pinned.
On Tue, Oct 2, 2018 at 12:32 AM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-10-01, Andy Lutomirski luto@amacapital.net wrote:
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.”
Multi-threaded programs have a similar issue (though with Go it's much worse). If you fork a multi-threaded C program then you can only safely use AS-Safe glibc functions (those that are safe within a signal handler). But if you're just doing three syscalls this shouldn't be as big of a problem as Go where you can't even do said syscalls.
Anyway, it occurs to me that the real problem is that setns() and chroot() are both overkill for this use case.
I agree. My diversion to Go was to explain why it was particularly bad for cri-o/rkt/runc/Docker/etc.
What’s needed is to start your walk from /proc/pid-in-container/root, with two twists:
- Do the walk as though rooted at a directory. This is basically just
your AT_THIS_ROOT, but the footgun is avoided because the dirfd you use is from a foreign namespace, and, except for symlinks to absolute paths, no amount of .. racing is going to escape the *namespace*.
This is quite clever and I'll admit I hadn't thought of this. This definitely fixes the ".." issue, but as you've said it won't handle absolute symlinks (which means userspace has the same races that we currently have even if you assume that you have a container process already running -- CVE-2018-15664 is one of many examples of this).
(AT_THIS_ROOT using /proc/$container/root would in principle fix all of the mentioned issues -- but as I said before I'd like to see whether hardening ".." would be enough to solve the escape problem.)
Hmm. Good point.
* Aleksa Sarai:
On 2018-10-01, Andy Lutomirski luto@amacapital.net wrote:
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.”
Multi-threaded programs have a similar issue (though with Go it's much worse). If you fork a multi-threaded C program then you can only safely use AS-Safe glibc functions (those that are safe within a signal handler). But if you're just doing three syscalls this shouldn't be as big of a problem as Go where you can't even do said syscalls.
The situation is a bit more complicated. There are many programs out there which use malloc and free (at least indirectly) after a fork, and we cannot break them. In glibc, we have a couple of subsystems which are put into a known state before calling the fork/clone system call if the application calls fork. The price we pay for that is a fork which is not POSIX-compliant because it is not async-signal-safe. Admittedly, other libcs chose different trade-offs.
However, what is the same across libcs is this: You cannot call the clone system call directly and get a fully working new process. Some things break. For example, for recursive mutexes, we need to know the TID of the current thread, and we cannot perform a system call to get it for performance reasons. So everyone has a TID cache for that. But the TID cache does not get reset when you bypass the fork implementation in libc, so you end up with subtle corruption bugs on TID reuse.
So I'd say that in most cases, the C situation is pretty much the same as the Go situation. If I recall correctly, the problem for Go is that it cannot call setns from Go code because it fails in the kernel for multi-threaded processes, and Go processes are already multi-threaded when user Go code runs.
On Sat, Oct 6, 2018 at 10:56 PM Florian Weimer fw@deneb.enyo.de wrote:
- Aleksa Sarai:
On 2018-10-01, Andy Lutomirski luto@amacapital.net wrote:
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()?
You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.”
Multi-threaded programs have a similar issue (though with Go it's much worse). If you fork a multi-threaded C program then you can only safely use AS-Safe glibc functions (those that are safe within a signal handler). But if you're just doing three syscalls this shouldn't be as big of a problem as Go where you can't even do said syscalls.
The situation is a bit more complicated. There are many programs out there which use malloc and free (at least indirectly) after a fork, and we cannot break them. In glibc, we have a couple of subsystems which are put into a known state before calling the fork/clone system call if the application calls fork. The price we pay for that is a fork which is not POSIX-compliant because it is not async-signal-safe. Admittedly, other libcs chose different trade-offs.
However, what is the same across libcs is this: You cannot call the clone system call directly and get a fully working new process. Some things break. For example, for recursive mutexes, we need to know the TID of the current thread, and we cannot perform a system call to get it for performance reasons. So everyone has a TID cache for that. But the TID cache does not get reset when you bypass the fork implementation in libc, so you end up with subtle corruption bugs on TID reuse.
Sure, but recursive mutexes etc. are very specific use-case. I'd even go so far to say that if you use mutexes + threads and then also fork in those threads you're hosed anyway. If you don't things get a little cleaner assuming you don't call library functions that use mutexes internally. Event then you might (sometimes at least) still get around most problems with atfork handlers (thought I really don't like him). But you know more about this then I do. :)
So I'd say that in most cases, the C situation is pretty much the same as the Go situation. If I recall correctly, the problem for Go is that it cannot call setns from Go code because it fails in the kernel for multi-threaded processes, and Go processes are already multi-threaded when user Go code runs.
That is true for *some* namespaces (user, mount) but not for all. For example, setns(CLONE_NEWNET) would be fine from go. But the go runtime thinks it's clever to clone a new thread in between entry and exit of a syscall. If you switch namespaces you might end up with a new thread that belongs to the wrong namespace which is very problematic. So you can either rely on calling some go magic that locks you to a specific os thread but that does only work in later go versions or you go the constructor route, i.e. you e.g. implement a (dummy) subcommand that you can call and that triggers the execution of a C function that is marked with __attribute__((constructor)) that runs before the go runtime and in which you can do setns(), fork() and friends (somewhat) safely. This has very bad performance and is a nasty hack but it's really unavoidable.
On Sat, Sep 29, 2018 at 06:35:17PM +0200, Jann Horn wrote:
+cc linux-api; please keep them in CC for future versions of the patch
On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai cyphar@cyphar.com wrote:
The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths. The already-existing AT_XDEV and AT_NO_PROCLINKS help defend against other potential attacks in a malicious rootfs scenario.
So, I really like the concept for patch 1 of this series (but haven't read the code yet); but I dislike this patch because of its footgun potential.
If this patch landed as-is, the manpage would need some big warning labels. chroot() basically provides no security guarantees at all; and yes, that includes that if you do `chroot(...); chdir("/"); open(attacker_controlled_path, ...);`, you can potentially end up opening a file outside the chroot. See https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-014-2015.txt for an example, where Qubes OS did pretty much that, and ended up with a potentially exploitable security bug because of that, where one VM, while performing a file transfer into another VM, could write outside of the transfer target directory. The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that.
(Yes, this means that if you run an SFTP server with OpenSSH's ChrootDirectory directive, you have to be very careful about these things.)
I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot.
I'm very much in favor of dropping AT_THIS_ROOT from this patch series at least for now and only land the first patch. The first patch is something that we really want and that it seems we can find a good design for. If AT_THIS_ROOT is a feature that still makes sense we can revisit it.
Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container.
Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()? I think something like this should work (except that you should add some error handling - omitted here because I'm lazy), assuming that the container runtime does NOT have CAP_SYS_ADMIN in the init namespace (otherwise it's easier). Of course, this is entirely untested, and probably won't compile because I screwed something up. :P But you should get the idea...
// Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net... // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix): // https://lore.kernel.org/lkml/8760z7fope.fsf@x220.int.ebiederm.org/ // Note that dumpability is per-mm, not per-process, // so this hack has the unfortunate side effect of preventing // unprivileged debugging of the container runtime. // Oh well. prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE); // Inform gcc that this particular syscall will effectively // return twice, just like vfork() or setjmp(). __attribute__((returns_twice)) long syscall_(long sysno, ...) = (void*)syscall; int result_fd = -1; // CLONE_FILES means we don't need to do fd passing, // we share the file descriptor table. // CLONE_VM means we don't have the cost of duplicating // our VMAs and page tables, and we don't have to mark // all our pagetable entries as readonly for copy-on-write. // CLONE_VFORK is a dirty hack to avoid having to // allocate a child stack. // Lack of SIGCHLD means we don't want to have to wait() // for the child. int child_pid = syscall_(__NR_clone, CLONE_FILES|CLONE_VM|CLONE_VFORK, 0, 0, 0, 0); if (child_pid == 0) { // Enter the container's user namespace; this allows us // to afterwards join its mount namespace even if we're // not capable in the init namespace. // (I believe that it should be possible to change the kernel // such that this is not required if you have set the // no-new-privs flag.) setns(container_user_ns_fd, CLONE_NEWUSER); // Entering the filesystem namespace automatically // moves us to that namespace's filesystem root. setns(container_fs_ns_fd, CLONE_NEWNS); result_fd = open(untrusted_container_path, ...); syscall(__NR_exit, 0); }
The most significant change in semantics with AT_THIS_ROOT is that *at(2) syscalls now no longer have the property that an absolute pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is specified). The reasoning behind this is that AT_THIS_ROOT necessarily has to chroot-scope symlinks with absolute paths to dirfd, and so doing it for the base path seems to be the most consistent behaviour (and also avoids foot-gunning users who want to chroot-scope paths that might be absolute).
Currently this is only enabled for the stat(2) and openat(2) family (the latter has its own flag O_THISROOT with the same semantics). Ideally this flag would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset).
Cc: Eric Biederman ebiederm@xmission.com Cc: Christian Brauner christian@brauner.io Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/fcntl.c | 2 +- fs/namei.c | 121 +++++++++++++++++-------------- fs/open.c | 2 + fs/stat.c | 4 +- include/linux/fcntl.h | 2 +- include/linux/namei.h | 1 + include/uapi/asm-generic/fcntl.h | 3 + include/uapi/linux/fcntl.h | 2 + 8 files changed, 81 insertions(+), 56 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index e343618736f7..4c36c5b9fdb9 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */
BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ !=
BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c index 757dd783771c..1b984f0dbbb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd) } }
+/*
- Configure nd->path based on the nd->dfd. This is only used as part of
- path_init().
- */
+static inline int dirfd_path_init(struct nameidata *nd) +{
if (nd->dfd == AT_FDCWD) {
if (nd->flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
do {
seq = read_seqcount_begin(&fs->seq);
nd->path = fs->pwd;
nd->inode = nd->path.dentry->d_inode;
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqcount_retry(&fs->seq, seq));
} else {
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
struct dentry *dentry;
if (!f.file)
return -EBADF;
dentry = f.file->f_path.dentry;
if (*nd->name->name && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return -ENOTDIR;
}
nd->path = f.file->f_path;
if (nd->flags & LOOKUP_RCU) {
nd->inode = nd->path.dentry->d_inode;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
} else {
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
fdput(f);
}
if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) {
nd->root = nd->path;
if (!(nd->flags & LOOKUP_RCU))
path_get(&nd->root);
}
return 0;
+}
/* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) {
int error; const char *s = nd->name->name; if (!*s)
@@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL;
nd->m_seq = read_seqbegin(&mount_lock);
if (unlikely(flags & LOOKUP_CHROOT)) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
} if (*s == '/') {
int error;
set_root(nd);
if (likely(!nd->root.mnt))
set_root(nd); error = nd_jump_root(nd); if (unlikely(error)) s = ERR_PTR(error); return s;
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
do {
seq = read_seqcount_begin(&fs->seq);
nd->path = fs->pwd;
nd->inode = nd->path.dentry->d_inode;
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqcount_retry(&fs->seq, seq));
} else {
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
if (unlikely(flags & LOOKUP_BENEATH)) {
nd->root = nd->path;
if (!(flags & LOOKUP_RCU))
path_get(&nd->root);
}
return s;
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
struct dentry *dentry;
if (!f.file)
return ERR_PTR(-EBADF);
dentry = f.file->f_path.dentry;
if (*s && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return ERR_PTR(-ENOTDIR);
}
nd->path = f.file->f_path;
if (flags & LOOKUP_RCU) {
nd->inode = nd->path.dentry->d_inode;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
} else {
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
if (unlikely(flags & LOOKUP_BENEATH)) {
nd->root = nd->path;
if (!(flags & LOOKUP_RCU))
path_get(&nd->root);
}
fdput(f);
return s; }
if (likely(!nd->path.mnt)) {
error = dirfd_path_init(nd);
if (unlikely(error))
return ERR_PTR(error);
}
return s;
}
static const char *trailing_symlink(struct nameidata *nd) diff --git a/fs/open.c b/fs/open.c index 80f5f566a5ff..81d148f626cd 100644 --- a/fs/open.c +++ b/fs/open.c @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & O_NOSYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS;
if (flags & O_THISROOT)
lookup_flags |= LOOKUP_CHROOT; op->lookup_flags = lookup_flags; return 0;
} diff --git a/fs/stat.c b/fs/stat.c index 791e61b916ae..e8366e4812c3 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags,
if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV |
AT_NO_PROCLINKS | AT_NO_SYMLINKS))
AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT)) return -EINVAL; if (flags & AT_SYMLINK_NOFOLLOW)
@@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags, lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & AT_NO_SYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS;
if (flags & AT_THIS_ROOT)
lookup_flags |= LOOKUP_CHROOT;
retry: error = user_path_at(dfd, filename, lookup_flags, &path); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index ad5bba4b5b12..95480cd4c09d 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \
O_NOPROCLINKS | O_NOSYMLINKS)
O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT)
#ifndef force_o_largefile #define force_o_largefile() (BITS_PER_LONG != 32) diff --git a/include/linux/namei.h b/include/linux/namei.h index 5ff7f3362d1b..7ec9e2d84649 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. Implies LOOKUP_NO_PROCLINKS. */ +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */
extern int path_pts(struct path *path);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index c2bf5983e46a..11206b0e927c 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -113,6 +113,9 @@ #ifndef O_NOSYMLINKS #define O_NOSYMLINKS 01000000000 #endif +#ifndef O_THISROOT +#define O_THISROOT 02000000000 +#endif
#define F_DUPFD 0 /* dup */ #define F_GETFD 1 /* get close_on_exec */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 551a9e2166a8..ea978457b68f 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -99,6 +99,8 @@ #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. Implies AT_NO_PROCLINKS. */ +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though
it is chroot-ed into dirfd. */
#endif /* _UAPI_LINUX_FCNTL_H */
2.19.0
On 2018-09-29, Jann Horn jannh@google.com wrote:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
I've been playing with this and I have the following patch, which according to my testing protects against attacks where ".." skips over nd->root. It abuses __d_path to figure out if nd->path can be resolved from nd->root (obviously a proper version of this patch would refactor __d_path so it could be used like this -- and would not return -EMULTIHOP).
I've also attached my reproducer. With it, I was seeing fairly constant breakouts before this patch and after it I didn't see a single breakout after running it overnight. Obviously this is not conclusive, but I'm hoping that it can show what my idea for protecting against ".." was.
Does this patch make sense? Or is there something wrong with it that I'm not seeing?
--8<-------------------------------------------------------------------
There is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases will be detected during ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root).
The use of __d_path here might seem suspect, however we don't mind if a path is moved from within the chroot to outside the chroot and we incorrectly decide it is safe (because at that point we are still within the set of files which were accessible at the beginning of resolution). However, we can fail resolution on the next path component if it remains outside of the root. A path which has always been outside nd->root during resolution will never be resolveable from nd->root and thus will always be blocked.
DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, purely as a debugging measure (so that you can see that the protection actually does something). Obviously in the proper patch this will return -EXDEV.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namei.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..c8349693d47b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -53,8 +53,8 @@ * The new code replaces the old recursive symlink resolution with * an iterative one (in case of non-nested symlink chains). It does * this with calls to <fs>_follow_link(). - * As a side effect, dir_namei(), _namei() and follow_link() are now - * replaced with a single function lookup_dentry() that can handle all + * As a side effect, dir_namei(), _namei() and follow_link() are now + * replaced with a single function lookup_dentry() that can handle all * the special cases of the former code. * * With the new dcache, the pathname is stored at each inode, at least as @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; } + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { + char *pathbuf, *pathptr; + + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!pathbuf) + return -ECHILD; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + if (!pathptr) + pathptr = ERR_PTR(-EMULTIHOP); + return PTR_ERR(pathptr); + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1510,6 +1524,20 @@ static int follow_dotdot(struct nameidata *nd) return -EXDEV; break; } + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { + char *pathbuf, *pathptr; + + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!pathbuf) + return -ENOMEM; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + if (!pathptr) + pathptr = ERR_PTR(-EMULTIHOP); + return PTR_ERR(pathptr); + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)
On Fri, Oct 05, 2018 at 02:26:11AM +1000, Aleksa Sarai wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
I've been playing with this and I have the following patch, which according to my testing protects against attacks where ".." skips over nd->root. It abuses __d_path to figure out if nd->path can be resolved from nd->root (obviously a proper version of this patch would refactor __d_path so it could be used like this -- and would not return -EMULTIHOP).
I've also attached my reproducer. With it, I was seeing fairly constant breakouts before this patch and after it I didn't see a single breakout after running it overnight. Obviously this is not conclusive, but I'm hoping that it can show what my idea for protecting against ".." was.
Does this patch make sense? Or is there something wrong with it that I'm not seeing?
Interesting. Apart from the abuse of __d_path() :) the question I'd have is whether this just minimizes the race window or if you can provide a sound argument that this actually can't happen anymore with this patch.
--8<-------------------------------------------------------------------
There is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases will be detected during ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root).
The use of __d_path here might seem suspect, however we don't mind if a path is moved from within the chroot to outside the chroot and we incorrectly decide it is safe (because at that point we are still within the set of files which were accessible at the beginning of resolution). However, we can fail resolution on the next path component if it remains outside of the root. A path which has always been outside nd->root during resolution will never be resolveable from nd->root and thus will always be blocked.
DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, purely as a debugging measure (so that you can see that the protection actually does something). Obviously in the proper patch this will return -EXDEV.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/namei.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..c8349693d47b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -53,8 +53,8 @@
- The new code replaces the old recursive symlink resolution with
- an iterative one (in case of non-nested symlink chains). It does
- this with calls to <fs>_follow_link().
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- the special cases of the former code.
- With the new dcache, the pathname is stored at each inode, at least as
@@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
char *pathbuf, *pathptr;
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
if (IS_ERR_OR_NULL(pathptr)) {
if (!pathptr)
pathptr = ERR_PTR(-EMULTIHOP);
return PTR_ERR(pathptr);
}
if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent;}
@@ -1510,6 +1524,20 @@ static int follow_dotdot(struct nameidata *nd) return -EXDEV; break; }
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
char *pathbuf, *pathptr;
pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
if (!pathbuf)
return -ENOMEM;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
if (IS_ERR_OR_NULL(pathptr)) {
if (!pathptr)
pathptr = ERR_PTR(-EMULTIHOP);
return PTR_ERR(pathptr);
}
if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)}
-- 2.19.0
-- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
I've been playing with this and I have the following patch, which according to my testing protects against attacks where ".." skips over nd->root. It abuses __d_path to figure out if nd->path can be resolved from nd->root (obviously a proper version of this patch would refactor __d_path so it could be used like this -- and would not return -EMULTIHOP).
I've also attached my reproducer. With it, I was seeing fairly constant breakouts before this patch and after it I didn't see a single breakout after running it overnight. Obviously this is not conclusive, but I'm hoping that it can show what my idea for protecting against ".." was.
Does this patch make sense? Or is there something wrong with it that I'm not seeing?
--8<-------------------------------------------------------------------
There is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases will be detected during ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root).
The use of __d_path here might seem suspect, however we don't mind if a path is moved from within the chroot to outside the chroot and we incorrectly decide it is safe (because at that point we are still within the set of files which were accessible at the beginning of resolution). However, we can fail resolution on the next path component if it remains outside of the root. A path which has always been outside nd->root during resolution will never be resolveable from nd->root and thus will always be blocked.
DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, purely as a debugging measure (so that you can see that the protection actually does something). Obviously in the proper patch this will return -EXDEV.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/namei.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..c8349693d47b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -53,8 +53,8 @@
- The new code replaces the old recursive symlink resolution with
- an iterative one (in case of non-nested symlink chains). It does
- this with calls to <fs>_follow_link().
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- the special cases of the former code.
- With the new dcache, the pathname is stored at each inode, at least as
@@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
char *pathbuf, *pathptr;
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
if (IS_ERR_OR_NULL(pathptr)) {
if (!pathptr)
pathptr = ERR_PTR(-EMULTIHOP);
return PTR_ERR(pathptr);
}
}
One somewhat problematic thing about this approach is that if someone tries to lookup "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some reason, you'll have quadratic runtime: For each "..", you'll have to walk up to the root.
On 2018-10-04, Jann Horn jannh@google.com wrote:
On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
I've been playing with this and I have the following patch, which according to my testing protects against attacks where ".." skips over nd->root. It abuses __d_path to figure out if nd->path can be resolved from nd->root (obviously a proper version of this patch would refactor __d_path so it could be used like this -- and would not return -EMULTIHOP).
I've also attached my reproducer. With it, I was seeing fairly constant breakouts before this patch and after it I didn't see a single breakout after running it overnight. Obviously this is not conclusive, but I'm hoping that it can show what my idea for protecting against ".." was.
Does this patch make sense? Or is there something wrong with it that I'm not seeing?
--8<-------------------------------------------------------------------
There is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases will be detected during ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root).
The use of __d_path here might seem suspect, however we don't mind if a path is moved from within the chroot to outside the chroot and we incorrectly decide it is safe (because at that point we are still within the set of files which were accessible at the beginning of resolution). However, we can fail resolution on the next path component if it remains outside of the root. A path which has always been outside nd->root during resolution will never be resolveable from nd->root and thus will always be blocked.
DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, purely as a debugging measure (so that you can see that the protection actually does something). Obviously in the proper patch this will return -EXDEV.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/namei.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..c8349693d47b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -53,8 +53,8 @@
- The new code replaces the old recursive symlink resolution with
- an iterative one (in case of non-nested symlink chains). It does
- this with calls to <fs>_follow_link().
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- the special cases of the former code.
- With the new dcache, the pathname is stored at each inode, at least as
@@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
char *pathbuf, *pathptr;
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
if (IS_ERR_OR_NULL(pathptr)) {
if (!pathptr)
pathptr = ERR_PTR(-EMULTIHOP);
return PTR_ERR(pathptr);
}
}
One somewhat problematic thing about this approach is that if someone tries to lookup "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some reason, you'll have quadratic runtime: For each "..", you'll have to walk up to the root.
What if we took rename_lock (call it nd->r_seq) at the start of the resolution, and then only tried the __d_path-style check
if (read_seqretry(&rename_lock, nd->r_seq) || read_seqretry(&mount_lock, nd->m_seq)) /* do the __d_path lookup. */
That way you would only hit the slow path if there were concurrent renames or mounts *and* you are doing a path resolution with AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does this (and after some testing it also appears to work).
I'm not sure if there's a way to always avoid the quadratic lookup without (significantly and probably unreasonably) changing how dcache invalidation works. And obviously using this slow path if there was _any_ rename on the _entire_ system is suboptimal, but I think it is a significant improvement.
Another possibility is to expand on Andy's suggestion to use /proc/$pid/root, and instead require AT_THIS_ROOT to use the root of a namespace as its dirfd (I'm not sure if there's a trivial way to detect this though). This wouldn't help with AT_BENEATH, but it should protect against ".." shenanigans without any ".." handling changes. (This is less ideal because it requires a container process, but it is another way of dealing with the issue.)
--- fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..12c9be175cb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -493,7 +493,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags; - unsigned seq, m_seq; + unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count; @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; } + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && + (read_seqretry(&rename_lock, nd->r_seq) || + read_seqretry(&mount_lock, nd->m_seq)))) { + char *pathbuf, *pathptr; + + nd->r_seq = read_seqbegin(&rename_lock); + /* Cannot take m_seq here. */ + + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!pathbuf) + return -ECHILD; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + int error = PTR_ERR_OR_ZERO(pathptr); + + if (!error) + error = nd_jump_root(nd); + return error; + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1510,6 +1531,27 @@ static int follow_dotdot(struct nameidata *nd) return -EXDEV; break; } + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && + (read_seqretry(&rename_lock, nd->r_seq) || + read_seqretry(&mount_lock, nd->m_seq)))) { + char *pathbuf, *pathptr; + + nd->r_seq = read_seqbegin(&rename_lock); + /* Cannot take m_seq here. */ + + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!pathbuf) + return -ENOMEM; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + int error = PTR_ERR_OR_ZERO(pathptr); + + if (!error) + error = nd_jump_root(nd); + return error; + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0; + nd->m_seq = read_seqbegin(&mount_lock); + nd->r_seq = read_seqbegin(&rename_lock); + if (flags & LOOKUP_ROOT) { struct dentry *root = nd->root.dentry; struct inode *inode = root->d_inode; @@ -2279,7 +2324,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) if (flags & LOOKUP_RCU) { nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); nd->root_seq = nd->seq; - nd->m_seq = read_seqbegin(&mount_lock); } else { path_get(&nd->path); } @@ -2290,7 +2334,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.mnt = NULL; nd->path.dentry = NULL;
- nd->m_seq = read_seqbegin(&mount_lock); if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) { error = dirfd_path_init(nd); if (unlikely(error))
On Fri, Oct 5, 2018 at 5:07 PM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-10-04, Jann Horn jannh@google.com wrote:
On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-09-29, Jann Horn jannh@google.com wrote:
You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following:
- You start the path walk and reach /A/B/C.
- The other process moves /A/B/C to /A/C. Your path walk is now at /A/C.
- Your path walk follows the first ".." up into /A. This is outside
the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot.
I've been playing with this and I have the following patch, which according to my testing protects against attacks where ".." skips over nd->root. It abuses __d_path to figure out if nd->path can be resolved from nd->root (obviously a proper version of this patch would refactor __d_path so it could be used like this -- and would not return -EMULTIHOP).
I've also attached my reproducer. With it, I was seeing fairly constant breakouts before this patch and after it I didn't see a single breakout after running it overnight. Obviously this is not conclusive, but I'm hoping that it can show what my idea for protecting against ".." was.
Does this patch make sense? Or is there something wrong with it that I'm not seeing?
--8<-------------------------------------------------------------------
There is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT);
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases will be detected during ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root).
The use of __d_path here might seem suspect, however we don't mind if a path is moved from within the chroot to outside the chroot and we incorrectly decide it is safe (because at that point we are still within the set of files which were accessible at the beginning of resolution). However, we can fail resolution on the next path component if it remains outside of the root. A path which has always been outside nd->root during resolution will never be resolveable from nd->root and thus will always be blocked.
DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, purely as a debugging measure (so that you can see that the protection actually does something). Obviously in the proper patch this will return -EXDEV.
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
fs/namei.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..c8349693d47b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -53,8 +53,8 @@
- The new code replaces the old recursive symlink resolution with
- an iterative one (in case of non-nested symlink chains). It does
- this with calls to <fs>_follow_link().
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- As a side effect, dir_namei(), _namei() and follow_link() are now
- replaced with a single function lookup_dentry() that can handle all
- the special cases of the former code.
- With the new dcache, the pathname is stored at each inode, at least as
@@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) {
char *pathbuf, *pathptr;
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
if (IS_ERR_OR_NULL(pathptr)) {
if (!pathptr)
pathptr = ERR_PTR(-EMULTIHOP);
return PTR_ERR(pathptr);
}
}
One somewhat problematic thing about this approach is that if someone tries to lookup "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some reason, you'll have quadratic runtime: For each "..", you'll have to walk up to the root.
What if we took rename_lock (call it nd->r_seq) at the start of the resolution, and then only tried the __d_path-style check
if (read_seqretry(&rename_lock, nd->r_seq) || read_seqretry(&mount_lock, nd->m_seq)) /* do the __d_path lookup. */
That way you would only hit the slow path if there were concurrent renames or mounts *and* you are doing a path resolution with AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does this (and after some testing it also appears to work).
Yeah, I think that might do the job.
I'm not sure if there's a way to always avoid the quadratic lookup without (significantly and probably unreasonably) changing how dcache invalidation works. And obviously using this slow path if there was _any_ rename on the _entire_ system is suboptimal, but I think it is a significant improvement.
Yeah, I think this is much better.
Another possibility is to expand on Andy's suggestion to use /proc/$pid/root, and instead require AT_THIS_ROOT to use the root of a namespace as its dirfd (I'm not sure if there's a trivial way to detect this though). This wouldn't help with AT_BENEATH, but it should protect against ".." shenanigans without any ".." handling changes. (This is less ideal because it requires a container process, but it is another way of dealing with the issue.)
(For container usecases, but not for a web server that uses AT_BENEATH.)
fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..12c9be175cb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -493,7 +493,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags;
unsigned seq, m_seq;
unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count;
@@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
(read_seqretry(&rename_lock, nd->r_seq) ||
read_seqretry(&mount_lock, nd->m_seq)))) {
char *pathbuf, *pathptr;
nd->r_seq = read_seqbegin(&rename_lock);
/* Cannot take m_seq here. */
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
You're doing this check before actually looking up the parent, right? So as long as I don't trigger the "path_equal(&nd->path, &nd->root)" check that you do for O_BENEATH, escaping up by one level is possible, right? You should probably move this check so that it happens after following "..".
(Also: I assume that you're going to get rid of that memory allocation in a future version.)
if (IS_ERR_OR_NULL(pathptr)) {
int error = PTR_ERR_OR_ZERO(pathptr);
if (!error)
error = nd_jump_root(nd);
return error;
}
} if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent;
@@ -1510,6 +1531,27 @@ static int follow_dotdot(struct nameidata *nd) return -EXDEV; break; }
if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
(read_seqretry(&rename_lock, nd->r_seq) ||
read_seqretry(&mount_lock, nd->m_seq)))) {
char *pathbuf, *pathptr;
nd->r_seq = read_seqbegin(&rename_lock);
/* Cannot take m_seq here. */
pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
if (!pathbuf)
return -ENOMEM;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
if (IS_ERR_OR_NULL(pathptr)) {
int error = PTR_ERR_OR_ZERO(pathptr);
if (!error)
error = nd_jump_root(nd);
return error;
}
}
Same problem as in the RCU case above.
if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)
@@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0;
nd->m_seq = read_seqbegin(&mount_lock);
nd->r_seq = read_seqbegin(&rename_lock);
This means that now, attempting to perform a lookup while something is holding the rename_lock will spin on the lock. I don't know whether that's a problem in practice though. Does anyone on this thread know whether this is problematic?
On 2018-10-05, Jann Horn jannh@google.com wrote:
What if we took rename_lock (call it nd->r_seq) at the start of the resolution, and then only tried the __d_path-style check
if (read_seqretry(&rename_lock, nd->r_seq) || read_seqretry(&mount_lock, nd->m_seq)) /* do the __d_path lookup. */
That way you would only hit the slow path if there were concurrent renames or mounts *and* you are doing a path resolution with AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does this (and after some testing it also appears to work).
Yeah, I think that might do the job.
*phew* I was all out of other ideas. :P
fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..12c9be175cb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -493,7 +493,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags;
unsigned seq, m_seq;
unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count;
@@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
(read_seqretry(&rename_lock, nd->r_seq) ||
read_seqretry(&mount_lock, nd->m_seq)))) {
char *pathbuf, *pathptr;
nd->r_seq = read_seqbegin(&rename_lock);
/* Cannot take m_seq here. */
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
You're doing this check before actually looking up the parent, right? So as long as I don't trigger the "path_equal(&nd->path, &nd->root)" check that you do for O_BENEATH, escaping up by one level is possible, right? You should probably move this check so that it happens after following "..".
Yup, you're right. I'll do that.
(Also: I assume that you're going to get rid of that memory allocation in a future version.)
Sure. Would you prefer adding some scratch space in nameidata, or that I change __d_path so it accepts NULL as the buffer (and thus it doesn't actually do any string operations)?
if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)
@@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0;
nd->m_seq = read_seqbegin(&mount_lock);
nd->r_seq = read_seqbegin(&rename_lock);
This means that now, attempting to perform a lookup while something is holding the rename_lock will spin on the lock. I don't know whether that's a problem in practice though. Does anyone on this thread know whether this is problematic?
I could make it so that we only take &rename_lock if (nd->flags & (FOLLOW_BENEATH | FOLLOW_CHROOT)), since it's not used outside of that path.
On Sat, Oct 6, 2018 at 4:10 AM Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-10-05, Jann Horn jannh@google.com wrote:
What if we took rename_lock (call it nd->r_seq) at the start of the resolution, and then only tried the __d_path-style check
if (read_seqretry(&rename_lock, nd->r_seq) || read_seqretry(&mount_lock, nd->m_seq)) /* do the __d_path lookup. */
That way you would only hit the slow path if there were concurrent renames or mounts *and* you are doing a path resolution with AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does this (and after some testing it also appears to work).
Yeah, I think that might do the job.
*phew* I was all out of other ideas. :P
fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..12c9be175cb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -493,7 +493,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags;
unsigned seq, m_seq;
unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count;
@@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; }
if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) &&
(read_seqretry(&rename_lock, nd->r_seq) ||
read_seqretry(&mount_lock, nd->m_seq)))) {
char *pathbuf, *pathptr;
nd->r_seq = read_seqbegin(&rename_lock);
/* Cannot take m_seq here. */
pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC);
if (!pathbuf)
return -ECHILD;
pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX);
kfree(pathbuf);
You're doing this check before actually looking up the parent, right? So as long as I don't trigger the "path_equal(&nd->path, &nd->root)" check that you do for O_BENEATH, escaping up by one level is possible, right? You should probably move this check so that it happens after following "..".
Yup, you're right. I'll do that.
(Also: I assume that you're going to get rid of that memory allocation in a future version.)
Sure. Would you prefer adding some scratch space in nameidata, or that I change __d_path so it accepts NULL as the buffer (and thus it doesn't actually do any string operations)?
Well, I think accepting a NULL buffer would be much cleaner; but keep in mind that I'm just someone making suggestions, Al Viro is the one who has to like your code. :P
if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)
@@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0;
nd->m_seq = read_seqbegin(&mount_lock);
nd->r_seq = read_seqbegin(&rename_lock);
This means that now, attempting to perform a lookup while something is holding the rename_lock will spin on the lock. I don't know whether that's a problem in practice though. Does anyone on this thread know whether this is problematic?
I could make it so that we only take &rename_lock if (nd->flags & (FOLLOW_BENEATH | FOLLOW_CHROOT)), since it's not used outside of that path.
I think that might be a sensible change; but as I said, I don't actually know whether it's necessary, and it would be very helpful if someone who actually knows commented on this.
On Sep 29, 2018, at 3:34 AM, Aleksa Sarai cyphar@cyphar.com wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
Seems useful.
- Mountpoint crossings are blocked by AT_XDEV.
Seems useful.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
So how do I disable following symlinks? ISTM the most natural way would be to have AT_NO_SYMLINKS, and to have that flag disable proc links.
On 2018-09-29, Andy Lutomirski luto@amacapital.net wrote:
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
Seems useful.
- Mountpoint crossings are blocked by AT_XDEV.
Seems useful.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
So how do I disable following symlinks? ISTM the most natural way would be to have AT_NO_SYMLINKS, and to have that flag disable proc links.
So, this patchset has both AT_NO_SYMLINKS and AT_NO_PROCLINKS.
* AT_NO_SYMLINKS blocks *all* symlinks (which is something Linus requested in the original thread[2] -- apparently this is something that would be useful to git even if wouldn't violate AT_BENEATH). This implies AT_NO_PROCLINKS.
* AT_NO_PROCLINKS only blocks procfs-style "symlinks" (filesystem "symlinks" that call nd_jump_link() themselves -- currently only procfs and nsfs).
The reason why we need AT_NO_PROCLINKS is that "proclinks"[*] allow for breaking-out of nd->root without a trivial way of detecting it (since the filesystem can manipulate nd->path almost arbitrarily outside of the control of VFS). Al Viro's original patchset[1] also blocked these but it was all included within AT_NO_JUMPS.
Requiring you to block *all* symlinks in order to block "proclinks" seems to be a bit overkill to me (especially if consider that AT_THIS_ROOT|AT_NO_PROCLINKS is definitely a usecase most container runtimes would be _very_ interested in -- while AT_NO_SYMLINKS will cause issues with most distribution images).
[*]: Sorry for the awful naming, I'm not sure what the correct name is (I've called them "super symlinks" in the past) -- if you have a better name please let me know!
[1]: https://lwn.net/Articles/721443/ [2]: https://marc.info/?l=linux-kernel&m=149394765324531&w=2
On Sep 29, 2018, at 8:45 AM, Aleksa Sarai cyphar@cyphar.com wrote:
On 2018-09-29, Andy Lutomirski luto@amacapital.net wrote:
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
Seems useful.
- Mountpoint crossings are blocked by AT_XDEV.
Seems useful.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
So how do I disable following symlinks? ISTM the most natural way would be to have AT_NO_SYMLINKS, and to have that flag disable proc links.
So, this patchset has both AT_NO_SYMLINKS and AT_NO_PROCLINKS.
And AT_THIS_ROOT, which is neat. Want to update your cover letter to include all of this? Or at I just reading the wrong thing?
- AT_NO_SYMLINKS blocks *all* symlinks (which is something Linus requested
in the original thread[2] -- apparently this is something that would be useful to git even if wouldn't violate AT_BENEATH). This implies AT_NO_PROCLINKS.
- AT_NO_PROCLINKS only blocks procfs-style "symlinks" (filesystem
"symlinks" that call nd_jump_link() themselves -- currently only procfs and nsfs).
Hmm. I’m not sure that blocking nsfs links is always what the container runtime wants, but the overall concept sounds quite useful. Maybe call it AT_NO_TELEPORT? Or AT_NO_MAGIC_LINKS?
Also, as a perhaps-silly suggestion: if you end up adding a new syscall, I can see a use for a mode that does the path walk but, rather than failing on a disallowed link, stops early and indicates where it stopped. Then web servers, samba, etc can more efficiently implement custom behavior when links are encountered. And it may also be useful to have a variant of AT_THIS_ROOT where trying to escape is an error instead of having it just get stuck at the root.
On Sat, Sep 29, 2018 at 09:34:24AM -0700, Andy Lutomirski wrote:
Also, as a perhaps-silly suggestion: if you end up adding a new syscall, I can see a use for a mode that does the path walk but, rather than failing on a disallowed link, stops early and indicates where it stopped. Then web servers, samba, etc can more efficiently implement custom behavior when links are encountered. And it may also be useful to have a variant of AT_THIS_ROOT where trying to escape is an error instead of having it just get stuck at the root.
AT_USER_LINKS indicating that userspace wants to resolve symlinks themselves?
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Imho, these flags are very much needed and they all are pretty useful not just for container runtimes but in general.
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
We should really make sure that we can't make due with openat() alone before adding a bunch of new syscalls. So there's no need to rush into this. :)
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
-- 2.19.0
On 2018-09-29, Christian Brauner christian@brauner.io wrote:
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
We should really make sure that we can't make due with openat() alone before adding a bunch of new syscalls. So there's no need to rush into this. :)
Yeah, I think that we could (mostly) make do with openat(2). We might need to have renameat(2) and a few others, but if we had more support for AT_EMPTY_PATH you should be able to just O_PATH|O_{BENEATH,XDEV,...} and then operate on the O_PATH fd.
On Sat, Sep 29, 2018 at 12:35 PM Aleksa Sarai cyphar@cyphar.com wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
It seems quite useful to me.
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
What do you think of an equivalent feature AT_NO_SYMLINKS flag for mount()?
I guess that would have made the fix for CVE-2017-1002101 in Kubernetes easier to write: https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
-- 2.19.0
On September 30, 2018 3:54:31 PM GMT+02:00, Alban Crequy alban@kinvolk.io wrote:
On Sat, Sep 29, 2018 at 12:35 PM Aleksa Sarai cyphar@cyphar.com wrote:
The need for some sort of control over VFS's path resolution (to
avoid
malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is
a
revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few
additions.
The most obvious change is that AT_NO_JUMPS has been split as
dicussed
in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link()
because it
allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS).
At
Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
It seems quite useful to me.
An additional improvement was made to AT_XDEV. The original
AT_NO_JUMPS
path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel
this
is not sane).
Currently I've only enabled these for openat(2) and the stat(2)
family.
I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to
send
those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of
*at(2)
syscalls.
What do you think of an equivalent feature AT_NO_SYMLINKS flag for mount()?
That's something we discussed but that would need to be part of the new mount API work by David. The current mount API doesn't take AT_* flags since it doesn't operate on fds and we're (sort of) out of mount flags.
I guess that would have made the fix for CVE-2017-1002101 in Kubernetes easier to write: https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/
One additional feature I've implemented is AT_THIS_ROOT (I imagine
this
is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch
itself
describes my reasoning, but the shortened version of the premise is
that
continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not
resolvable
in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics
for
path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are
no
other VFS tests in selftests, while there are some tests that look
like
generic VFS tests in xfstests). If you'd prefer them to be included
in
xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158
++++++++++++------
fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154
+++++++++++++++++
19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755
tools/testing/selftests/vfs/tests/0001_at_beneath.sh
create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755
tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
create mode 100755
tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
create mode 100755
tools/testing/selftests/vfs/tests/0005_at_this_root.sh
create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
-- 2.19.0
As a side note, I'm still working on Landlock which can achieve the same goal but in a more flexible and dynamic way: https://landlock.io
Regards, Mickaël
On 9/29/18 12:34, Aleksa Sarai wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
On Sun, Sep 30, 2018 at 10:39 PM Mickaël Salaün mic@digikod.net wrote:
As a side note, I'm still working on Landlock which can achieve the same goal but in a more flexible and dynamic way: https://landlock.io
Isn't Landlock mostly intended for userspace that wants to impose a custom Mandatory Access Control policy on itself, restricting the whole process?
As far as I can tell, a major usecase for AT_BENEATH are privileged processes that do not want to restrict all filesystem operations they perform, but want to sometimes impose limits on filesystem traversal for the duration of a single system call. For example, a process might want to first open a file from an untrusted filesystem area with AT_BENEATH, and afterwards open a configuration file without AT_BENEATH.
How would you do this in Landlock? Use a BPF map to store per-thread filesystem restrictions, and then do bpf() calls before and after every restricted filesystem access to set and unset the policy for the current syscall?
On 9/29/18 12:34, Aleksa Sarai wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
On 9/30/18 23:46, Jann Horn wrote:
On Sun, Sep 30, 2018 at 10:39 PM Mickaël Salaün mic@digikod.net wrote:
As a side note, I'm still working on Landlock which can achieve the same goal but in a more flexible and dynamic way: https://landlock.io
Isn't Landlock mostly intended for userspace that wants to impose a custom Mandatory Access Control policy on itself, restricting the whole process?
As far as I can tell, a major usecase for AT_BENEATH are privileged processes that do not want to restrict all filesystem operations they perform, but want to sometimes impose limits on filesystem traversal for the duration of a single system call. For example, a process might want to first open a file from an untrusted filesystem area with AT_BENEATH, and afterwards open a configuration file without AT_BENEATH.
I didn't realized this was the main use case for AT_BENEATH. Landlock is indeed dedicated to apply a security policy on a set of processes. This set can be a process and its children (seccomp-like), or another set of processes that may be identified with a cgroup.
How would you do this in Landlock? Use a BPF map to store per-thread filesystem restrictions, and then do bpf() calls before and after every restricted filesystem access to set and unset the policy for the current syscall?
Another way to apply a security policy could be to tied it to a file descriptor, similarly to Capsicum, which could enable to create programmable (real) capabilities. This way, it would be possible to "wrap" a file descriptor with a Landlock program and use it with FD-based syscalls or pass it to other processes. This would not require changes to the FS subsystem, but only the Landlock LSM code. This isn't done yet but I plan to add this new way to restrict operations on file descriptors.
Anyway, for the use case you mentioned, the AT_BENEATH flag(s) should be simple to use and enough for now. We must be careful of the hardcoded policy though.
On 9/29/18 12:34, Aleksa Sarai wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
On Mon, 1 Oct 2018, Mickaël Salaün wrote:
Another way to apply a security policy could be to tied it to a file descriptor, similarly to Capsicum, which could enable to create programmable (real) capabilities. This way, it would be possible to "wrap" a file descriptor with a Landlock program and use it with FD-based syscalls or pass it to other processes. This would not require changes to the FS subsystem, but only the Landlock LSM code. This isn't done yet but I plan to add this new way to restrict operations on file descriptors.
Very interesting!
This could possibly be an LSM which stacks/integrates with other LSMs to enforce MAC of object capabilities.
Anyway, for the use case you mentioned, the AT_BENEATH flag(s) should be simple to use and enough for now. We must be careful of the hardcoded policy though.
On 9/29/18 12:34, Aleksa Sarai wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
- Mountpoint crossings are blocked by AT_XDEV.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS path didn't consider "/tmp/.." as a mountpoint crossing -- this patch blocks this as well (feel free to ask me to remove it if you feel this is not sane).
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
One additional feature I've implemented is AT_THIS_ROOT (I imagine this is probably going to be more contentious than the refresh of AT_NO_JUMPS, so I've included it in a separate patch). The patch itself describes my reasoning, but the shortened version of the premise is that continer runtimes need to have a way to resolve paths within a potentially malicious rootfs. Container runtimes currently do this in userspace[2] which has implicit race conditions that are not resolvable in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for path resolution, which would be invaluable for us -- and the implementation is basically identical to AT_BENEATH (except that we don't return errors when someone actually hits the root).
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
Aleksa Sarai (3): namei: implement O_BENEATH-style AT_* flags namei: implement AT_THIS_ROOT chroot-like path resolution selftests: vfs: add AT_* path resolution tests
fs/fcntl.c | 2 +- fs/namei.c | 158 ++++++++++++------ fs/open.c | 10 ++ fs/stat.c | 15 +- include/linux/fcntl.h | 3 +- include/linux/namei.h | 8 + include/uapi/asm-generic/fcntl.h | 20 +++ include/uapi/linux/fcntl.h | 10 ++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vfs/.gitignore | 1 + tools/testing/selftests/vfs/Makefile | 13 ++ tools/testing/selftests/vfs/at_flags.h | 40 +++++ tools/testing/selftests/vfs/common.sh | 37 ++++ .../selftests/vfs/tests/0001_at_beneath.sh | 72 ++++++++ .../selftests/vfs/tests/0002_at_xdev.sh | 54 ++++++ .../vfs/tests/0003_at_no_proclinks.sh | 50 ++++++ .../vfs/tests/0004_at_no_symlinks.sh | 49 ++++++ .../selftests/vfs/tests/0005_at_this_root.sh | 66 ++++++++ tools/testing/selftests/vfs/vfs_helper.c | 154 +++++++++++++++++ 19 files changed, 707 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/vfs/.gitignore create mode 100644 tools/testing/selftests/vfs/Makefile create mode 100644 tools/testing/selftests/vfs/at_flags.h create mode 100644 tools/testing/selftests/vfs/common.sh create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
xfstests, please. That way the new functionality will get immediate coverage by all the main filesystem development and distro QA teams....
Cheers,
Dave.
On 2018-10-01, Dave Chinner david@fromorbit.com wrote:
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
xfstests, please. That way the new functionality will get immediate coverage by all the main filesystem development and distro QA teams....
Sure, will do. Do you want me to submit them in parallel -- and what is the correct ML to submit changes to xfstests? Sorry for the silly questions. :P
On Mon, Oct 01, 2018 at 03:47:23PM +1000, Aleksa Sarai wrote:
On 2018-10-01, Dave Chinner david@fromorbit.com wrote:
I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
xfstests, please. That way the new functionality will get immediate coverage by all the main filesystem development and distro QA teams....
Sure, will do. Do you want me to submit them in parallel --
That's usually the way we do things, but we don't tend to commit the fstests changes until the thing it is testing has landed upstream.
and what is the correct ML to submit changes to xfstests?
fstests@vger.kernel.org
Sorry for the silly questions. :P
You're going to have many more of them when you start moving the tests across to fstests :P
Cheers,
Dave.
From: Aleksa Sarai
Sent: 29 September 2018 11:35
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
You may need to allow absolute paths that refer to items inside the controlled area. (Even if done by a textual replacement based on the expected name of the base directory.)
- Mountpoint crossings are blocked by AT_XDEV.
You might want a mountpoint flag that allows crossing into the mounted filesystem (you may need to get out in order to do pwd()).
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
Or 'fix' the /proc/$pid/fd/$fd code to open the actual vnode rather than being a symlink (although this might still let you get a directory vnode). FWIW this is what NetBSD does - you can link the open file back into the filesystem!
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
What about allowing 'trivial' symlinks?
...
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
If you make the flags a property of the directory vnode (perhaps as well as any syscall flags), and make it inherited by vnode lookup then it can be used to stop library functions (or entire binaries) using blocked paths. You'd then only need to add an fcntl() call to set the flags (but never clear them) to get the restriction applied to every lookup. ...
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 2018-10-01, David Laight David.Laight@ACULAB.COM wrote:
The need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
The most obvious change is that AT_NO_JUMPS has been split as dicussed in the original thread, along with a further split of AT_NO_PROCLINKS which means that each individual property of AT_NO_JUMPS is now a separate flag:
- Path-based escapes from the starting-point using "/" or ".." are blocked by AT_BENEATH.
You may need to allow absolute paths that refer to items inside the controlled area. (Even if done by a textual replacement based on the expected name of the base directory.)
This is sort of what AT_THIS_ROOT does. I didn't want to include it for AT_BENEATH because it would be just as contentious as AT_THIS_ROOT currently is. :P
- Mountpoint crossings are blocked by AT_XDEV.
You might want a mountpoint flag that allows crossing into the mounted filesystem (you may need to get out in order to do pwd()).
Like a mount flag? I'm not sure how I feel about that. The intention is to allow for a process to have control over how path lookups are handled, and tying it to a mount flag means that it's no longer entirely up to the process.
- /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
correctly it actually blocks any user of nd_jump_link() because it allows out-of-VFS path resolution manipulation).
Or 'fix' the /proc/$pid/fd/$fd code to open the actual vnode rather than being a symlink (although this might still let you get a directory vnode). FWIW this is what NetBSD does - you can link the open file back into the filesystem!
Isn't this how it works currently? The /proc/$pid/fd/$fd "symlinks" are actually references to the underlying file (they can even escape a pivot_root()) -- you can re-open them or do any number of other dodgy things through /proc with them (we definitely abuse this in container runtimes -- and I'm sure plenty of other people do as well).
AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At Linus' suggestion in the original thread, I've also implemented AT_NO_SYMLINKS which just denies _all_ symlink resolution (including "proclink" resolution).
What about allowing 'trivial' symlinks?
The use-case of AT_NO_SYMLINKS that Linus pitched[1] is that git wants to have a unique name for every object and so allowing trivial symlinks is a no-go. I assume "trivial" here means "no-'..' components"?
Currently I've only enabled these for openat(2) and the stat(2) family. I would hope we could enable it for basically every *at(2) syscall -- but many of them appear to not have a @flags argument and thus we'll need to add several new syscalls to do this. I'm more than happy to send those patches, but I'd prefer to know that this preliminary work is acceptable before doing a bunch of copy-paste to add new sets of *at(2) syscalls.
If you make the flags a property of the directory vnode (perhaps as well as any syscall flags), and make it inherited by vnode lookup then it can be used to stop library functions (or entire binaries) using blocked paths. You'd then only need to add an fcntl() call to set the flags (but never clear them) to get the restriction applied to every lookup.
This seems like it might be useful, but it could always be done as a follow-up patch by just setting LOOKUP_BLAH if the dirfd has the flag set. I'm also a little bit concerned that (because fd flags are set on the 'struct file') if you start sharing fds then you can no longer use the lookup scoping for security (a racing process could remove the flags while the management process resolves through it).
From: Aleksa Sarai
Sent: 01 October 2018 17:16
On 2018-10-01, David Laight David.Laight@ACULAB.COM wrote:
...
- Mountpoint crossings are blocked by AT_XDEV.
You might want a mountpoint flag that allows crossing into the mounted filesystem (you may need to get out in order to do pwd()).
Like a mount flag? I'm not sure how I feel about that. The intention is to allow for a process to have control over how path lookups are handled, and tying it to a mount flag means that it's no longer entirely up to the process.
Right, but you may have some mount points that you don't want to cross and others that it is perfectly fine to cross. For example you might want to be able to cross into a 'tmp' filesystem.
...
If you make the flags a property of the directory vnode (perhaps as well as any syscall flags), and make it inherited by vnode lookup then it can be used to stop library functions (or entire binaries) using blocked paths. You'd then only need to add an fcntl() call to set the flags (but never clear them) to get the restriction applied to every lookup.
This seems like it might be useful, but it could always be done as a follow-up patch by just setting LOOKUP_BLAH if the dirfd has the flag set. I'm also a little bit concerned that (because fd flags are set on the 'struct file') if you start sharing fds then you can no longer use the lookup scoping for security (a racing process could remove the flags while the management process resolves through it).
I was thinking that the flags would never be removable. A management process might have to flip its cwd back and forth in order to clear the flags (opendir(".") should give a different struct file).
This all gets tied up with the slight requirement for per-thread cwd.
I had another thought that the crudentials structure used for a file lookup could also be taken from the cwd (not sure how it would get there - especially if you need the correct group list). That would allow a 'management' process to open a file in the context of the target user process.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
linux-kselftest-mirror@lists.linaro.org