On Thu, Jul 24, 2025 at 4:00 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
Hi Al and Christian,
The commit 12f147ddd6de ("do_change_type(): refuse to operate on unmounted/not ours mounts") introduced an ABI backward compatibility break. CRIU depends on the previous behavior, and users are now reporting criu restore failures following the kernel update. This change has been propagated to stable kernels. Is this check strictly required?
Yes.
Would it be possible to check only if the current process has CAP_SYS_ADMIN within the mount user namespace?
Not enough, both in terms of permissions *and* in terms of "thou shalt not bugger the kernel data structures - nobody's priveleged enough for that".
Al,
I am still thinking in terms of "Thou shalt not break userspace"...
Seriously though, this original behavior has been in the kernel for 20 years, and it hasn't triggered any corruptions in all that time. I understand this change might be necessary in its current form, and that some collateral damage could be unavoidable. But if that's the case, I'd expect a detailed explanation of why it had to be so and why userspace breakage is unavoidable.
The original change was merged two decades ago. We need to consider that some applications might rely on that behavior. I'm not questioning the security aspect - that must be addressed. But for anything else, we need to minimize the impact on user applications that don't violate security.
We can consider a cleaner fix for the upstream kernel, but when we are talking about stable kernels, the user-space backward compatibility aspect should be even more critical.
Thanks, Andrei
On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
On Thu, Jul 24, 2025 at 4:00 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
Hi Al and Christian,
The commit 12f147ddd6de ("do_change_type(): refuse to operate on unmounted/not ours mounts") introduced an ABI backward compatibility break. CRIU depends on the previous behavior, and users are now reporting criu restore failures following the kernel update. This change has been propagated to stable kernels. Is this check strictly required?
Yes.
Would it be possible to check only if the current process has CAP_SYS_ADMIN within the mount user namespace?
Not enough, both in terms of permissions *and* in terms of "thou shalt not bugger the kernel data structures - nobody's priveleged enough for that".
Al,
I am still thinking in terms of "Thou shalt not break userspace"...
Seriously though, this original behavior has been in the kernel for 20 years, and it hasn't triggered any corruptions in all that time.
For a very mild example of fun to be had there: mount("none", "/mnt", "tmpfs", 0, ""); chdir("/mnt"); umount2(".", MNT_DETACH); mount(NULL, ".", NULL, MS_SHARED, NULL); Repeat in a loop, watch mount group id leak. That's a trivial example of violating the assertion ("a mount that had been through umount_tree() is out of propagation graph and related data structures for good").
As for the "CAP_SYS_ADMIN within the mount user namespace" - which userns do you have in mind?
On Sat, Jul 26, 2025 at 10:53 AM Al Viro viro@zeniv.linux.org.uk wrote:
On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
On Thu, Jul 24, 2025 at 4:00 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
Hi Al and Christian,
The commit 12f147ddd6de ("do_change_type(): refuse to operate on unmounted/not ours mounts") introduced an ABI backward compatibility break. CRIU depends on the previous behavior, and users are now reporting criu restore failures following the kernel update. This change has been propagated to stable kernels. Is this check strictly required?
Yes.
Would it be possible to check only if the current process has CAP_SYS_ADMIN within the mount user namespace?
Not enough, both in terms of permissions *and* in terms of "thou shalt not bugger the kernel data structures - nobody's priveleged enough for that".
Al,
I am still thinking in terms of "Thou shalt not break userspace"...
Seriously though, this original behavior has been in the kernel for 20 years, and it hasn't triggered any corruptions in all that time.
For a very mild example of fun to be had there: mount("none", "/mnt", "tmpfs", 0, ""); chdir("/mnt"); umount2(".", MNT_DETACH); mount(NULL, ".", NULL, MS_SHARED, NULL); Repeat in a loop, watch mount group id leak. That's a trivial example of violating the assertion ("a mount that had been through umount_tree() is out of propagation graph and related data structures for good").
I wasn't referring to detached mounts. CRIU modifies mounts from non-current namespaces.
As for the "CAP_SYS_ADMIN within the mount user namespace" - which userns do you have in mind?
The user namespace of the target mount: ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
If detached mounts are our only concern, it looks like the check instead of:
if (!check_mnt(mnt)) { err = -EINVAL; goto out_unlock; }
could've been a more relaxed one:
if (mnt_detached(mnt)) { err = -EINVAL; goto out_unlock; }
bool mnt_detached(struct mount *mnt) { return !mnt->mnt_ns; }
not to allow propagation change only on detached mounts. (As umount_tree sets mnt_ns to NULL.)
Also in do_mount_setattr we have a more relaxed check too:
if ((mnt_has_parent(mnt) || !is_anon_ns(mnt->mnt_ns)) && !check_mnt(mnt)) goto out;
Best Regards, Tikhomirov Pavel.
On Sun, Jul 27, 2025 at 5:01 AM Andrei Vagin avagin@google.com wrote:
On Sat, Jul 26, 2025 at 10:53 AM Al Viro viro@zeniv.linux.org.uk wrote:
On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
On Thu, Jul 24, 2025 at 4:00 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
Hi Al and Christian,
The commit 12f147ddd6de ("do_change_type(): refuse to operate on unmounted/not ours mounts") introduced an ABI backward compatibility break. CRIU depends on the previous behavior, and users are now reporting criu restore failures following the kernel update. This change has been propagated to stable kernels. Is this check strictly required?
Yes.
Would it be possible to check only if the current process has CAP_SYS_ADMIN within the mount user namespace?
Not enough, both in terms of permissions *and* in terms of "thou shalt not bugger the kernel data structures - nobody's priveleged enough for that".
Al,
I am still thinking in terms of "Thou shalt not break userspace"...
Seriously though, this original behavior has been in the kernel for 20 years, and it hasn't triggered any corruptions in all that time.
For a very mild example of fun to be had there: mount("none", "/mnt", "tmpfs", 0, ""); chdir("/mnt"); umount2(".", MNT_DETACH); mount(NULL, ".", NULL, MS_SHARED, NULL); Repeat in a loop, watch mount group id leak. That's a trivial example of violating the assertion ("a mount that had been through umount_tree() is out of propagation graph and related data structures for good").
I wasn't referring to detached mounts. CRIU modifies mounts from non-current namespaces.
As for the "CAP_SYS_ADMIN within the mount user namespace" - which userns do you have in mind?
The user namespace of the target mount: ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
On Thu, Jul 31, 2025 at 10:40:40AM +0800, Pavel Tikhomirov wrote:
If detached mounts are our only concern, it looks like the check instead of:
if (!check_mnt(mnt)) { err = -EINVAL; goto out_unlock; }
could've been a more relaxed one:
if (mnt_detached(mnt)) { err = -EINVAL; goto out_unlock; }
bool mnt_detached(struct mount *mnt) { return !mnt->mnt_ns; }
not to allow propagation change only on detached mounts. (As umount_tree sets mnt_ns to NULL.)
Changing propagation settings on detached mounts is fine and shoud work? Changing propagation settings on unmounted mounts not so much...
Also in do_mount_setattr we have a more relaxed check too:
if ((mnt_has_parent(mnt) || !is_anon_ns(mnt->mnt_ns)) && !check_mnt(mnt)) goto out;
Best Regards, Tikhomirov Pavel.
On Sun, Jul 27, 2025 at 5:01 AM Andrei Vagin avagin@google.com wrote:
On Sat, Jul 26, 2025 at 10:53 AM Al Viro viro@zeniv.linux.org.uk wrote:
On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
On Thu, Jul 24, 2025 at 4:00 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
Hi Al and Christian,
The commit 12f147ddd6de ("do_change_type(): refuse to operate on unmounted/not ours mounts") introduced an ABI backward compatibility break. CRIU depends on the previous behavior, and users are now reporting criu restore failures following the kernel update. This change has been propagated to stable kernels. Is this check strictly required?
Yes.
Would it be possible to check only if the current process has CAP_SYS_ADMIN within the mount user namespace?
Not enough, both in terms of permissions *and* in terms of "thou shalt not bugger the kernel data structures - nobody's priveleged enough for that".
Al,
I am still thinking in terms of "Thou shalt not break userspace"...
Seriously though, this original behavior has been in the kernel for 20 years, and it hasn't triggered any corruptions in all that time.
For a very mild example of fun to be had there: mount("none", "/mnt", "tmpfs", 0, ""); chdir("/mnt"); umount2(".", MNT_DETACH); mount(NULL, ".", NULL, MS_SHARED, NULL); Repeat in a loop, watch mount group id leak. That's a trivial example of violating the assertion ("a mount that had been through umount_tree() is out of propagation graph and related data structures for good").
I wasn't referring to detached mounts. CRIU modifies mounts from non-current namespaces.
As for the "CAP_SYS_ADMIN within the mount user namespace" - which userns do you have in mind?
The user namespace of the target mount: ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
On Thu, Jul 31, 2025 at 3:53 PM Christian Brauner brauner@kernel.org wrote:
On Thu, Jul 31, 2025 at 10:40:40AM +0800, Pavel Tikhomirov wrote:
If detached mounts are our only concern, it looks like the check instead of:
if (!check_mnt(mnt)) { err = -EINVAL; goto out_unlock; }
could've been a more relaxed one:
if (mnt_detached(mnt)) { err = -EINVAL; goto out_unlock; }
bool mnt_detached(struct mount *mnt) { return !mnt->mnt_ns; }
not to allow propagation change only on detached mounts. (As umount_tree sets mnt_ns to NULL.)
Changing propagation settings on detached mounts is fine and shoud work? Changing propagation settings on unmounted mounts not so much...
Sorry, it's my confused terminology, here by "detached" mounts I mean mounts which were unmounted with umount2(MNT_DETACH), maybe I should call them "unmounted" (e.g. s/mnt_detached/mnt_unmounted/).
And yes, I understand why we need to allow changing propagation on mounts in anonymous namespace without being inside it, because one can't just enter anonymous namespace.
I don't think that we need to change anything, just sharing my thoughts that it could be more relaxed and will still protect us from propagation setting on unmounted mounts.
Also in do_mount_setattr we have a more relaxed check too:
if ((mnt_has_parent(mnt) || !is_anon_ns(mnt->mnt_ns)) && !check_mnt(mnt)) goto out;
Best Regards, Tikhomirov Pavel.
On Sun, Jul 27, 2025 at 5:01 AM Andrei Vagin avagin@google.com wrote:
On Sat, Jul 26, 2025 at 10:53 AM Al Viro viro@zeniv.linux.org.uk wrote:
On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
On Thu, Jul 24, 2025 at 4:00 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote: > Hi Al and Christian, > > The commit 12f147ddd6de ("do_change_type(): refuse to operate on > unmounted/not ours mounts") introduced an ABI backward compatibility > break. CRIU depends on the previous behavior, and users are now > reporting criu restore failures following the kernel update. This change > has been propagated to stable kernels. Is this check strictly required?
Yes.
> Would it be possible to check only if the current process has > CAP_SYS_ADMIN within the mount user namespace?
Not enough, both in terms of permissions *and* in terms of "thou shalt not bugger the kernel data structures - nobody's priveleged enough for that".
Al,
I am still thinking in terms of "Thou shalt not break userspace"...
Seriously though, this original behavior has been in the kernel for 20 years, and it hasn't triggered any corruptions in all that time.
For a very mild example of fun to be had there: mount("none", "/mnt", "tmpfs", 0, ""); chdir("/mnt"); umount2(".", MNT_DETACH); mount(NULL, ".", NULL, MS_SHARED, NULL); Repeat in a loop, watch mount group id leak. That's a trivial example of violating the assertion ("a mount that had been through umount_tree() is out of propagation graph and related data structures for good").
I wasn't referring to detached mounts. CRIU modifies mounts from non-current namespaces.
As for the "CAP_SYS_ADMIN within the mount user namespace" - which userns do you have in mind?
The user namespace of the target mount: ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
On Sat, Jul 26, 2025 at 02:01:20PM -0700, Andrei Vagin wrote:
For a very mild example of fun to be had there: mount("none", "/mnt", "tmpfs", 0, ""); chdir("/mnt"); umount2(".", MNT_DETACH); mount(NULL, ".", NULL, MS_SHARED, NULL); Repeat in a loop, watch mount group id leak. That's a trivial example of violating the assertion ("a mount that had been through umount_tree() is out of propagation graph and related data structures for good").
I wasn't referring to detached mounts. CRIU modifies mounts from non-current namespaces.
As for the "CAP_SYS_ADMIN within the mount user namespace" - which userns do you have in mind?
The user namespace of the target mount: ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
To bring that thread back: how about the following? If nobody objects, I'm going to throw it into viro/vfs.git #fixes...
[PATCH] use uniform permission checks for all mount propagation changes
do_change_type() and do_set_group() are operating on different aspects of the same thing - propagation graph. The latter asks for mounts involved to be mounted in namespace(s) the caller has CAP_SYS_ADMIN for. The former is a mess - originally it didn't even check that mount *is* mounted. That got fixed, but the resulting check turns out to be too strict for userland - in effect, we check that mount is in our namespace, having already checked that we have CAP_SYS_ADMIN there.
What we really need (in both cases) is * we only touch mounts that are mounted. Hard requirement, data corruption if that's get violated. * we don't allow to mess with a namespace unless you already have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
That's an equivalent of what do_set_group() does; let's extract that into a helper (may_change_propagation()) and use it in both do_set_group() and do_change_type().
Fixes: 12f147ddd6de "do_change_type(): refuse to operate on unmounted/not ours mounts" Signed-off-by: Al Viro viro@zeniv.linux.org.uk --- diff --git a/fs/namespace.c b/fs/namespace.c index ddfd4457d338..e7d9b23f1e9e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2862,6 +2862,19 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) return attach_recursive_mnt(mnt, p, mp); }
+static int may_change_propagation(const struct mount *m) +{ + struct mnt_namespace *ns = m->mnt_ns; + + // it must be mounted in some namespace + if (IS_ERR_OR_NULL(ns)) // is_mounted() + return -EINVAL; + // and the caller must be admin in userns of that namespace + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + return -EPERM; + return 0; +} + /* * Sanity check the flags to change_mnt_propagation. */ @@ -2898,10 +2911,10 @@ static int do_change_type(struct path *path, int ms_flags) return -EINVAL;
namespace_lock(); - if (!check_mnt(mnt)) { - err = -EINVAL; + err = may_change_propagation(mnt); + if (err) goto out_unlock; - } + if (type == MS_SHARED) { err = invent_group_ids(mnt, recurse); if (err) @@ -3347,18 +3360,11 @@ static int do_set_group(struct path *from_path, struct path *to_path)
namespace_lock();
- err = -EINVAL; - /* To and From must be mounted */ - if (!is_mounted(&from->mnt)) - goto out; - if (!is_mounted(&to->mnt)) - goto out; - - err = -EPERM; - /* We should be allowed to modify mount namespaces of both mounts */ - if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN)) + err = may_change_propagation(from); + if (err) goto out; - if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN)) + err = may_change_propagation(from); + if (err) goto out;
err = -EINVAL;
On Wed, Aug 13, 2025 at 07:56:01PM +0100, Al Viro wrote:
@@ -3347,18 +3360,11 @@ static int do_set_group(struct path *from_path, struct path *to_path) namespace_lock();
- err = -EINVAL;
- /* To and From must be mounted */
- if (!is_mounted(&from->mnt))
goto out;
- if (!is_mounted(&to->mnt))
goto out;
- err = -EPERM;
- /* We should be allowed to modify mount namespaces of both mounts */
- if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
- err = may_change_propagation(from);
- if (err) goto out;
- if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
- err = may_change_propagation(from);
Just driving by, but I guess you mean "to" here.
Tycho
On Wed, Aug 13, 2025 at 01:09:27PM -0600, Tycho Andersen wrote:
On Wed, Aug 13, 2025 at 07:56:01PM +0100, Al Viro wrote:
@@ -3347,18 +3360,11 @@ static int do_set_group(struct path *from_path, struct path *to_path) namespace_lock();
- err = -EINVAL;
- /* To and From must be mounted */
- if (!is_mounted(&from->mnt))
goto out;
- if (!is_mounted(&to->mnt))
goto out;
- err = -EPERM;
- /* We should be allowed to modify mount namespaces of both mounts */
- if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
- err = may_change_propagation(from);
- if (err) goto out;
- if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
- err = may_change_propagation(from);
Just driving by, but I guess you mean "to" here.
D'oh... Yes, of course. Fun question: would our selftests have caught that? [checks] move_mount_set_group_test.c doesn't have anything in that area, nothing in LTP or xfstests either, AFAICS... And I don't see anything in https://github.com/checkpoint-restore/criu either - there are uses of MOVE_MOUNT_SET_GROUP, but they are well-buried and I don't see anything in their tests that would even try to poke into that thing...
Before we go and try to cobble something up, does anybody know of a place where regression tests for MOVE_MOUNT_SET_GROUP could be picked from?
On Thu, Aug 14, 2025 at 3:41 AM Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Aug 13, 2025 at 01:09:27PM -0600, Tycho Andersen wrote:
On Wed, Aug 13, 2025 at 07:56:01PM +0100, Al Viro wrote:
@@ -3347,18 +3360,11 @@ static int do_set_group(struct path *from_path, struct path *to_path)
namespace_lock();
- err = -EINVAL;
- /* To and From must be mounted */
- if (!is_mounted(&from->mnt))
goto out;
- if (!is_mounted(&to->mnt))
goto out;
- err = -EPERM;
- /* We should be allowed to modify mount namespaces of both mounts */
- if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
- err = may_change_propagation(from);
- if (err) goto out;
- if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
- err = may_change_propagation(from);
Just driving by, but I guess you mean "to" here.
D'oh... Yes, of course. Fun question: would our selftests have caught that? [checks] move_mount_set_group_test.c doesn't have anything in that area, nothing in LTP or xfstests either, AFAICS...
Yes, selftest is very simple and is not covering userns checks.
And I don't see anything in https://github.com/checkpoint-restore/criu either - there are uses of MOVE_MOUNT_SET_GROUP, but they are well-buried and I don't see anything in their tests that would even try to poke into that thing...
Before we go and try to cobble something up, does anybody know of a place where regression tests for MOVE_MOUNT_SET_GROUP could be picked from?
Basically each CRIU test that is run by zdtm (if it is in ns/uns flavor (which are most of them)), tests mounts checkpoint/restore. And each test which has shared/slave moutns leads to MOVE_MOUNT_SET_GROUP being used and thus tested. We have a mountinfo comparison in zdtm which checks that propagation is topologically the same after c/r.
But, yes, we do not cover userns checks, as in CRIU case, CRIU is expected to run in userns which has all capabilities over restored container, and should always pass those checks.
JFYI:
The use of MOVE_MOUNT_SET_GROUP in CRIU is well-buried in:
https://github.com/checkpoint-restore/criu/blob/116e56ba46382c05066d33a8bbad...
+-< move_mount_set_group +-< restore_one_sharing +-< restore_one_sharing_group +-< restore_mount_sharing_options +-< prepare_mnt_ns_v2
This stack already has a set of precreated mounts and walks over their sharing groups saved in CRIU image files and assigns them accordingly.
And we have a bunch of tests with different sharing configurations to test propagation c/r specifically:
git grep -l "SHARING|SLAVE" test/zdtm/static test/zdtm/static/mnt_ext_auto.c test/zdtm/static/mnt_ext_master.c test/zdtm/static/mnt_ext_multiple.c test/zdtm/static/mnt_root_ext.c test/zdtm/static/mntns_overmount.c test/zdtm/static/mntns_shared_bind03.c test/zdtm/static/mount_complex_sharing.c test/zdtm/static/mountpoints.c test/zdtm/static/shared_slave_mount_children.c
It should be enough to run a zdtm test-suit to check that change does not break something for CRIU (will do).
On Thu, Aug 14, 2025 at 12:08:49PM +0800, Pavel Tikhomirov wrote:
Yes, selftest is very simple and is not covering userns checks.
FWIW, see below for what I've got here at the moment for MOVE_MOUNT_SET_GROUP; no tests for cross-filesystem and not-a-subtree yet. At least it does catch that braino when run on a kernel that doesn't have it fixed ;-) No do_change_type() tests either yet...
// link with -lcap, assumes userns enabled // can run both as root and as regular user #define _GNU_SOURCE #include <sched.h> #include <sys/capability.h> #include <sys/mount.h> #include <sys/stat.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <errno.h> #include <stdbool.h>
_Bool drop_caps(void) { cap_value_t cap_value[] = { CAP_SYS_ADMIN }; cap_t cap = cap_get_proc(); if (!cap) { perror("cap_get_proc"); return false; } return true; }
void do_unshare(void) { FILE *f; uid_t uid = geteuid(); gid_t gid = getegid(); unshare(CLONE_NEWNS|CLONE_NEWUSER); f = fopen("/proc/self/uid_map", "w"); fprintf(f, "0 %d 1", uid); fclose(f); f = fopen("/proc/self/setgroups", "w"); fprintf(f, "deny"); fclose(f); f = fopen("/proc/self/gid_map", "w"); fprintf(f, "0 %d 1", gid); fclose(f); mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL); }
void bind(char *p) { mount(p, p, NULL, MS_BIND, NULL); }
void test_it(int fd1, char *p1, int fd2, char *p2, int expected) { int flags = MOVE_MOUNT_SET_GROUP; int n;
if (!p1) { p1 = ""; flags |= MOVE_MOUNT_F_EMPTY_PATH; } if (!p2) { p2 = ""; flags |= MOVE_MOUNT_T_EMPTY_PATH; } n = move_mount(fd1, p1, fd2, p2, flags); if (!n) errno = 0; if (expected != errno) printf(" failed: %d != %d\n", expected, errno); else printf(" OK\n"); }
int main() { int pipe1[2], pipe2[2]; char path[40]; pid_t child; int root_fd; char c;
if (pipe(pipe1) < 0 || pipe(pipe2) < 0) { perror("pipe"); return -1; } if (!drop_caps()) return -1; do_unshare();
root_fd = open("/", O_PATH);
errno = 0; mount("none", "/mnt", "tmpfs", 0, NULL); mkdir("/mnt/a", 0777); mkdir("/mnt/a/private", 0777); mkdir("/mnt/a/private/b", 0777); mkdir("/mnt/a/shared", 0777); mkdir("/mnt/a/slave", 0777); mkdir("/mnt/a/shared-slave", 0777); mkdir("/mnt/locked", 0777); mkdir("/mnt/no-locked", 0777); bind("/mnt/locked");
child = fork(); if (child < 0) { perror("fork"); return -1; } else if (child == 0) { do_unshare(); mount(NULL, "/mnt/", NULL, MS_SHARED, NULL); bind("/mnt/a"); write(pipe1[1], &c, 1); fchdir(root_fd); read(pipe2[0], &c, 1); printf("from should be someplace we have permissions for"); test_it(AT_FDCWD, "mnt/a", AT_FDCWD, "/mnt/a/private", EPERM); printf("to should be someplace we have permissions for"); test_it(AT_FDCWD, "/mnt/a", AT_FDCWD, "mnt/a/private", EPERM); write(pipe1[1], &c, 1); return 0; } read(pipe1[0], &c, 1); sprintf(path, "/proc/%d/root", child); chdir(path);
mount(NULL, "/mnt", NULL, MS_SHARED, NULL); bind("/mnt/a/private"); bind("/mnt/a/shared"); bind("/mnt/a/slave"); bind("/mnt/a/slave-shared"); bind("/mnt/no-locked"); mount(NULL, "/mnt/a/private", NULL, MS_PRIVATE, NULL); mount(NULL, "/mnt/a/slave", NULL, MS_SLAVE, NULL); mount(NULL, "/mnt/a/shared-slave", NULL, MS_SLAVE, NULL); mount(NULL, "/mnt/a/shared-slave", NULL, MS_SHARED, NULL); mount(NULL, "/mnt/no-locked", NULL, MS_PRIVATE, NULL);
printf("from should be mounted (pipes are not)"); test_it(pipe1[0], NULL, AT_FDCWD, "/mnt/a/private", EINVAL);
printf("to should be mounted (pipes are not)"); test_it(AT_FDCWD, "/mnt", pipe1[0], NULL, EINVAL);
printf("from should be someplace we have permissions for"); test_it(AT_FDCWD, "mnt/a", AT_FDCWD, "/mnt/a/private", 0); mount(NULL, "/mnt/a/private", NULL, MS_PRIVATE, NULL);
printf("from should be mountpoint"); test_it(AT_FDCWD, "/mnt/a", AT_FDCWD, "/mnt/a/private", EINVAL);
printf("to should be mountpoint"); test_it(AT_FDCWD, "/mnt/a", AT_FDCWD, "/mnt/a/private/b", EINVAL);
printf("from should not have anything locked in counterpart of to"); test_it(AT_FDCWD, "mnt", AT_FDCWD, "/mnt/locked", EINVAL);
printf("from should not have anything locked in counterpart of to"); test_it(AT_FDCWD, "mnt", AT_FDCWD, "/mnt/no-locked", 0); mount(NULL, "/mnt/no-locked", NULL, MS_PRIVATE, NULL);
fflush(stdout); write(pipe2[1], &c, 1); read(pipe1[0], &c, 1); return 0; }
[this should fix userland regression from do_change_type() fix last cycle; we have too little self-test coverage in the area, unfortunately. First approximation for selftest in the followup to this posting. Review and testing of both the patch and test would be very welcome]
do_change_type() and do_set_group() are operating on different aspects of the same thing - propagation graph. The latter asks for mounts involved to be mounted in namespace(s) the caller has CAP_SYS_ADMIN for. The former is a mess - originally it didn't even check that mount *is* mounted. That got fixed, but the resulting check turns out to be too strict for userland - in effect, we check that mount is in our namespace, having already checked that we have CAP_SYS_ADMIN there.
What we really need (in both cases) is * we only touch mounts that are mounted. Hard requirement, data corruption if that's get violated. * we don't allow to mess with a namespace unless you already have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
That's an equivalent of what do_set_group() does; let's extract that into a helper (may_change_propagation()) and use it in both do_set_group() and do_change_type().
Fixes: 12f147ddd6de "do_change_type(): refuse to operate on unmounted/not ours mounts" Signed-off-by: Al Viro viro@zeniv.linux.org.uk --- diff --git a/fs/namespace.c b/fs/namespace.c index ddfd4457d338..a191c6519e36 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2862,6 +2862,19 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) return attach_recursive_mnt(mnt, p, mp); }
+static int may_change_propagation(const struct mount *m) +{ + struct mnt_namespace *ns = m->mnt_ns; + + // it must be mounted in some namespace + if (IS_ERR_OR_NULL(ns)) // is_mounted() + return -EINVAL; + // and the caller must be admin in userns of that namespace + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) + return -EPERM; + return 0; +} + /* * Sanity check the flags to change_mnt_propagation. */ @@ -2898,10 +2911,10 @@ static int do_change_type(struct path *path, int ms_flags) return -EINVAL;
namespace_lock(); - if (!check_mnt(mnt)) { - err = -EINVAL; + err = may_change_propagation(mnt); + if (err) goto out_unlock; - } + if (type == MS_SHARED) { err = invent_group_ids(mnt, recurse); if (err) @@ -3347,18 +3360,11 @@ static int do_set_group(struct path *from_path, struct path *to_path)
namespace_lock();
- err = -EINVAL; - /* To and From must be mounted */ - if (!is_mounted(&from->mnt)) - goto out; - if (!is_mounted(&to->mnt)) - goto out; - - err = -EPERM; - /* We should be allowed to modify mount namespaces of both mounts */ - if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN)) + err = may_change_propagation(from); + if (err) goto out; - if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN)) + err = may_change_propagation(to); + if (err) goto out;
err = -EINVAL;
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree. Subject: [PATCH][RFC][CFT] use uniform permission checks for all mount propagation changes Link: https://lore.kernel.org/stable/20250814055142.GN222315%40ZenIV
// link with -lcap, can run both as root and as regular user #define _GNU_SOURCE #include <sched.h> #include <sys/capability.h> #include <sys/mount.h> #include <sys/stat.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <errno.h> #include <stdbool.h>
_Bool drop_caps(void) { cap_value_t cap_value[] = { CAP_SYS_ADMIN }; cap_t cap = cap_get_proc(); if (!cap) { perror("cap_get_proc"); return false; } return true; }
void do_unshare(void) { FILE *f; uid_t uid = geteuid(); gid_t gid = getegid(); unshare(CLONE_NEWNS|CLONE_NEWUSER); f = fopen("/proc/self/uid_map", "w"); fprintf(f, "0 %d 1", uid); fclose(f); f = fopen("/proc/self/setgroups", "w"); fprintf(f, "deny"); fclose(f); f = fopen("/proc/self/gid_map", "w"); fprintf(f, "0 %d 1", gid); fclose(f); mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL); }
void bind(char *p) { mount(p, p, NULL, MS_BIND, NULL); }
void change_type(char *p, int type) { errno = 0; mount(NULL, p, NULL, type, NULL); }
void set_group(int fd1, char *p1, int fd2, char *p2) { int flags = MOVE_MOUNT_SET_GROUP; int n;
if (!p1 || !*p1) { p1 = ""; // be kind to old kernels flags |= MOVE_MOUNT_F_EMPTY_PATH; } if (!p2 || !*p2) { p2 = ""; // be kind to old kernels flags |= MOVE_MOUNT_T_EMPTY_PATH; } errno = 0; move_mount(fd1, p1, fd2, p2, flags); }
_Bool result(int expected) { if (expected != errno) { printf(" failed: %d != %d\n", expected, errno); return false; } printf(" OK\n"); return true; }
int fd1[2], fd2[2];
void in_child(void) { printf("from should be someplace we have permissions for"); set_group(AT_FDCWD, "mnt/a", AT_FDCWD, "/mnt/a/private"); result(EPERM); printf("to should be someplace we have permissions for"); set_group(AT_FDCWD, "/mnt/a", AT_FDCWD, "mnt/a/private"); result(EPERM); printf("change_type: should have permissions for target"); change_type("mnt/locked", MS_PRIVATE); result(EPERM); }
void in_parent(void) { printf("from should be mounted (pipes are not)"); set_group(fd1[0], NULL, AT_FDCWD, "/mnt/a/private"); result(EINVAL);
printf("to should be mounted (pipes are not)"); set_group(AT_FDCWD, "/mnt", fd1[0], NULL); result(EINVAL);
printf("from should be someplace we have permissions for"); set_group(AT_FDCWD, "mnt/a", AT_FDCWD, "/mnt/a/private"); if (result(0)) change_type("/mnt/a/private", MS_PRIVATE);
printf("to should be someplace we have permissions for"); set_group(AT_FDCWD, "/mnt", AT_FDCWD, "mnt/a/private"); if (result(0)) change_type("mnt/a/private", MS_PRIVATE);
printf("from should be mountpoint"); set_group(AT_FDCWD, "/mnt/a", AT_FDCWD, "/mnt/a/private"); result(EINVAL);
printf("to should be mountpoint"); set_group(AT_FDCWD, "/mnt/a", AT_FDCWD, "/mnt/a/private/b"); result(EINVAL);
printf("to and from should be on the same filesystem"); mount("none", "/mnt/no-locked", "tmpfs", 0, NULL); set_group(AT_FDCWD, "/mnt/a/shared", AT_FDCWD, "/mnt/no-locked"); result(EINVAL); umount("/mnt/no-locked");
printf("from should contain the counterpart of to"); set_group(AT_FDCWD, "/mnt/a/shared", AT_FDCWD, "/mnt/no-locked"); result(EINVAL);
printf("from should not have anything locked in counterpart of to"); set_group(AT_FDCWD, "mnt", AT_FDCWD, "/mnt/no-locked"); if (result(0)) change_type("/mnt/no-locked", MS_PRIVATE);
printf("change_type: should have permissions for target"); change_type("mnt/a/private", MS_PRIVATE); result(0);
printf("change_type: target should be a mountpoint"); change_type("/mnt/a", MS_PRIVATE); result(EINVAL);
chdir("/mnt/a/private"); umount2("/mnt/a/private", MNT_DETACH); printf("change_type: target should be mounted"); change_type(".", MS_PRIVATE); result(EINVAL); }
int main() { char path[40]; pid_t child; int root_fd; char c;
if (pipe(fd1) < 0 || pipe(fd2) < 0) { perror("pipe"); return -1; } if (!drop_caps()) return -1; do_unshare();
root_fd = open("/", O_PATH);
errno = 0; mount("none", "/mnt", "tmpfs", 0, NULL); mkdir("/mnt/a", 0777); mkdir("/mnt/a/private", 0777); mkdir("/mnt/a/private/b", 0777); mkdir("/mnt/a/shared", 0777); mkdir("/mnt/a/slave", 0777); mkdir("/mnt/a/shared-slave", 0777); mkdir("/mnt/locked", 0777); mkdir("/mnt/no-locked", 0777); bind("/mnt/locked");
child = fork(); if (child < 0) { perror("fork"); return -1; } else if (child == 0) { do_unshare(); change_type("/mnt/", MS_SHARED); bind("/mnt/a"); bind("/mnt/a/private"); change_type("/mnt/a/private", MS_PRIVATE); write(fd1[1], &c, 1); read(fd2[0], &c, 1);
fchdir(root_fd); in_child();
write(fd1[1], &c, 1); return 0; } read(fd1[0], &c, 1); sprintf(path, "/proc/%d/root", child); chdir(path);
change_type("/mnt", MS_SHARED); bind("/mnt/a/private"); bind("/mnt/a/shared"); bind("/mnt/a/slave"); bind("/mnt/a/slave-shared"); bind("/mnt/no-locked"); change_type("/mnt/a/private", MS_PRIVATE); change_type("/mnt/a/slave", MS_SLAVE); change_type("/mnt/a/shared-slave", MS_SLAVE); change_type("/mnt/a/shared-slave", MS_SHARED); change_type("/mnt/no-locked", MS_PRIVATE);
in_parent();
fflush(stdout); write(fd2[1], &c, 1); read(fd1[0], &c, 1); return 0; }
void do_unshare(void) { FILE *f; uid_t uid = geteuid(); gid_t gid = getegid(); unshare(CLONE_NEWNS|CLONE_NEWUSER); f = fopen("/proc/self/uid_map", "w"); fprintf(f, "0 %d 1", uid); fclose(f); f = fopen("/proc/self/setgroups", "w"); fprintf(f, "deny"); fclose(f); f = fopen("/proc/self/gid_map", "w"); fprintf(f, "0 %d 1", gid); fclose(f); mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL); }
This obviously needs error checking - in this form it won't do anything good without userns enabled (coredump on the first fprintf() in there, since there won't be /proc/self/uid_map); should probably just report CLONE_NEWUSER failure, warn about skipped tests, fall back to unshare(CLONE_NEWNS) and skip everything in in_child()...
It should be enough to run a zdtm test-suit to check that change does not break something for CRIU (will do).
jfyi: checked 0cc53520e68 with patch "[PATCH] use uniform permission checks for all mount propagation changes" (+ s/from/to/), there is no problem on criu-zdtm mount related tests. I see some problems on socket related tests on it, but it looks unrelated.
linux-stable-mirror@lists.linaro.org