In [1] Michal Koutný mkoutny@suse.com reported that it does not make sense to unconditionally exempt the INIT_USER during fork and exec from RLIMIT_NPROC and then to impose a limit on that same user with is_ucounts_overlimit. So I looked into why the exeception was added.
commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") says:
If a setuid process swaps it's real and effective uids and then forks, the fork fails if the new realuid has more processes than the original process was limited to. This is particularly a problem if a user with a process limit (e.g. 256) runs a setuid-root program which does setuid() + fork() (e.g. lprng) while root already has more than 256 process (which is quite possible).
The root problem here is that a limit which should be a per-user limit is being implemented as a per-process limit with per-process (e.g. CAP_SYS_RESOURCE) controls. Being a per-user limit, it should be that the root-user can over-ride it, not just some process with CAP_SYS_RESOURCE.
This patch adds a test to ignore process limits if the real user is root.
With the moving of the limits from per-user to per-user-per-user_ns it is clear that testing a user_struct is no longer the proper test and the test should be a test against the ucounts struct to match the rest of the logic change that was made.
With RLIMIT_NPROC not being enforced for the global root user anywhere else should it be enforced in is_ucounts_overlimit for a user namespace created by the global root user?
Since this is limited to just the global root user, and RLIMIT_NPROC is already ignored for that user I am going to vote no.
This change does imply that it becomes possible to limit all users in a user namespace but to not enforce the rlimits on the root user or anyone with CAP_SYS_ADMIN and CAP_SYS_RESOURCE in the user namespace. It is not clear to me why any of those exceptions exist so I figure we should until this is actually a problem for someone before we relax the permission checks here.
Cc: stable@vger.kernel.org Reported-by: Michal Koutný mkoutny@suse.com [1] https://lkml.kernel.org/r/20220207121800.5079-5-mkoutny@suse.com History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com --- fs/exec.c | 2 +- kernel/fork.c | 2 +- kernel/user_namespace.c | 2 ++ 3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index 1e7f757cbc2c..01c8c7bae9b4 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename, */ if ((current->flags & PF_NPROC_CHECK) && is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) && - (current_user() != INIT_USER) && + (current_ucounts() != &init_ucounts) && !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) { retval = -EAGAIN; goto out_ret; diff --git a/kernel/fork.c b/kernel/fork.c index 2b6a28a86325..6f62d37f3650 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
retval = -EAGAIN; if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { - if (p->real_cred->user != INIT_USER && + if ((task_ucounts(p) != &init_ucounts) && !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) goto bad_fork_cleanup_count; } diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 6b2e3ca7ee99..f0c04073403d 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -123,6 +123,8 @@ int create_user_ns(struct cred *new) ns->ucount_max[i] = INT_MAX; } set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)); + if (new->ucounts == &init_ucounts) + set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
On Thu, Feb 10, 2022 at 08:13:20PM -0600, "Eric W. Biederman" ebiederm@xmission.com wrote:
@@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename,
[...]
(current_user() != INIT_USER) &&
(current_ucounts() != &init_ucounts) &&
[...]
@@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
[...]
if (p->real_cred->user != INIT_USER &&
if ((task_ucounts(p) != &init_ucounts) &&
These substitutions make sense to me.
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) goto bad_fork_cleanup_count;
} diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 6b2e3ca7ee99..f0c04073403d 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -123,6 +123,8 @@ int create_user_ns(struct cred *new) ns->ucount_max[i] = INT_MAX; } set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
- if (new->ucounts == &init_ucounts)
set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY);
First, I wanted to object this double fork_init() but I realized it's relevant for newly created user_ns.
Second, I think new->ucounts would be correct at this point and the check should be
if (ucounts == &init_ucounts)
i.e. before set_cred_ucounts() new->ucounts may not be correct.
I'd suggest also a comment in the create_user_ns() explaining the reason is to exempt global root from RLIMINT_NRPOC also indirectly via descendant user_nss.
Thanks, Michal
Michal Koutný mkoutny@suse.com writes:
On Thu, Feb 10, 2022 at 08:13:20PM -0600, "Eric W. Biederman" ebiederm@xmission.com wrote:
@@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename,
[...]
(current_user() != INIT_USER) &&
(current_ucounts() != &init_ucounts) &&
[...]
@@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
[...]
if (p->real_cred->user != INIT_USER &&
if ((task_ucounts(p) != &init_ucounts) &&
These substitutions make sense to me.
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) goto bad_fork_cleanup_count;
} diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 6b2e3ca7ee99..f0c04073403d 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -123,6 +123,8 @@ int create_user_ns(struct cred *new) ns->ucount_max[i] = INT_MAX; } set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
- if (new->ucounts == &init_ucounts)
set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY);
First, I wanted to object this double fork_init() but I realized it's relevant for newly created user_ns.
Second, I think new->ucounts would be correct at this point and the check should be
if (ucounts == &init_ucounts)
i.e. before set_cred_ucounts() new->ucounts may not be correct.
I'd suggest also a comment in the create_user_ns() explaining the reason is to exempt global root from RLIMINT_NRPOC also indirectly via descendant user_nss.
Yes.
This one got culled from my next version of the patchset as it is not conservative enough. I think it is probably the right general direction.
On further reflection I am not convinced that it makes sense to test user or ucounts. They are really not fields designed to support permission checks.
I think if we want to exempt the root user's children from the root users rlimit using the second set_rlimit_ucount_max is the way to go.
Someone filed a bug that strongly suggests that we want the second set_rlimit_ucount_max: https://bugzilla.kernel.org/show_bug.cgi?id=215596
I am still trying to understand that case.
Eric
linux-stable-mirror@lists.linaro.org