Michal Koutný mkoutny@suse.com writes:
On Sat, Feb 12, 2022 at 03:32:30PM +0000, Etienne Dechamps etienne@edechamps.fr wrote:
I'm not sure I understand everything that's going on in this thread but it does seem very relevant. You guys might want to double-check the behavior in the particular scenario described there. I'm mostly sending this to make sure everything is cross-linked.
Thanks for the report with strace.
AFAICT, it's caused by setresuid() after unshare(), i.e. all root's tasks are (wrongly) compared against the lowered RLIMIT_NPROC.
This is tackled by my RFC patch 2/6 [1] or Eric's variant but 3/8 (equivalent fix for this case but I haven't run that build).
Michal
[1] I could run your test (LimitNPROC=1 actually) against kernel with my patches and the service starts.
So I looked into this and our previous patchsets (but not my final one) did resolve this.
What fixed it and what is needed to fix this is not enforcing RLIMIT_NPROC when the user who creates the user namespace is INIT_USER.
AKA something like the patch below. It is a regression so if at all possible it needs to be fixed, and it is certainly possible.
The patch below feels right at first glance, but I am not convinced that testing cred->user or cred->ucounts is the proper test so I am going to sleep on this a little bit.
I did want everyone to know I looked into this and I am going to ensure this gets fixed.
diff --git a/kernel/fork.c b/kernel/fork.c index 17d8a8c85e3b..532ce5cbf851 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
retval = -EAGAIN; if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { - if (p->real_cred->user != INIT_USER && + if (p->real_cred->ucounts != &init_ucounts && !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) goto bad_fork_cleanup_count; } diff --git a/kernel/sys.c b/kernel/sys.c index 97dc9e5d6bf9..7b5d74a7845c 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -490,7 +490,7 @@ static void flag_nproc_exceeded(struct cred *new) * failure to the execve() stage. */ if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) && - new->user != INIT_USER) + new->ucounts != &init_ucounts) current->flags |= PF_NPROC_EXCEEDED; else current->flags &= ~PF_NPROC_EXCEEDED; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 6b2e3ca7ee99..925fb3579ef3 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -123,6 +123,8 @@ int create_user_ns(struct cred *new) ns->ucount_max[i] = INT_MAX; } set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)); + if (new->ucounts == &init_ucounts) + set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIM_INFINITY); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING)); set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));