Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin roman.gushchin@linux.dev Co-developed-by: Andrei Vagin avagin@google.com Signed-off-by: Andrei Vagin avagin@google.com Cc: Kees Cook kees@kernel.org Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Alexey Gladkov legion@kernel.org Cc: stable@vger.kernel.org --- include/linux/user_namespace.h | 3 ++- kernel/signal.c | 3 ++- kernel/ucount.c | 5 +++-- 3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 3625096d5f85..7183e5aca282 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type, + bool override_rlimit); void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type); bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
diff --git a/kernel/signal.c b/kernel/signal.c index 4344860ffcac..cbabb2d05e0a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t); - sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING); + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, + override_rlimit); rcu_read_unlock(); if (!sigpending) return NULL; diff --git a/kernel/ucount.c b/kernel/ucount.c index 16c0ea1cb432..046b3d57ebb4 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); }
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type, + bool override_rlimit) { /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]); - if (new < 0 || new > max) + if (new < 0 || (!override_rlimit && (new > max))) goto unwind; if (iter == ucounts) ret = new;
Roman Gushchin roman.gushchin@linux.dev writes:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set.
Not true.
It added a limit on the number of siginfo structures that a container may allocate. Have you tried not limiting your container?
This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Note. There are always conditions when the allocation may fail. The structure is allocated with __GFP_ATOMIC so it is much more likely to fail than a typical kernel memory allocation.
But I agree it does look like there is a quality of implementation issue here.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Instead please just give the container and unlimited number of siginfo structures it can play with.
The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING) value when the user namespace is created.
Given that it took 3 and half years to report this. I am going to say this really looks like a userspace bug.
Beyond that your patch is actually buggy, and should not be applied.
If we want to change the semantics and ignore the maximum number of pending signals in a container (when override_rlimit is set) then the code should change the computation of the max value (pegging it at LONG_MAX) and not ignore it.
As it is the patch below disables the check that keeps the ucount counters from wrapping around. That makes it possible for someone to overflow those counters and get into all kinds of trouble.
Eric
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin roman.gushchin@linux.dev Co-developed-by: Andrei Vagin avagin@google.com Signed-off-by: Andrei Vagin avagin@google.com Cc: Kees Cook kees@kernel.org Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Alexey Gladkov legion@kernel.org Cc: stable@vger.kernel.org
include/linux/user_namespace.h | 3 ++- kernel/signal.c | 3 ++- kernel/ucount.c | 5 +++-- 3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 3625096d5f85..7183e5aca282 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit);
void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type); bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max); diff --git a/kernel/signal.c b/kernel/signal.c index 4344860ffcac..cbabb2d05e0a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
rcu_read_unlock(); if (!sigpending) return NULL;override_rlimit);
diff --git a/kernel/ucount.c b/kernel/ucount.c index 16c0ea1cb432..046b3d57ebb4 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); } -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit)
{ /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]);
if (new < 0 || new > max)
if (iter == ucounts) ret = new;if (new < 0 || (!override_rlimit && (new > max))) goto unwind;
On Fri, Nov 01, 2024 at 02:51:00PM -0500, Eric W. Biederman wrote:
Roman Gushchin roman.gushchin@linux.dev writes:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set.
Not true.
It added a limit on the number of siginfo structures that a container may allocate. Have you tried not limiting your container?
This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Note. There are always conditions when the allocation may fail. The structure is allocated with __GFP_ATOMIC so it is much more likely to fail than a typical kernel memory allocation.
But I agree it does look like there is a quality of implementation issue here.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Instead please just give the container and unlimited number of siginfo structures it can play with.
Well, personally I'd not use this limit too, but I don't think "it's broken, userspace shouldn't use it" argument is valid.
The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING) value when the user namespace is created.
Given that it took 3 and half years to report this. I am going to say this really looks like a userspace bug.
The trick here is another bug fixed by https://lkml.org/lkml/2024/10/31/185. Basically it's a leak of the rlimit value. If a limit is set and reached in the reality, all following signals will not have a siginfo attached, causing applications which depend on handling SIGSEGV to crash.
Beyond that your patch is actually buggy, and should not be applied.
If we want to change the semantics and ignore the maximum number of pending signals in a container (when override_rlimit is set) then the code should change the computation of the max value (pegging it at LONG_MAX) and not ignore it.
Hm, isn't the unconditional (new < 0) enough to capture the overflow? Actually I'm not sure I understand how "long new" can be "> LONG_MAX" anyway.
Thanks!
Roman Gushchin roman.gushchin@linux.dev writes:
On Fri, Nov 01, 2024 at 02:51:00PM -0500, Eric W. Biederman wrote:
Roman Gushchin roman.gushchin@linux.dev writes:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set.
Not true.
It added a limit on the number of siginfo structures that a container may allocate. Have you tried not limiting your container?
This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Note. There are always conditions when the allocation may fail. The structure is allocated with __GFP_ATOMIC so it is much more likely to fail than a typical kernel memory allocation.
But I agree it does look like there is a quality of implementation issue here.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Instead please just give the container and unlimited number of siginfo structures it can play with.
Well, personally I'd not use this limit too, but I don't think "it's broken, userspace shouldn't use it" argument is valid.
I said if you don't want the limit don't use it.
A version of "Doctor it hurts when I do this". To which the doctor replies "Don't do that then".
I was also asking that you test with the limit disabled (at user namespace creation time) so that you can verify that is problem.
The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING) value when the user namespace is created.
Given that it took 3 and half years to report this. I am going to say this really looks like a userspace bug.
The trick here is another bug fixed by https://lkml.org/lkml/2024/10/31/185. Basically it's a leak of the rlimit value. If a limit is set and reached in the reality, all following signals will not have a siginfo attached, causing applications which depend on handling SIGSEGV to crash.
I will take a deeper look at the patch you are referring to.
Beyond that your patch is actually buggy, and should not be applied.
If we want to change the semantics and ignore the maximum number of pending signals in a container (when override_rlimit is set) then the code should change the computation of the max value (pegging it at LONG_MAX) and not ignore it.
Hm, isn't the unconditional (new < 0) enough to capture the overflow? Actually I'm not sure I understand how "long new" can be "> LONG_MAX" anyway.
Agreed "new < 0" should catch that, but still splitting the logic between the calculation of max and the test of max is quite confusing. It makes much more sense to put the logic into the calculate of max.
Eric
On Fri, Nov 01, 2024 at 03:58:07PM -0500, Eric W. Biederman wrote:
Roman Gushchin roman.gushchin@linux.dev writes:
On Fri, Nov 01, 2024 at 02:51:00PM -0500, Eric W. Biederman wrote:
Roman Gushchin roman.gushchin@linux.dev writes:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set.
Not true.
It added a limit on the number of siginfo structures that a container may allocate. Have you tried not limiting your container?
This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Note. There are always conditions when the allocation may fail. The structure is allocated with __GFP_ATOMIC so it is much more likely to fail than a typical kernel memory allocation.
But I agree it does look like there is a quality of implementation issue here.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Instead please just give the container and unlimited number of siginfo structures it can play with.
Well, personally I'd not use this limit too, but I don't think "it's broken, userspace shouldn't use it" argument is valid.
I said if you don't want the limit don't use it.
A version of "Doctor it hurts when I do this". To which the doctor replies "Don't do that then".
I was also asking that you test with the limit disabled (at user namespace creation time) so that you can verify that is problem.
The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING) value when the user namespace is created.
Given that it took 3 and half years to report this. I am going to say this really looks like a userspace bug.
The trick here is another bug fixed by https://lkml.org/lkml/2024/10/31/185. Basically it's a leak of the rlimit value. If a limit is set and reached in the reality, all following signals will not have a siginfo attached, causing applications which depend on handling SIGSEGV to crash.
I will take a deeper look at the patch you are referring to.
Beyond that your patch is actually buggy, and should not be applied.
If we want to change the semantics and ignore the maximum number of pending signals in a container (when override_rlimit is set) then the code should change the computation of the max value (pegging it at LONG_MAX) and not ignore it.
Hm, isn't the unconditional (new < 0) enough to capture the overflow? Actually I'm not sure I understand how "long new" can be "> LONG_MAX" anyway.
Agreed "new < 0" should catch that, but still splitting the logic between the calculation of max and the test of max is quite confusing. It makes much more sense to put the logic into the calculate of max.
You mean something like this?
diff --git a/kernel/ucount.c b/kernel/ucount.c index 046b3d57ebb4..49fcec41e5b4 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -317,11 +317,12 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]); - if (new < 0 || (!override_rlimit && (new > max))) + if (new < 0 || new > max) goto unwind; if (iter == ucounts) ret = new; - max = get_userns_rlimit_max(iter->ns, type); + if (!override_rlimit) + max = get_userns_rlimit_max(iter->ns, type); /* * Grab an extra ucount reference for the caller when * the rlimit count was previously 0.
--
If you strongly prefer this version, I can send a v2. I like my original version slightly better, but not a strong preference. Please, let me know.
Thanks!
On Fri, Nov 1, 2024 at 1:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Well, personally I'd not use this limit too, but I don't think "it's broken, userspace shouldn't use it" argument is valid.
I said if you don't want the limit don't use it.
A version of "Doctor it hurts when I do this". To which the doctor replies "Don't do that then".
Unfortunately, it isn't an option here. This is a non-root user that creates a new user-namespace. It can't change RLIMIT_SIGPENDING beyond the current limit.
We have to distinguish between two types of signals:
* Kernel signals: These are essential for proper application behavior. If a user process triggers a memory fault, it gets SIGSEGV and it can’t ignore it. The number of such signals is limited by one per user thread.
* User signals: These are sent by users and can be blocked by the receiving process, potentially leading to a large queue of pending signals. This is where the RLIMIT_SIGPENDING limit plays its role in preventing excessive resource consumption.
New user namespaces inherit the current sigpending rlimit, so it is expected that the behavior of the user-namespace limit is aligned with the overall RLIMIT_SIGPENDING mechanism.
I was also asking that you test with the limit disabled (at user namespace creation time) so that you can verify that is problem.
The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING) value when the user namespace is created.
Given that it took 3 and half years to report this. I am going to say this really looks like a userspace bug.
This issue was introduced in v5.15. The latest rhel release, 9.4, is based on v5.14... 3.5 years is not a long time...
Thanks, Andrei
On Fri, Nov 01, 2024 at 03:44:48PM -0700, Andrei Vagin wrote:
On Fri, Nov 1, 2024 at 1:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Well, personally I'd not use this limit too, but I don't think "it's broken, userspace shouldn't use it" argument is valid.
I said if you don't want the limit don't use it.
A version of "Doctor it hurts when I do this". To which the doctor replies "Don't do that then".
Unfortunately, it isn't an option here. This is a non-root user that creates a new user-namespace. It can't change RLIMIT_SIGPENDING beyond the current limit.
We have to distinguish between two types of signals:
- Kernel signals: These are essential for proper application behavior.
If a user process triggers a memory fault, it gets SIGSEGV and it can’t ignore it. The number of such signals is limited by one per user thread.
- User signals: These are sent by users and can be blocked by the
receiving process, potentially leading to a large queue of pending signals. This is where the RLIMIT_SIGPENDING limit plays its role in preventing excessive resource consumption.
New user namespaces inherit the current sigpending rlimit, so it is expected that the behavior of the user-namespace limit is aligned with the overall RLIMIT_SIGPENDING mechanism.
Hm. I think I understand the problem now.
+Cc Oleg Nesterov.
On 11/02, Alexey Gladkov wrote:
+Cc Oleg Nesterov.
Well, I tend to agree with Roman and his patch looks good to me.
But it seems that the change in inc_rlimit_get_ucounts() can be a bit simpler and more readable, see below.
Oleg. ---
--- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); }
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type, + bool override_rlimit) { /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -320,7 +321,8 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) goto unwind; if (iter == ucounts) ret = new; - max = get_userns_rlimit_max(iter->ns, type); + if (!override_rlimit) + max = get_userns_rlimit_max(iter->ns, type); /* * Grab an extra ucount reference for the caller when * the rlimit count was previously 0.
On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
On 11/02, Alexey Gladkov wrote:
+Cc Oleg Nesterov.
Well, I tend to agree with Roman and his patch looks good to me.
Thanks, Oleg!
But it seems that the change in inc_rlimit_get_ucounts() can be a bit simpler and more readable, see below.
Eric suggested the same approach earlier in this thread. I personally don't have a strong preference here or actually I slightly prefer my own version because this comparison to LONG_MAX looks confusing to me. But if you have a strong preference, I'm happy to send out v2. Please, let me know.
Thanks!
On 11/04, Roman Gushchin wrote:
On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
But it seems that the change in inc_rlimit_get_ucounts() can be a bit simpler and more readable, see below.
Eric suggested the same approach earlier in this thread.
Ah, good, I didn't know ;)
I personally don't have a strong preference here or actually I slightly prefer my own version because this comparison to LONG_MAX looks confusing to me. But if you have a strong preference, I'm happy to send out v2. Please, let me know.
Well, I won't insist.
To me the change proposed by Eric and me looks much more readable, but of course this is subjective.
But you know, you can safely ignore me. Alexey and Eric understand this code much better, so I leave this to you/Alexey/Eric.
Oleg.
On Mon, Nov 04, 2024 at 07:44:43PM +0100, Oleg Nesterov wrote:
On 11/04, Roman Gushchin wrote:
On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
But it seems that the change in inc_rlimit_get_ucounts() can be a bit simpler and more readable, see below.
Eric suggested the same approach earlier in this thread.
Ah, good, I didn't know ;)
I personally don't have a strong preference here or actually I slightly prefer my own version because this comparison to LONG_MAX looks confusing to me. But if you have a strong preference, I'm happy to send out v2. Please, let me know.
Well, I won't insist.
To me the change proposed by Eric and me looks much more readable, but of course this is subjective.
But you know, you can safely ignore me. Alexey and Eric understand this code much better, so I leave this to you/Alexey/Eric.
Personally, I like Oleg's patch more.
On Mon, Nov 04, 2024 at 08:02:35PM +0100, Alexey Gladkov wrote:
On Mon, Nov 04, 2024 at 07:44:43PM +0100, Oleg Nesterov wrote:
On 11/04, Roman Gushchin wrote:
On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
But it seems that the change in inc_rlimit_get_ucounts() can be a bit simpler and more readable, see below.
Eric suggested the same approach earlier in this thread.
Ah, good, I didn't know ;)
I personally don't have a strong preference here or actually I slightly prefer my own version because this comparison to LONG_MAX looks confusing to me. But if you have a strong preference, I'm happy to send out v2. Please, let me know.
Well, I won't insist.
To me the change proposed by Eric and me looks much more readable, but of course this is subjective.
But you know, you can safely ignore me. Alexey and Eric understand this code much better, so I leave this to you/Alexey/Eric.
Personally, I like Oleg's patch more.
Ok, I'll send out a v2 shortly.
Thanks!
On Thu, Oct 31, 2024 at 08:04:38PM +0000, Roman Gushchin wrote:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin roman.gushchin@linux.dev Co-developed-by: Andrei Vagin avagin@google.com Signed-off-by: Andrei Vagin avagin@google.com Cc: Kees Cook kees@kernel.org Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Alexey Gladkov legion@kernel.org Cc: stable@vger.kernel.org
include/linux/user_namespace.h | 3 ++- kernel/signal.c | 3 ++- kernel/ucount.c | 5 +++-- 3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 3625096d5f85..7183e5aca282 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit);
void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type); bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max); diff --git a/kernel/signal.c b/kernel/signal.c index 4344860ffcac..cbabb2d05e0a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
rcu_read_unlock(); if (!sigpending) return NULL;override_rlimit);
diff --git a/kernel/ucount.c b/kernel/ucount.c index 16c0ea1cb432..046b3d57ebb4 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); } -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit)
{ /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]);
if (new < 0 || new > max)
if (iter == ucounts) ret = new;if (new < 0 || (!override_rlimit && (new > max))) goto unwind;
It's a bad patch. If we do as you suggest, it will do_dec_rlimit_put_ucounts() in case of overflow. This means you'll break the counter and there will be an extra decrement in __sigqueue_free(). We can't just ignore the overflow here.
On Sat, Nov 02, 2024 at 12:28:38AM +0100, Alexey Gladkov wrote:
On Thu, Oct 31, 2024 at 08:04:38PM +0000, Roman Gushchin wrote:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin roman.gushchin@linux.dev Co-developed-by: Andrei Vagin avagin@google.com Signed-off-by: Andrei Vagin avagin@google.com Cc: Kees Cook kees@kernel.org Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Alexey Gladkov legion@kernel.org Cc: stable@vger.kernel.org
include/linux/user_namespace.h | 3 ++- kernel/signal.c | 3 ++- kernel/ucount.c | 5 +++-- 3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 3625096d5f85..7183e5aca282 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit);
void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type); bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max); diff --git a/kernel/signal.c b/kernel/signal.c index 4344860ffcac..cbabb2d05e0a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
rcu_read_unlock(); if (!sigpending) return NULL;override_rlimit);
diff --git a/kernel/ucount.c b/kernel/ucount.c index 16c0ea1cb432..046b3d57ebb4 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); } -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit)
{ /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]);
if (new < 0 || new > max)
if (iter == ucounts) ret = new;if (new < 0 || (!override_rlimit && (new > max))) goto unwind;
It's a bad patch. If we do as you suggest, it will do_dec_rlimit_put_ucounts() in case of overflow. This means you'll break the counter and there will be an extra decrement in __sigqueue_free(). We can't just ignore the overflow here.
Hm, I don't think my code is changing anything in terms of the overflow handling. The (new < 0) handling is exactly the same as it was, the only difference is that (new > max) is allowed if override_rlimit is set. But new physically can't be larger than LONG_MAX, so there is no actual change if the limit is LONG_MAX.
Maybe I'm missing something here, please, clarify.
Thanks!
On Fri, Nov 01, 2024 at 11:50:54PM +0000, Roman Gushchin wrote:
On Sat, Nov 02, 2024 at 12:28:38AM +0100, Alexey Gladkov wrote:
On Thu, Oct 31, 2024 at 08:04:38PM +0000, Roman Gushchin wrote:
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues.
For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications.
Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior.
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin roman.gushchin@linux.dev Co-developed-by: Andrei Vagin avagin@google.com Signed-off-by: Andrei Vagin avagin@google.com Cc: Kees Cook kees@kernel.org Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Alexey Gladkov legion@kernel.org Cc: stable@vger.kernel.org
include/linux/user_namespace.h | 3 ++- kernel/signal.c | 3 ++- kernel/ucount.c | 5 +++-- 3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 3625096d5f85..7183e5aca282 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v); -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit);
void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type); bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max); diff --git a/kernel/signal.c b/kernel/signal.c index 4344860ffcac..cbabb2d05e0a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
- sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
rcu_read_unlock(); if (!sigpending) return NULL;override_rlimit);
diff --git a/kernel/ucount.c b/kernel/ucount.c index 16c0ea1cb432..046b3d57ebb4 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type) do_dec_rlimit_put_ucounts(ucounts, NULL, type); } -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
bool override_rlimit)
{ /* Caller must hold a reference to ucounts */ struct ucounts *iter; @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type) for (iter = ucounts; iter; iter = iter->ns->ucounts) { long new = atomic_long_add_return(1, &iter->rlimit[type]);
if (new < 0 || new > max)
if (iter == ucounts) ret = new;if (new < 0 || (!override_rlimit && (new > max))) goto unwind;
It's a bad patch. If we do as you suggest, it will do_dec_rlimit_put_ucounts() in case of overflow. This means you'll break the counter and there will be an extra decrement in __sigqueue_free(). We can't just ignore the overflow here.
Hm, I don't think my code is changing anything in terms of the overflow handling. The (new < 0) handling is exactly the same as it was, the only difference is that (new > max) is allowed if override_rlimit is set. But new physically can't be larger than LONG_MAX, so there is no actual change if the limit is LONG_MAX.
Maybe I'm missing something here, please, clarify.
I re-read your patch one more time. Sorry. Yes, you're right, i am wrong. You're just allow overlimit.
But one thing confuses me.
Now the maximum rlimits of the upper userns are being forced. Changing rlimit to RLIM_INFINITY affects only the current userns and child userns.
But after this patch, this is not the case for RLIMIT_SIGPENDING and within userns it is possible to ignore the restrictions of upper-level userns which ruins the whole idea.
I agree with Eric. If you don't need upper-level userns limits, you don't need to set them.
linux-stable-mirror@lists.linaro.org