On 11/11/25 10:21, Christian Brauner wrote:
On Wed, Nov 05, 2025 at 03:32:10PM +0100, Oleg Nesterov wrote:
I am still thinking about another approach, will write another email. But let me take a closer look at your patch.
First of all, can you split it? See below.
On 08/21, Bernd Edlinger wrote:
-static int de_thread(struct task_struct *tsk) +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; spinlock_t *lock = &oldsighand->siglock;
struct task_struct *t;
bool unsafe_execve_in_progress = false;
if (thread_group_empty(tsk)) goto no_thread_group;
@@ -932,6 +934,19 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) sig->notify_count--;
- for_other_threads(tsk, t) {
if (unlikely(t->ptrace)&& (t != tsk->group_leader || !t->exit_state))unsafe_execve_in_progress = true;you can add "break" into the "if ()" block...
But this is minor. Why do we need "bool unsafe_execve_in_progress" ? If this patch is correct, de_thread() can drop/reacquire cred_guard_mutex unconditionally.
If you really think it makes sense, please make another patch with the changelog.
I'd certainly prefer to avoid this boolean at least for the start. If nothing else to catch the potential problems earlier.
- if (unlikely(unsafe_execve_in_progress)) {
spin_unlock_irq(lock);sig->exec_bprm = bprm;mutex_unlock(&sig->cred_guard_mutex);spin_lock_irq(lock);I don't think spin_unlock_irq() + spin_lock_irq() makes any sense...
@@ -1114,13 +1139,31 @@ int begin_new_exec(struct linux_binprm * bprm) */ trace_sched_prepare_exec(current, bprm);
- /* If the binary is not readable then enforce mm->dumpable=0 */
- would_dump(bprm, bprm->file);
- if (bprm->have_execfd)
would_dump(bprm, bprm->executable);- /*
* Figure out dumpability. Note that this checking only of current* is wrong, but userspace depends on it. This should be testing* bprm->secureexec instead.*/- if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
is_dumpability_changed(current_cred(), bprm->cred) ||!(uid_eq(current_euid(), current_uid()) &&gid_eq(current_egid(), current_gid())))set_dumpable(bprm->mm, suid_dumpable);- else
set_dumpable(bprm->mm, SUID_DUMP_USER);OK, we need to do this before de_thread() drops cred_guard_mutex. But imo this too should be done in a separate patch, the changelog should explain this change.
@@ -1361,6 +1387,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm) if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex)) return -ERESTARTNOINTR;
- if (unlikely(current->signal->exec_bprm)) {
mutex_unlock(¤t->signal->cred_guard_mutex);return -ERESTARTNOINTR;- }
OK, if signal->exec_bprm != NULL, then current is already killed. But proc_pid_attr_write() and ptrace_traceme() do the same. So how about something like
int lock_current_cgm(void) { if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex)) return -ERESTARTNOINTR;
if (!current->signal->group_exec_task) return 0; WARN_ON(!fatal_signal_pending(current)); mutex_unlock(¤t->signal->cred_guard_mutex); return -ERESTARTNOINTR;}
?
Note that it checks ->group_exec_task, not ->exec_bprm. So this change can come in a separate patch too, but I won't insist.
@@ -453,6 +454,28 @@ static int ptrace_attach(struct task_struct *task, long request, return retval; }
if (unlikely(task == task->signal->group_exec_task)) {retval = down_write_killable(&task->signal->exec_update_lock);if (retval)return retval;scoped_guard (task_lock, task) {struct linux_binprm *bprm = task->signal->exec_bprm;const struct cred __rcu *old_cred = task->real_cred;struct mm_struct *old_mm = task->mm;rcu_assign_pointer(task->real_cred, bprm->cred);task->mm = bprm->mm;retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);rcu_assign_pointer(task->real_cred, old_cred);task->mm = old_mm;}This is the most problematic change which I can't review...
Firstly, it changes task->mm/real_cred for __ptrace_may_access() and this looks dangerous to me.
Yeah, that is not ok. This is effectively override_creds for real_cred and that is not a pattern I want to see us establish at all! Temporary credential overrides for the subjective credentials is already terrible but at least we have the explicit split between real_cred and cred expressely for that. So no, that's not an acceptable solution.
Well when this is absolutely not acceptable then I would have to change all security engines to be aware of the current and the new credentials. That may be as well be possible but would be a rather big change. Of course that was only meant as a big exception, and somehow safe as long as it is protected under the right mutexes: cred_guard_mutex, exec_update_lock and task_lock at least.
Say, current_is_single_threaded() called by another CLONE_VM process can miss group_exec_task and falsely return true. Probably not that bad, in this case old_mm should go away soon, but still...
And I don't know if this can fool the users of task_cred_xxx/__task_cred somehow.
Or. check_unsafe_exec() sets LSM_UNSAFE_PTRACE if ptrace. Is it safe to ptrace the execing task after that? I have no idea what the security hooks can do...
Again, can't review this part.
Never mind, your review was really helpful. At the very least it pointed out some places where better comments are needed.
Thanks Bernd.
Oleg.