On Wed, Sep 03, 2025 at 08:38:03PM +0000, Tom Hromatka wrote:
+static long seccomp_clone_filter(void __user *upidfd) +{
- struct task_struct *task;
- unsigned int flags;
- pid_t pidfd;
- if (!capable(CAP_SYS_ADMIN))
return -EPERM;
OK...
- if (atomic_read(¤t->seccomp.filter_count) > 0)
return -EINVAL;
If it's atomic, then presumably there's something that can change it asynchronously, right? If so, what's there to prevent invalidation of the result of this test right after you've decided everything's fine?
Let's check... ; git grep -n -w filter_count <64 lines of output, most clearly unrelated to that> ; git grep -n -w -c filter_count drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c:1 drivers/net/ethernet/intel/i40e/i40e_common.c:18 drivers/net/ethernet/intel/i40e/i40e_prototype.h:4 drivers/net/ethernet/qlogic/qede/qede_filter.c:13 drivers/net/ipa/ipa.h:2 drivers/net/ipa/ipa_cmd.c:1 drivers/net/ipa/ipa_table.c:6 fs/proc/array.c:1 include/linux/seccomp_types.h:2 init/init_task.c:1 kernel/seccomp.c:3 lib/kunit/executor.c:7 lib/kunit/executor_test.c:5
Sod drivers and lib/kunit, these are irrelevant. Removing those hits yields this: ; git grep -n -w filter_count|grep -v '[^dl]' fs/proc/array.c:340: atomic_read(&p->seccomp.filter_count)); include/linux/seccomp_types.h:15: * @filter_count: number of seccomp filters include/linux/seccomp_types.h:24: atomic_t filter_count; init/init_task.c:208: .seccomp = { .filter_count = ATOMIC_INIT(0) }, kernel/seccomp.c:631: atomic_set(&thread->seccomp.filter_count, kernel/seccomp.c:632: atomic_read(&caller->seccomp.filter_count)); kernel/seccomp.c:932: atomic_inc(¤t->seccomp.filter_count);
Aha. We have a reader in fs/proc/array.c, definition of that thing in seccomp_types.h, initialization in init_task.c and two places in seccomp.c, one around line 631 copying the value from one thread to another (seccomp_sync_threads()) and one at line 932 incrementing the damn thing (seccomp_attach_filter()).
Humm... OK, former is copying the filter_count of current (caller is set to current there) to other threads in the same thread group and apparently that's serialized on ->signal->cred_guard_mutex of that bunch, as well as on current->sighand->siglock (and since all threads in the group are going to share ->sighand, it's the same thing as their ->sighand->siglock).
The latter increments that thing for current, under ->sighand->siglock.
So * atomic_t for filter_count looks like cargo-culting (and unless I'm missing something, the only thing that cares about it is /proc/*/status; rudiment of some sort?) * looks like the test can be invalidated by another thread hitting that seccomp_sync_threads() thing (from a quick look, SECCOMP_SET_MODE_FILTER with SECCOMP_FILTER_FLAG_TSYNC in flags).
- if (copy_from_user(&pidfd, upidfd, sizeof(pid_t)))
return -EFAULT;
OK...
- task = pidfd_get_task(pidfd, &flags);
- if (IS_ERR(task))
return -ESRCH;
OK...
- spin_lock_irq(¤t->sighand->siglock);
- spin_lock_irq(&task->sighand->siglock);
WTF? You are apparently trying to lock both the current and the task you want to copy from, but... you are nesting two locks of the same sort here, with not even preventing the self-deadlock (task and current sharing ->sighand - e.g. by belonging to the same thread group), let alone preventing the same from two threads trying to take the same couple of locks in the opposite orders.
More to the point, why do you need both at once?
- if (atomic_read(&task->seccomp.filter_count) == 0) {
OK... from the earlier digging it looks like this actually stands for "if task has no filter attached, piss off - nothing to copy".
spin_unlock_irq(&task->sighand->siglock);
spin_unlock_irq(¤t->sighand->siglock);
put_task_struct(task);
return -EINVAL;
- }
- get_seccomp_filter(task);
Umm... void get_seccomp_filter(struct task_struct *tsk) { struct seccomp_filter *orig = tsk->seccomp.filter; if (!orig) return; __get_seccomp_filter(orig); refcount_inc(&orig->users); } and static void __get_seccomp_filter(struct seccomp_filter *filter) { refcount_inc(&filter->refs); }
So you are taking task->seccomp.filter and bumping refcounts on it, presumably allowing to unlock the task->sighand->siglock?
- current->seccomp = task->seccomp;
wait, what? You are copying all fields at once, but... from the look at what seccomp_sync_threads() was doing, it not quite that simple. OK, atomic for filter_count is worthless, so plain copy will do, but what about ->seccomp.filter? This /* Make our new filter tree visible. */ smp_store_release(&thread->seccomp.filter, caller->seccomp.filter); is potentially more serious. Granted, in this case we are doing store to our own thread's ->seccomp.filter, so the barrier implications might be unimportant - if all reads are either under ->sighand->siglock or done to current->seccomp.filter, we should be fine, but that needs to be verified _AND_ commented upon. Memory barriers are subtle enough...
Looks like the only lockless reader is struct seccomp_filter *f = READ_ONCE(current->seccomp.filter); in seccomp_run_filters(), so unless I've missed something (and "filter" is not a search-friendly name ;-/) we should be fine; that READ_ONCE() is there to handle *other* threads doing stores (with that smp_store_release() in seccomp_sync_threads()). Incidentally, this if (!lock_task_sighand(task, &flags)) return -ESRCH;
f = READ_ONCE(task->seccomp.filter); in proc_pid_seccomp_cache() looks cargo-culted - it's *not* a lockless access, so this READ_ONCE() is confusing.
Anyway, that copying needs a comment. What's more, this __seccomp_filter_release(thread->seccomp.filter); just prior to smp_store_release() is more serious - it drops the old reference. Yes, you count upon no old reference being there - that's what your check of current->seccomp.filter_count was supposed to guarantee, but... it could've appeared after the test.
- spin_unlock_irq(&task->sighand->siglock);
OK, finally unlocked the source...
- set_task_syscall_work(current, SECCOMP);
... marked current as "we have filters"
- spin_unlock_irq(¤t->sighand->siglock);
... and unlocked the current.
So basically you have
verify that current has no filters lock current lock source verify that source has filters grab reference to that store it in current, assuming that it still has no filters unlock source mark current as having filters unlock current
For one thing, the first check is done before we locked current, making it racy. For another, this "hold two locks at once" is asking for trouble - it could be dealt with, but do we really need both at once? We do need the source locked for checking if it has filters and grabbing a reference, but we don't need current locked for that - the only thing this lock would give is protection against filters appearing, but it's done too late to guarantee that. And the lock on source shouldn't be needed after we'd got its filters and grabbed the reference. So... something along the lines of
lock source verify that source has filters grab reference to that store it in local variable, along with filter_count and mode unlock source lock current verify that current has no filters copy the stuff we'd stashed into our local variabl to current mark current as having filters unlock current
That would avoid all problems with nested locks, by virtue of never taking more than one at a time, but now we grab the reference(s) to source filters before checking that current has none. Which means that we need to undo that on failure. From the way an old reference is dropped by seccomp_sync_threads(), that would be __seccomp_filter_release(filters)...
So something like this: spin_lock_irq(&task->sighand->siglock); if (atomic_read(&task->seccomp.filter_count) == 0) { spin_unlock_irq(&task->sighand->siglock); put_task_struct(task); return -EINVAL; } get_seccomp_filter(task); new_seccomp = task->seccomp; spin_unlock_irq(&task->sighand->siglock); spin_lock_irq(¤t->sighand->siglock); if (atomic_read(¤t->seccomp.filter_count) > 0) { spin_unlock_irq(¤t->sighand->siglock); __seccomp_filter_release(new_seccomp.filter); put_task_struct(task); return -EINVAL; } // no barriers - only current->seccomp.filter is read locklessly current->seccomp = new_seccomp; set_task_syscall_work(current, SECCOMP); spin_unlock_irq(¤t->sighand->siglock); put_task_struct(task); return 0;
and I would suggest asking whoever had come up with that atomic for filter_count if it's needed (or ever had been, for that matter). Who was it, anyway? Kees, unless I'm misreading the history, and AFAICS on Cc in this thread, so...
Kees, is there any reason not to make it a plain int? And what is that READ_ONCE() doing in proc_pid_seccomp_cache(), while we are at it... That's 0d8315dddd28 "seccomp/cache: Report cache data through /proc/pid/seccomp_cache", by YiFei Zhu yifeifz2@illinois.edu, AFAICS. Looks like YiFei Zhu zhuyifei@google.com is the current address [Cc'd]...