On Wed, Sep 03, 2025 at 08:38:03PM +0000, Tom Hromatka wrote:
Add an operation, SECCOMP_CLONE_FILTER, that can copy the seccomp filters from another process to the current process.
I roughly reproduced the Docker seccomp filter [1] and timed how long it takes to build it (via libseccomp) and attach it to a process. After 1000 runs, on average it took 3,740,000 TSC ticks (or ~1440 microseconds) on an AMD EPYC 9J14 running at 2596 MHz. The median build/load time was 3,715,000 TSC ticks.
On the same system, I preloaded the above Docker seccomp filter onto a process. (Note that I opened a pidfd to the reference process and left the pidfd open for the entire run.) I then cloned the filter using the feature in this patch to 1000 new processes. On average, it took 9,300 TSC ticks (or ~3.6 microseconds) to copy the filter to the new processes. The median clone time was 9,048 TSC ticks.
This is approximately a 400x performance improvement for those container managers that are using the exact same seccomp filter across all of their containers.
This is a nice speedup, but with devil's advocate hat on, are launchers spawning at rates high enough that this makes a difference?
[1] https://raw.githubusercontent.com/moby/moby/refs/heads/master/profiles/secco...
Signed-off-by: Tom Hromatka tom.hromatka@oracle.com
.../userspace-api/seccomp_filter.rst | 10 ++ include/uapi/linux/seccomp.h | 1 + kernel/seccomp.c | 48 ++++++ samples/seccomp/Makefile | 2 +- samples/seccomp/clone-filter.c | 143 ++++++++++++++++++ tools/include/uapi/linux/seccomp.h | 1 + tools/testing/selftests/seccomp/seccomp_bpf.c | 71 +++++++++ 7 files changed, 275 insertions(+), 1 deletion(-) create mode 100644 samples/seccomp/clone-filter.c
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst index cff0fa7f3175..ef1797d093f6 100644 --- a/Documentation/userspace-api/seccomp_filter.rst +++ b/Documentation/userspace-api/seccomp_filter.rst @@ -289,6 +289,16 @@ above in this document: all arguments being read from the tracee's memory should be read into the tracer's memory before any policy decisions are made. This allows for an atomic decision on syscall arguments. +Cloning an Existing Seccomp Filter +==================================
+Constructing and loading a complex seccomp filter can often take a non-trivial +amount of time. If a user wants to use the same seccomp filter across more +than one process, it can be cloned to new processes via the +``SECCOMP_CLONE_FILTER`` operation. Note that the clone will only succeed if +the destination process does not have any seccomp filters already applied to +it. See ``samples/seccomp/clone-filter.c`` for an example.
Is "does not have any seccomp filters" a reasonable expectation? I mean, I'm fine with it, but will this feature be generally useful with that restriction? (i.e. becomes unusable under Docker, in several systemd contexts, etc)
+static long seccomp_clone_filter(void __user *upidfd) +{
- struct task_struct *task;
- unsigned int flags;
- pid_t pidfd;
- if (!capable(CAP_SYS_ADMIN))
return -EPERM;
Again, I'm fine with this being CAP_SYS_ADMIN, but the normal seccomp filter restriction is NNP || CAP_SYS_ADMIN. Would it be safe to do clones with non-ADMIN? Hmmm.
- if (atomic_read(¤t->seccomp.filter_count) > 0)
return -EINVAL;
I'd wait to test this until you're under ¤t->sighand->siglock, and just test current->seccomp.filter.
- if (copy_from_user(&pidfd, upidfd, sizeof(pid_t)))
return -EFAULT;
- task = pidfd_get_task(pidfd, &flags);
- if (IS_ERR(task))
return -ESRCH;
Is this the right way to pass in a pidfd? Shouldn't this be an int, not a pointer? What is idiomatic for pidfd interfaces?
- spin_lock_irq(¤t->sighand->siglock);
- spin_lock_irq(&task->sighand->siglock);
As others mentioned, you can't do this. ;) I'm pretty sure you can just take references progressively:
- task = pidfd_get_task - mutex_lock_killable(&task->signal->cred_guard_mutex) - spin_lock_irq(&task->sighand->siglock); - check seccomp mode for filter mode, e.g. see seccomp_may_assign_mode() - get_seccomp_filter(task) - struct seccomp copy = task->seccomp (copies filter pointer, count, and mode) - release siglock - release cred_guard_mutex - put task
Now you have the seccomp copy! :) Any errors from here mean you need to use __seccomp_filter_release to "put" the filter, if I'm reading things correctly. (We might have issues with USER_NOTIF, but I haven't looked closely yet.)
And applying it should be:
- take current->signal->cred_guard_mutex - take current->sighand->siglock - make sure task->seccomp.filter == NULL - current->seccomp = copy - release siglock - release cred_guard_mutex
I *think* the above is correct. I may have forgotten some details, but I was mostly trying to combine TSYNC and regular application of a filter to current.
- if (atomic_read(&task->seccomp.filter_count) == 0) {
spin_unlock_irq(&task->sighand->siglock);
spin_unlock_irq(¤t->sighand->siglock);
put_task_struct(task);
return -EINVAL;
- }
- get_seccomp_filter(task);
- current->seccomp = task->seccomp;
- spin_unlock_irq(&task->sighand->siglock);
- set_task_syscall_work(current, SECCOMP);
- spin_unlock_irq(¤t->sighand->siglock);
- put_task_struct(task);
- return 0;
+}
/* Common entry point for both prctl and syscall. */ static long do_seccomp(unsigned int op, unsigned int flags, void __user *uargs) @@ -2102,6 +2145,11 @@ static long do_seccomp(unsigned int op, unsigned int flags, return -EINVAL; return seccomp_get_notif_sizes(uargs);
- case SECCOMP_CLONE_FILTER:
if (flags != 0)
return -EINVAL;
default: return -EINVAL; }return seccomp_clone_filter(uargs);
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile index c85ae0ed8342..d38977f41b86 100644 --- a/samples/seccomp/Makefile +++ b/samples/seccomp/Makefile @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 -userprogs-always-y += bpf-fancy dropper bpf-direct user-trap +userprogs-always-y += bpf-fancy dropper bpf-direct user-trap clone-filter bpf-fancy-objs := bpf-fancy.o bpf-helper.o diff --git a/samples/seccomp/clone-filter.c b/samples/seccomp/clone-filter.c new file mode 100644 index 000000000000..d26e1375b9dc --- /dev/null +++ b/samples/seccomp/clone-filter.c @@ -0,0 +1,143 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- Seccomp filter example for cloning a filter
- Copyright (c) 2025 Oracle and/or its affiliates.
- Author: Tom Hromatka tom.hromatka@oracle.com
- The code may be used by anyone for any purpose,
- and can serve as a starting point for developing
- applications that reuse the same seccomp filter
- across many processes.
- */
+#include <linux/seccomp.h> +#include <linux/filter.h> +#include <sys/syscall.h> +#include <sys/wait.h> +#include <stdbool.h> +#include <signal.h> +#include <stddef.h> +#include <stdlib.h> +#include <unistd.h> +#include <stdio.h> +#include <errno.h>
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+static int seccomp(unsigned int op, unsigned int flags, void *args) +{
- errno = 0;
- return syscall(__NR_seccomp, op, flags, args);
+}
+static int install_filter(void) +{
- struct sock_filter deny_filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
- };
- struct sock_fprog deny_prog = {
.len = (unsigned short)ARRAY_SIZE(deny_filter),
.filter = deny_filter,
- };
- return seccomp(SECCOMP_SET_MODE_FILTER, 0, &deny_prog);
+}
+static int clone_filter(pid_t ref_pid) +{
- int ref_pidfd, ret;
- ref_pidfd = syscall(SYS_pidfd_open, ref_pid, 0);
- if (ref_pidfd < 0)
return -errno;
- ret = seccomp(SECCOMP_CLONE_FILTER, 0, &ref_pidfd);
- close(ref_pidfd);
- return ret;
+}
+static void do_ref_filter(void) +{
- int ret;
- ret = install_filter();
- if (ret) {
perror("Failed to install ref filter\n");
exit(1);
- }
- while (true)
sleep(1);
+}
+static void do_child_process(pid_t ref_pid) +{
- pid_t res;
- int ret;
- ret = clone_filter(ref_pid);
- if (ret != 0) {
perror("Failed to clone filter. Installing filter from scratch\n");
ret = install_filter();
if (ret != 0) {
perror("Filter install failed\n");
exit(ret);
}
- }
- res = syscall(__NR_getpid);
- if (res < 0) {
perror("getpid() unexpectedly failed\n");
exit(errno);
- }
- res = syscall(__NR_getppid);
- if (res > 0) {
perror("getppid() unexpectedly succeeded\n");
exit(1);
- }
- exit(0);
+}
+int main(void) +{
- pid_t ref_pid = -1, child_pid = -1;
- int ret, status;
- ref_pid = fork();
- if (ref_pid < 0)
exit(errno);
- else if (ref_pid == 0)
do_ref_filter();
- child_pid = fork();
- if (child_pid < 0)
goto out;
- else if (child_pid == 0)
do_child_process(ref_pid);
- waitpid(child_pid, &status, 0);
- if (WEXITSTATUS(status) != 0) {
perror("child process failed");
ret = WEXITSTATUS(status);
goto out;
- }
- ret = 0;
+out:
- if (ref_pid != -1)
kill(ref_pid, SIGKILL);
- if (child_pid != -1)
kill(child_pid, SIGKILL);
- exit(ret);
+} diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h index dbfc9b37fcae..b0917e333b4b 100644 --- a/tools/include/uapi/linux/seccomp.h +++ b/tools/include/uapi/linux/seccomp.h @@ -16,6 +16,7 @@ #define SECCOMP_SET_MODE_FILTER 1 #define SECCOMP_GET_ACTION_AVAIL 2 #define SECCOMP_GET_NOTIF_SIZES 3 +#define SECCOMP_CLONE_FILTER 4 /* Valid flags for SECCOMP_SET_MODE_FILTER */ #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 61acbd45ffaa..df5e0f615da0 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -177,6 +177,10 @@ struct seccomp_data { #define SECCOMP_GET_NOTIF_SIZES 3 #endif +#ifndef SECCOMP_CLONE_FILTER +#define SECCOMP_CLONE_FILTER 4 +#endif
#ifndef SECCOMP_FILTER_FLAG_TSYNC #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) #endif @@ -5090,6 +5094,73 @@ TEST_F(URETPROBE, uretprobe_default_block_with_uretprobe_syscall) ASSERT_EQ(0, run_probed_with_filter(&prog)); } +TEST(clone_filter)
Yay tests!
+{
- struct sock_filter deny_filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
- };
- struct sock_fprog deny_prog = {
.len = (unsigned short)ARRAY_SIZE(deny_filter),
.filter = deny_filter,
- };
- struct timespec ts = {
.tv_sec = 0,
.tv_nsec = 100000000,
- };
- pid_t child_pid, self_pid, res;
- int child_pidfd, ret;
- /* Only real root can copy a filter. */
- if (geteuid()) {
That's not true: uid==0 is not CAP_SYS_ADMIN. :) Look in this test for:
cap_get_flag(cap, CAP_SYS_ADMIN, CAP_EFFECTIVE, &is_cap_sys_admin);
which is how to test this correctly.
SKIP(return, "clone_filter requires real root");
return;
- }
- self_pid = getpid();
- child_pid = fork();
- ASSERT_LE(0, child_pid);
Uh, isn't that supposed to be GE? This should have failed your test immediately every time. Does this test pass for you? :P
- if (child_pid == 0) {
ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
ASSERT_EQ(0, prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &deny_prog));
I'd assert that get_ppid gives ESRCH here just for completeness.
while (true)
EXPECT_EQ(0, syscall(__NR_nanosleep, &ts, NULL));
- }
- /* wait for the child pid to create its seccomp filter */
- ASSERT_EQ(0, syscall(__NR_nanosleep, &ts, NULL));
Please look at the other tests to see how to synchronize without a non-deterministic sleep "guess". :) It's bad form to induce needless delays in tests.
- child_pidfd = syscall(SYS_pidfd_open, child_pid, 0);
- EXPECT_LE(0, child_pidfd);
- /* Invalid flag provided */
- ret = seccomp(SECCOMP_CLONE_FILTER, 1, &child_pidfd);
I'd set all the bits, not just 1.
This is also missing a EFAULT test (i.e. a NULL child_pidfd). (Though I still want to know if this is idiomatic for pidfd interfaces -- I'd not expect this to be passed as a pointer.)
And it's missing a ESRCH test.
And a "this is not a pidfd" test.
- EXPECT_EQ(-1, ret);
- EXPECT_EQ(errno, EINVAL);
- errno = 0;
- ret = seccomp(SECCOMP_CLONE_FILTER, 0, &child_pidfd);
- EXPECT_EQ(0, ret);
- EXPECT_EQ(errno, 0);
- res = syscall(__NR_getppid);
- EXPECT_EQ(res, -1);
- EXPECT_EQ(errno, ESRCH);
I would validate that getppid succeeds before the CLONE_FILTER, just for robustness.
- res = syscall(__NR_getpid);
- EXPECT_EQ(res, self_pid);
- close(child_pidfd);
- kill(child_pid, SIGKILL);
I'm not a fan of using "kill" for child sync in tests. I'd rather see a blocking read or similar (so the child doesn't have to spin, even if it's in sleep). But at the very least I'd want to see a waitpid for this kill.
+}
Please also check for the "I already have a filter attached" failure path.
/*
- TODO:
- expand NNP testing
-- 2.47.3
Thanks for working on this! It'd be a nice speed-up for sure.
-Kees