Re: [PATCH] seccomp: Add SECCOMP_CLONE_FILTER operation

4 Sep 2025

On Wed, Sep 03, 2025 at 08:38:03PM +0000, Tom Hromatka wrote:
...
Add an operation, SECCOMP_CLONE_FILTER, that can copy the seccomp filters
from another process to the current process.
I roughly reproduced the Docker seccomp filter [1] and timed how long it
takes to build it (via libseccomp) and attach it to a process.  After
1000 runs, on average it took 3,740,000 TSC ticks (or ~1440 microseconds)
on an AMD EPYC 9J14 running at 2596 MHz.  The median build/load time was
3,715,000 TSC ticks.
On the same system, I preloaded the above Docker seccomp filter onto a
process.  (Note that I opened a pidfd to the reference process and left
the pidfd open for the entire run.)  I then cloned the filter using the
feature in this patch to 1000 new processes.  On average, it took 9,300
TSC ticks (or ~3.6 microseconds) to copy the filter to the new processes.
The median clone time was 9,048 TSC ticks.
This is approximately a 400x performance improvement for those container
managers that are using the exact same seccomp filter across all of their
containers.
This is a nice speedup, but with devil's advocate hat on, are launchers
spawning at rates high enough that this makes a difference?
...
[1] https://raw.githubusercontent.com/moby/moby/refs/heads/master/profiles/secco...
Signed-off-by: Tom Hromatka tom.hromatka@oracle.com
.../userspace-api/seccomp_filter.rst          |  10 ++
 include/uapi/linux/seccomp.h                  |   1 +
 kernel/seccomp.c                              |  48 ++++++
 samples/seccomp/Makefile                      |   2 +-
 samples/seccomp/clone-filter.c                | 143 ++++++++++++++++++
 tools/include/uapi/linux/seccomp.h            |   1 +
 tools/testing/selftests/seccomp/seccomp_bpf.c |  71 +++++++++
 7 files changed, 275 insertions(+), 1 deletion(-)
 create mode 100644 samples/seccomp/clone-filter.c

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..ef1797d093f6 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,16 @@ above in this document: all arguments being read from the tracee's memory
 should be read into the tracer's memory before any policy decisions are made.
 This allows for an atomic decision on syscall arguments.
 
+Cloning an Existing Seccomp Filter
+==================================



+Constructing and loading a complex seccomp filter can often take a non-trivial
+amount of time. If a user wants to use the same seccomp filter across more
+than one process, it can be cloned to new processes via the
+``SECCOMP_CLONE_FILTER`` operation. Note that the clone will only succeed if
+the destination process does not have any seccomp filters already applied to
+it. See ``samples/seccomp/clone-filter.c`` for an example.
Is "does not have any seccomp filters" a reasonable expectation? I mean,
I'm fine with it, but will this feature be generally useful with that
restriction? (i.e. becomes unusable under Docker, in several systemd
contexts, etc)
...
+static long seccomp_clone_filter(void __user *upidfd)
+{

struct task_struct *task;
unsigned int flags;
pid_t pidfd;

if (!capable(CAP_SYS_ADMIN))
return -EPERM;



Again, I'm fine with this being CAP_SYS_ADMIN, but the normal seccomp
filter restriction is NNP || CAP_SYS_ADMIN. Would it be safe to do
clones with non-ADMIN? Hmmm.
...

if (atomic_read(&current->seccomp.filter_count) > 0)
return -EINVAL;



I'd wait to test this until you're under &current->sighand->siglock,
and just test current->seccomp.filter.
...


if (copy_from_user(&pidfd, upidfd, sizeof(pid_t)))
return -EFAULT;



task = pidfd_get_task(pidfd, &flags);
if (IS_ERR(task))
return -ESRCH;



Is this the right way to pass in a pidfd? Shouldn't this be an int, not
a pointer? What is idiomatic for pidfd interfaces?
...


spin_lock_irq(&current->sighand->siglock);
spin_lock_irq(&task->sighand->siglock);

As others mentioned, you can't do this. ;) I'm pretty sure you can just
take references progressively:
- task = pidfd_get_task
- mutex_lock_killable(&task->signal->cred_guard_mutex)
- spin_lock_irq(&task->sighand->siglock);
- check seccomp mode for filter mode, e.g. see seccomp_may_assign_mode()
- get_seccomp_filter(task)
- struct seccomp copy = task->seccomp (copies filter pointer, count,
  and mode)
- release siglock
- release cred_guard_mutex
- put task
Now you have the seccomp copy! :) Any errors from here mean you need
to use __seccomp_filter_release to "put" the filter, if I'm reading
things correctly. (We might have issues with USER_NOTIF, but I haven't
looked closely yet.)
And applying it should be:
- take current->signal->cred_guard_mutex
- take current->sighand->siglock
- make sure task->seccomp.filter == NULL
- current->seccomp = copy
- release siglock
- release cred_guard_mutex
I *think* the above is correct. I may have forgotten some details, but I
was mostly trying to combine TSYNC and regular application of a filter
to current.
...


if (atomic_read(&task->seccomp.filter_count) == 0) {
spin_unlock_irq(&task->sighand->siglock);


spin_unlock_irq(&current->sighand->siglock);


put_task_struct(task);


return -EINVAL;


}

get_seccomp_filter(task);
current->seccomp = task->seccomp;

spin_unlock_irq(&task->sighand->siglock);

set_task_syscall_work(current, SECCOMP);

spin_unlock_irq(&current->sighand->siglock);

put_task_struct(task);

return 0;

+}



/* Common entry point for both prctl and syscall. */
 static long do_seccomp(unsigned int op, unsigned int flags,
   	       void __user *uargs)
@@ -2102,6 +2145,11 @@ static long do_seccomp(unsigned int op, unsigned int flags,
   		return -EINVAL;
 
   	return seccomp_get_notif_sizes(uargs);

case SECCOMP_CLONE_FILTER:
if (flags != 0)


	return -EINVAL;



return seccomp_clone_filter(uargs);

default:
return -EINVAL;
}

diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
index c85ae0ed8342..d38977f41b86 100644
--- a/samples/seccomp/Makefile
+++ b/samples/seccomp/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-userprogs-always-y += bpf-fancy dropper bpf-direct user-trap
+userprogs-always-y += bpf-fancy dropper bpf-direct user-trap clone-filter
 
 bpf-fancy-objs := bpf-fancy.o bpf-helper.o
 
diff --git a/samples/seccomp/clone-filter.c b/samples/seccomp/clone-filter.c
new file mode 100644
index 000000000000..d26e1375b9dc
--- /dev/null
+++ b/samples/seccomp/clone-filter.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+/*


Seccomp filter example for cloning a filter







Copyright (c) 2025 Oracle and/or its affiliates.



Author: Tom Hromatka tom.hromatka@oracle.com







The code may be used by anyone for any purpose,



and can serve as a starting point for developing



applications that reuse the same seccomp filter



across many processes.


*/

+#include <linux/seccomp.h>
+#include <linux/filter.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <errno.h>



+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))



+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{

errno = 0;
return syscall(__NR_seccomp, op, flags, args);

+}



+static int install_filter(void)
+{

struct sock_filter deny_filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,


	offsetof(struct seccomp_data, nr)),


BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1),


BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH),


BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),


};
struct sock_fprog deny_prog = {
.len = (unsigned short)ARRAY_SIZE(deny_filter),


.filter = deny_filter,


};

return seccomp(SECCOMP_SET_MODE_FILTER, 0, &deny_prog);

+}



+static int clone_filter(pid_t ref_pid)
+{

int ref_pidfd, ret;

ref_pidfd = syscall(SYS_pidfd_open, ref_pid, 0);
if (ref_pidfd < 0)
return -errno;



ret = seccomp(SECCOMP_CLONE_FILTER, 0, &ref_pidfd);

close(ref_pidfd);

return ret;

+}



+static void do_ref_filter(void)
+{

int ret;

ret = install_filter();
if (ret) {
perror("Failed to install ref filter\n");


exit(1);


}

while (true)
sleep(1);



+}



+static void do_child_process(pid_t ref_pid)
+{

pid_t res;
int ret;

ret = clone_filter(ref_pid);
if (ret != 0) {
perror("Failed to clone filter. Installing filter from scratch\n");



ret = install_filter();


if (ret != 0) {


	perror("Filter install failed\n");


	exit(ret);


}


}

res = syscall(__NR_getpid);
if (res < 0) {
perror("getpid() unexpectedly failed\n");


exit(errno);


}

res = syscall(__NR_getppid);
if (res > 0) {
perror("getppid() unexpectedly succeeded\n");


exit(1);


}

exit(0);

+}



+int main(void)
+{

pid_t ref_pid = -1, child_pid = -1;
int ret, status;

ref_pid = fork();
if (ref_pid < 0)
exit(errno);


else if (ref_pid == 0)
do_ref_filter();



child_pid = fork();
if (child_pid < 0)
goto out;


else if (child_pid == 0)
do_child_process(ref_pid);



waitpid(child_pid, &status, 0);
if (WEXITSTATUS(status) != 0) {
perror("child process failed");


ret = WEXITSTATUS(status);


goto out;


}

ret = 0;


+out:

if (ref_pid != -1)
kill(ref_pid, SIGKILL);


if (child_pid != -1)
kill(child_pid, SIGKILL);



exit(ret);

+}
diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h
index dbfc9b37fcae..b0917e333b4b 100644
--- a/tools/include/uapi/linux/seccomp.h
+++ b/tools/include/uapi/linux/seccomp.h
@@ -16,6 +16,7 @@
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
 #define SECCOMP_GET_NOTIF_SIZES		3
+#define SECCOMP_CLONE_FILTER		4
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..df5e0f615da0 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -177,6 +177,10 @@ struct seccomp_data {
 #define SECCOMP_GET_NOTIF_SIZES 3
 #endif
 
+#ifndef SECCOMP_CLONE_FILTER
+#define SECCOMP_CLONE_FILTER 4
+#endif



#ifndef SECCOMP_FILTER_FLAG_TSYNC
 #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
 #endif
@@ -5090,6 +5094,73 @@ TEST_F(URETPROBE, uretprobe_default_block_with_uretprobe_syscall)
   ASSERT_EQ(0, run_probed_with_filter(&prog));
 }
 
+TEST(clone_filter)
Yay tests!
...
+{

struct sock_filter deny_filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,


	offsetof(struct seccomp_data, nr)),


BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1),


BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH),


BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),


};
struct sock_fprog deny_prog = {
.len = (unsigned short)ARRAY_SIZE(deny_filter),


.filter = deny_filter,


};
struct timespec ts = {
.tv_sec = 0,


.tv_nsec = 100000000,


};

pid_t child_pid, self_pid, res;
int child_pidfd, ret;

/* Only real root can copy a filter. */
if (geteuid()) {

That's not true: uid==0 is not CAP_SYS_ADMIN. :) Look in this test for:
cap_get_flag(cap, CAP_SYS_ADMIN, CAP_EFFECTIVE, &is_cap_sys_admin);
which is how to test this correctly.
...

SKIP(return, "clone_filter requires real root");


return;


}

self_pid = getpid();

child_pid = fork();
ASSERT_LE(0, child_pid);

Uh, isn't that supposed to be GE? This should have failed your test
immediately every time. Does this test pass for you? :P
...


if (child_pid == 0) {
ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));


ASSERT_EQ(0, prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &deny_prog));



I'd assert that get_ppid gives ESRCH here just for completeness.
...


while (true)


	EXPECT_EQ(0, syscall(__NR_nanosleep, &ts, NULL));


}

/* wait for the child pid to create its seccomp filter */
ASSERT_EQ(0, syscall(__NR_nanosleep, &ts, NULL));

Please look at the other tests to see how to synchronize without a
non-deterministic sleep "guess". :) It's bad form to induce needless
delays in tests.
...


child_pidfd = syscall(SYS_pidfd_open, child_pid, 0);
EXPECT_LE(0, child_pidfd);

/* Invalid flag provided */
ret = seccomp(SECCOMP_CLONE_FILTER, 1, &child_pidfd);

I'd set all the bits, not just 1.
This is also missing a EFAULT test (i.e. a NULL child_pidfd). (Though I
still want to know if this is idiomatic for pidfd interfaces -- I'd not
expect this to be passed as a pointer.)
And it's missing a ESRCH test.
And a "this is not a pidfd" test.
...

EXPECT_EQ(-1, ret);
EXPECT_EQ(errno, EINVAL);

errno = 0;
ret = seccomp(SECCOMP_CLONE_FILTER, 0, &child_pidfd);
EXPECT_EQ(0, ret);
EXPECT_EQ(errno, 0);

res = syscall(__NR_getppid);
EXPECT_EQ(res, -1);
EXPECT_EQ(errno, ESRCH);

I would validate that getppid succeeds before the CLONE_FILTER, just for
robustness.
...


res = syscall(__NR_getpid);
EXPECT_EQ(res, self_pid);

close(child_pidfd);
kill(child_pid, SIGKILL);

I'm not a fan of using "kill" for child sync in tests. I'd rather see a
blocking read or similar (so the child doesn't have to spin, even if
it's in sleep). But at the very least I'd want to see a
waitpid for this kill.
...
+}
Please also check for the "I already have a filter attached"
failure path.
...



/*

TODO:

expand NNP testing



-- 
2.47.3
Thanks for working on this! It'd be a nice speed-up for sure.
-Kees
-- 
Kees Cook


    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] seccomp: Add SECCOMP_CLONE_FILTER operation

Signed-off-by: Tom Hromatka tom.hromatka@oracle.com