If WAIT_KILLABLE_RECV was specified, and an event is received, the tracee's syscall is not supposed to be interruptible. This was not properly ensured if the reply was sent too fast, and an interrupting signal was received before the reply was processed on the tracee side.
This series fixes the bug and adds a test case for it to the selftests.
Signed-off-by: Johannes Nixdorf johannes@nixdorf.dev
---
Changes in v2:
- Added a selftest for the bug.
- Link to v1: https://lore.kernel.org/r/20250723-seccomp-races-v1-1-bef5667ce30a@nixdorf.d...

---
Johannes Nixdorf (2):
      seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast
      selftests/seccomp: Add a test for the WAIT_KILLABLE_RECV fast reply race

 kernel/seccomp.c                              |  13 ++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 130 ++++++++++++++++++++++++++
 2 files changed, 136 insertions(+), 7 deletions(-)
---
base-commit: 89be9a83ccf1f88522317ce02f854f30d6115c41
change-id: 20250721-seccomp-races-e97897d6d94b
Best regards,
Normally the tracee starts in SECCOMP_NOTIFY_INIT, sends an event to the tracer, and starts to wait interruptibly. With SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, if the tracer receives the message (SECCOMP_NOTIFY_SENT is reached) while the tracee was waiting and is subsequently interrupted, the tracee begins to wait again uninterruptibly (but killable).
This fails if SECCOMP_NOTIFY_REPLIED is reached before the tracee is interrupted, as the check only considered SECCOMP_NOTIFY_SENT as a condition to begin waiting again. In this case the tracee is interrupted even though the tracer already acted on its behalf. This breaks the assumption SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV wanted to ensure, namely that the tracer can be sure the syscall is not interrupted or restarted on the tracee after it is received on the tracer. Fix this by also considering SECCOMP_NOTIFY_REPLIED when evaluating whether to switch to uninterruptible waiting.
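The effect of the changed condition can be illustrated with a small userspace model. This is only a sketch: the enum and function names below are hypothetical stand-ins, and the only assumption taken from the kernel is that the states are ordered INIT < SENT < REPLIED, which the ">=" comparison in the fix relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's notification states, in order. */
enum model_notify_state {
	MODEL_NOTIFY_INIT,
	MODEL_NOTIFY_SENT,
	MODEL_NOTIFY_REPLIED,
};

/* Old check: only a received but still unanswered notification counts. */
bool sleep_killable_old(bool wait_killable_recv, enum model_notify_state state)
{
	return wait_killable_recv && state == MODEL_NOTIFY_SENT;
}

/* Fixed check: a notification that was already answered counts as well. */
bool sleep_killable_new(bool wait_killable_recv, enum model_notify_state state)
{
	return wait_killable_recv && state >= MODEL_NOTIFY_SENT;
}

/* Walk through the three states with the flag set. */
void check_sleep_killable(void)
{
	assert(!sleep_killable_old(true, MODEL_NOTIFY_INIT));
	assert(!sleep_killable_new(true, MODEL_NOTIFY_INIT));
	assert(sleep_killable_old(true, MODEL_NOTIFY_SENT));
	assert(sleep_killable_new(true, MODEL_NOTIFY_SENT));
	/* The racy case: the reply already arrived when the signal hits. */
	assert(!sleep_killable_old(true, MODEL_NOTIFY_REPLIED));
	assert(sleep_killable_new(true, MODEL_NOTIFY_REPLIED));
}
```

The two variants differ only in the REPLIED row, which is exactly the window in which the tracee could previously be interrupted.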
With the condition changed the loop in seccomp_do_user_notification() would exit immediately after deciding that noninterruptible waiting is required if the operation already reached SECCOMP_NOTIFY_REPLIED, skipping the code that processes pending addfd commands first. Prevent this by executing the remaining loop body one last time in this case.
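Why a plain "continue" is not enough here can be shown with a simplified model of the wait loop. This is a sketch under assumptions, not the kernel code: struct loop_model, run_wait_loop() and the fall_through parameter are hypothetical, the first wait is assumed to be interrupted by a signal, and the tracer is assumed to have already replied with one addfd request still queued.

```c
#include <assert.h>
#include <stdbool.h>

enum loop_state { LOOP_INIT, LOOP_SENT, LOOP_REPLIED };

/* Hypothetical, simplified model of one notification. */
struct loop_model {
	enum loop_state state;
	int pending_addfd;	/* queued addfd requests not yet handled */
	int handled_addfd;	/* addfd requests the tracee processed */
};

/*
 * Model of the wait loop where the first wait is interrupted.
 * fall_through == false: "continue" after deciding to wait killable.
 * fall_through == true:  run the rest of the loop body once more.
 * Returns the number of handled addfd requests, or -1 if interrupted.
 */
int run_wait_loop(struct loop_model *n, bool fall_through)
{
	bool first_wait = true;

	do {
		int err = first_wait ? -1 : 0;	/* only the first wait fails */

		first_wait = false;
		if (err != 0) {
			if (n->state < LOOP_SENT)
				return -1;	/* not received yet: interruptible */
			if (!fall_through)
				continue;	/* jumps straight to the loop condition */
		}

		if (n->pending_addfd > 0) {
			n->pending_addfd--;
			n->handled_addfd++;
		}
	} while (n->state != LOOP_REPLIED);

	return n->handled_addfd;
}
```

In a do/while loop, "continue" evaluates the loop condition next, so with the state already at REPLIED the model returns 0 handled requests when fall_through is false and 1 when it is true.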
Fixes: c2aa2dfef243 ("seccomp: Add wait_killable semantic to seccomp user notifier")
Reported-by: Ali Polatel alip@chesswob.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220291
Signed-off-by: Johannes Nixdorf johannes@nixdorf.dev
---
 kernel/seccomp.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 41aa761c7738cefe01ca755f78f12844d7186e2a..fa44bcb6aa47df88bdc5951217d99779bd56ab70 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1139,7 +1139,7 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
 static bool should_sleep_killable(struct seccomp_filter *match,
 				  struct seccomp_knotif *n)
 {
-	return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
+	return match->wait_killable_recv && n->state >= SECCOMP_NOTIFY_SENT;
 }
 
 static int seccomp_do_user_notification(int this_syscall,
@@ -1186,13 +1186,12 @@ static int seccomp_do_user_notification(int this_syscall,
 
 		if (err != 0) {
 			/*
-			 * Check to see if the notifcation got picked up and
-			 * whether we should switch to wait killable.
+			 * Check to see whether we should switch to wait
+			 * killable. Only return the interrupted error if not.
 			 */
-			if (!wait_killable && should_sleep_killable(match, &n))
-				continue;
-
-			goto interrupted;
+			if (!(!wait_killable && should_sleep_killable(match,
+								       &n)))
+				goto interrupted;
 		}
 
 		addfd = list_first_entry_or_null(&n.addfd,
If WAIT_KILLABLE_RECV was specified, and an event is received, the tracee's syscall is not supposed to be interruptible. This was not properly ensured if the reply was sent too fast, and an interrupting signal was received before the reply was processed on the tracee side.
Add a test for this, that consists of:
- a tracee with a timer that keeps sending it signals while repeatedly
  running a traced syscall in a loop,
- a tracer that repeatedly handles all syscalls from the tracee in a
  loop, and
- a shared pipe between both, on which the tracee sends one byte per
  syscall attempted and the tracer reads one byte per syscall handled.
If the syscall for the tracee is restarted after the tracer received the event for it due to this bug, the tracee will not have sent a second token on the pipe, which the tracer will notice and fail the test.
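The token accounting can be sketched as a small sequential simulation. This is a hypothetical model rather than the selftest itself: first_starved_iteration() and its parameters are made up for illustration, and tracee and tracer are assumed to take strict turns.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns the tracer iteration at which no token is available on the pipe,
 * or max_iter if the tracer never starves. restart_at is the tracee
 * iteration whose syscall is wrongly restarted after RECV; pass -1 for
 * "never restarted".
 */
int first_starved_iteration(int max_iter, int restart_at)
{
	int tokens = 0;			/* bytes currently in the pipe */
	int tracee_iter = 0;		/* syscalls the tracee has completed */
	bool tracee_in_syscall = false;
	int i;

	for (i = 0; i < max_iter; i++) {
		/* Tracee: write one token, then enter the traced syscall. */
		if (!tracee_in_syscall && tracee_iter < max_iter) {
			tokens++;
			tracee_in_syscall = true;
		}

		/* Tracer: takes one token before handling a notification. */
		if (tokens == 0)
			return i;
		tokens--;

		if (tracee_iter == restart_at) {
			/*
			 * Bug: the syscall restarts and raises a second
			 * notification, but no second token is written.
			 */
			restart_at = -1;
		} else {
			tracee_in_syscall = false;
			tracee_iter++;
		}
	}

	return max_iter;
}
```

Without a restart the tracer finds a token in every iteration. With a restart in tracee iteration k, the tracer starves in its iteration k + 1, which corresponds to the poll() timeout in the test.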
The test also uses SECCOMP_IOCTL_NOTIF_ADDFD with SECCOMP_ADDFD_FLAG_SEND for the reply, as the fix for the bug has an additional code path change for handling addfd, which would not be exercised by a simple SECCOMP_IOCTL_NOTIF_SEND, and it is possible to fix the bug while leaving the same race intact for the addfd case.
This test is not guaranteed to reproduce the bug on every run, but the parameters (signal frequency and number of repeated syscalls) have been chosen so that on my machine this test:
- takes ~0.8s in the good case (+1s in the failure case), and
- detects the bug in 999 of 1000 runs.
Signed-off-by: Johannes Nixdorf johannes@nixdorf.dev
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 130 ++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaaf87b180c8dff2324a02282356fcd..b24d0cbe88b4499a7635c6a075bfc6a660409792 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3547,6 +3547,10 @@ static void signal_handler(int signal)
 		perror("write from signal");
 }
 
+static void signal_handler_nop(int signal)
+{
+}
+
 TEST(user_notification_signal)
 {
 	pid_t pid;
@@ -4819,6 +4823,132 @@ TEST(user_notification_wait_killable_fatal)
 	EXPECT_EQ(SIGTERM, WTERMSIG(status));
 }
 
+/* Ensure signals after the reply do not interrupt */
+TEST(user_notification_wait_killable_after_reply)
+{
+	int i, max_iter = 100000;
+	int listener, status;
+	int pipe_fds[2];
+	pid_t pid;
+	long ret;
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret)
+	{
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	listener = user_notif_syscall(
+		__NR_dup, SECCOMP_FILTER_FLAG_NEW_LISTENER |
+				  SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV);
+	ASSERT_GE(listener, 0);
+
+	/*
+	 * Used to count invocations. One token is transferred from the child
+	 * to the parent per syscall invocation, the parent tries to take
+	 * one token per successful RECV. If the syscall is restarted after
+	 * RECV the parent will try to get two tokens while the child only
+	 * provided one.
+	 */
+	ASSERT_EQ(pipe(pipe_fds), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		struct sigaction new_action = {
+			.sa_handler = signal_handler_nop,
+			.sa_flags = SA_RESTART,
+		};
+		struct itimerval timer = {
+			.it_value = { .tv_usec = 1000 },
+			.it_interval = { .tv_usec = 1000 },
+		};
+		char c = 'a';
+
+		close(pipe_fds[0]);
+
+		/* Setup the sigaction with SA_RESTART */
+		if (sigaction(SIGALRM, &new_action, NULL)) {
+			perror("sigaction");
+			exit(1);
+		}
+
+		/*
+		 * Kill with SIGALRM repeatedly, to try to hit the race when
+		 * handling the syscall.
+		 */
+		if (setitimer(ITIMER_REAL, &timer, NULL) < 0)
+			perror("setitimer");
+
+		for (i = 0; i < max_iter; ++i) {
+			int fd;
+
+			/* Send one token per iteration to catch repeats. */
+			if (write(pipe_fds[1], &c, sizeof(c)) != 1) {
+				perror("write");
+				exit(1);
+			}
+
+			fd = syscall(__NR_dup, 0);
+			if (fd < 0) {
+				perror("dup");
+				exit(1);
+			}
+			close(fd);
+		}
+
+		exit(0);
+	}
+
+	close(pipe_fds[1]);
+
+	for (i = 0; i < max_iter; ++i) {
+		struct seccomp_notif req = {};
+		struct seccomp_notif_addfd addfd = {};
+		struct pollfd pfd = {
+			.fd = pipe_fds[0],
+			.events = POLLIN,
+		};
+		char c;
+
+		/*
+		 * Try to receive one token. If it failed, one child syscall
+		 * was restarted after RECV and needed to be handled twice.
+		 */
+		ASSERT_EQ(poll(&pfd, 1, 1000), 1)
+			kill(pid, SIGKILL);
+
+		ASSERT_EQ(read(pipe_fds[0], &c, sizeof(c)), 1)
+			kill(pid, SIGKILL);
+
+		/*
+		 * Get the notification, reply to it as fast as possible to test
+		 * whether the child wrongly skips going into the non-preemptible
+		 * (TASK_KILLABLE) state.
+		 */
+		do
+			ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req);
+		while (ret < 0 && errno == ENOENT); /* Accept interruptions before RECV */
+		ASSERT_EQ(ret, 0)
+			kill(pid, SIGKILL);
+
+		addfd.id = req.id;
+		addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+		addfd.srcfd = 0;
+		ASSERT_GE(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), 0)
+			kill(pid, SIGKILL);
+	}
+
+	/*
+	 * Wait for the process to exit, and make sure the process terminated
+	 * with a zero exit code..
+	 */
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 struct tsync_vs_thread_leader_args {
 	pthread_t leader;
 };