Currently the vDSO selftests use the time-related types from libc.
This works on glibc by chance today but will break with other libc
implementations or on distributions which switch to 64-bit times
everywhere.
The kernel's UAPI headers provide the proper types to use with the vDSO
(and raw syscalls) but are not necessarily compatible with libc types.
Introduce a new header which makes the UAPI headers compatible with the
libc.
Also contains some related cleanups.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Use __kernel_old_time_t in vdso_time_t.
- Add vdso_syscalls.h.
- Add a test for the time() function.
- Validate return value of syscall(clock_getres) in vdso_test_abi
- Link to v1: https://lore.kernel.org/r/20251111-vdso-test-types-v1-0-03b31f88c659@linutr…
---
Thomas Weißschuh (14):
Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
selftests: vDSO: Introduce vdso_types.h
selftests: vDSO: Introduce vdso_syscalls.h
selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
selftests: vDSO: vdso_test_gettimeofday: Use types from vdso_types.h
selftests: vDSO: vdso_test_abi: Use types from vdso_types.h
selftests: vDSO: vdso_test_abi: Validate return value of syscall(clock_getres)
selftests: vDSO: vdso_test_abi: Use system call wrappers from vdso_syscalls.h
selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
selftests: vDSO: vdso_test_correctness: Make ts_leq() and tv_leq() more generic
selftests: vDSO: vdso_test_correctness: Use types from vdso_types.h
selftests: vDSO: vdso_test_correctness: Use system call wrappers from vdso_syscalls.h
selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
selftests: vDSO: vdso_test_correctness: Add a test for time()
tools/testing/selftests/vDSO/Makefile | 6 +-
tools/testing/selftests/vDSO/parse_vdso.c | 3 +-
tools/testing/selftests/vDSO/vdso_syscalls.h | 93 ++++++++++
tools/testing/selftests/vDSO/vdso_test_abi.c | 46 +++--
.../testing/selftests/vDSO/vdso_test_correctness.c | 190 +++++++++++----------
.../selftests/vDSO/vdso_test_gettimeofday.c | 9 +-
tools/testing/selftests/vDSO/vdso_types.h | 70 ++++++++
7 files changed, 285 insertions(+), 132 deletions(-)
---
base-commit: 1b2eb8c1324859864f4aa79dc3cfbb2f7ef5c524
change-id: 20251110-vdso-test-types-68ce0c712b79
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the trace process may dead-lock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.
The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.
When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes. If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.
Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending. In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.
This means there is no API change, unlike the previous
version of this patch which was discussed here:
https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d…
See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.
Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.
Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
---
fs/exec.c | 69 ++++++++++++++++-------
fs/proc/base.c | 6 ++
include/linux/cred.h | 1 +
include/linux/sched/signal.h | 18 ++++++
kernel/cred.c | 28 +++++++--
kernel/ptrace.c | 32 +++++++++++
kernel/seccomp.c | 12 +++-
tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
8 files changed, 155 insertions(+), 34 deletions(-)
v10: Changes to previous version, make the PTRACE_ATTACH
retun -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessions learned to the description.
v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.
Note: I got actually one response from an automatic checker to the v11 patch,
https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
which is complaining about:
>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@
417 struct linux_binprm *bprm = task->signal->exec_bprm;
418 const struct cred *old_cred;
419 struct mm_struct *old_mm;
420
421 retval = down_write_killable(&task->signal->exec_update_lock);
422 if (retval)
423 goto unlock_creds;
424 task_lock(task);
> 425 old_cred = task->real_cred;
v12: Essentially identical to v11.
- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.
- re-tested the patch with all linux versions from v5.11 to v6.6
v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.
The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.
Thanks
Bernd.
diff --git a/fs/exec.c b/fs/exec.c
index 2f2b0acec4f0..902d3b230485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
{
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t = tsk;
+ bool unsafe_execve_in_progress = false;
if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk)
if (!thread_group_leader(tsk))
sig->notify_count--;
+ while_each_thread(tsk, t) {
+ if (unlikely(t->ptrace)
+ && (t != tsk->group_leader || !t->exit_state))
+ unsafe_execve_in_progress = true;
+ }
+
+ if (unlikely(unsafe_execve_in_progress)) {
+ spin_unlock_irq(lock);
+ sig->exec_bprm = bprm;
+ mutex_unlock(&sig->cred_guard_mutex);
+ spin_lock_irq(lock);
+ }
+
while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
spin_unlock_irq(lock);
@@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
sig->group_exec_task = NULL;
sig->notify_count = 0;
@@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk)
return 0;
killed:
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
sig->group_exec_task = NULL;
@@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
+ /* If the binary is not readable then enforce mm->dumpable=0 */
+ would_dump(bprm, bprm->file);
+ if (bprm->have_execfd)
+ would_dump(bprm, bprm->executable);
+
+ /*
+ * Figure out dumpability. Note that this checking only of current
+ * is wrong, but userspace depends on it. This should be testing
+ * bprm->secureexec instead.
+ */
+ if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+ is_dumpability_changed(current_cred(), bprm->cred) ||
+ !(uid_eq(current_euid(), current_uid()) &&
+ gid_eq(current_egid(), current_gid())))
+ set_dumpable(bprm->mm, suid_dumpable);
+ else
+ set_dumpable(bprm->mm, SUID_DUMP_USER);
+
/*
* Ensure all future errors are fatal.
*/
@@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm)
/*
* Make this the only thread in the thread group.
*/
- retval = de_thread(me);
+ retval = de_thread(me, bprm);
if (retval)
goto out;
@@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
goto out;
- /* If the binary is not readable then enforce mm->dumpable=0 */
- would_dump(bprm, bprm->file);
- if (bprm->have_execfd)
- would_dump(bprm, bprm->executable);
-
/*
* Release all of the old mmap stuff
*/
@@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm)
me->sas_ss_sp = me->sas_ss_size = 0;
- /*
- * Figure out dumpability. Note that this checking only of current
- * is wrong, but userspace depends on it. This should be testing
- * bprm->secureexec instead.
- */
- if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
- !(uid_eq(current_euid(), current_uid()) &&
- gid_eq(current_egid(), current_gid())))
- set_dumpable(current->mm, suid_dumpable);
- else
- set_dumpable(current->mm, SUID_DUMP_USER);
-
perf_event_exec();
__set_task_comm(me, kbasename(bprm->filename), true);
@@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..0da9adfadb48 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
if (rv < 0)
goto out_free;
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ rv = -ERESTARTNOINTR;
+ goto out_free;
+ }
+
rv = security_setprocattr(PROC_I(inode)->op.lsm,
file->f_path.dentry->d_name.name, page,
count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f923528d5cc4..b01e309f5686 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
extern struct cred *cred_alloc_blank(void);
extern struct cred *prepare_creds(void);
extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
extern int commit_creds(struct cred *);
extern void abort_creds(struct cred *);
extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..14df7073a0a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
+ struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
+ * against new credentials while
+ * de_thread is waiting for other
+ * traced threads to terminate.
+ * Set while de_thread is executing.
+ * The cred_guard_mutex is released
+ * after de_thread() has called
+ * zap_other_threads(), therefore
+ * a fatal signal is guaranteed to be
+ * already pending in the unlikely
+ * event, that
+ * current->signal->exec_bprm happens
+ * to be non-zero after the
+ * cred_guard_mutex was acquired.
+ */
+
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace)
+ * Held while execve runs, except when
+ * a sibling thread is being traced.
* Deprecated do not use in new code.
* Use exec_update_lock instead.
*/
diff --git a/kernel/cred.c b/kernel/cred.c
index 98cb4eca23fb..586cb6c7cf6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
return false;
}
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ * Return: true - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+ if (!uid_eq(old->euid, new->euid) ||
+ !gid_eq(old->egid, new->egid) ||
+ !uid_eq(old->fsuid, new->fsuid) ||
+ !gid_eq(old->fsgid, new->fsgid) ||
+ !cred_cap_issubset(old, new))
+ return true;
+
+ return false;
+}
+
/**
* commit_creds - Install new credentials upon the current task
* @new: The credentials to be assigned
@@ -467,11 +489,7 @@ int commit_creds(struct cred *new)
get_cred(new); /* we will require a ref for the subj creds too */
/* dumpability changes */
- if (!uid_eq(old->euid, new->euid) ||
- !gid_eq(old->egid, new->egid) ||
- !uid_eq(old->fsuid, new->fsuid) ||
- !gid_eq(old->fsgid, new->fsgid) ||
- !cred_cap_issubset(old, new)) {
+ if (is_dumpability_changed(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..eb1c450bb7d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
#include <linux/pagemap.h>
#include <linux/ptrace.h>
#include <linux/security.h>
+#include <linux/binfmts.h>
#include <linux/signal.h>
#include <linux/uio.h>
#include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
if (retval)
goto unlock_creds;
+ if (unlikely(task->in_execve)) {
+ struct linux_binprm *bprm = task->signal->exec_bprm;
+ const struct cred __rcu *old_cred;
+ struct mm_struct *old_mm;
+
+ retval = down_write_killable(&task->signal->exec_update_lock);
+ if (retval)
+ goto unlock_creds;
+ task_lock(task);
+ old_cred = task->real_cred;
+ old_mm = task->mm;
+ rcu_assign_pointer(task->real_cred, bprm->cred);
+ task->mm = bprm->mm;
+ retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ rcu_assign_pointer(task->real_cred, old_cred);
+ task->mm = old_mm;
+ task_unlock(task);
+ up_write(&task->signal->exec_update_lock);
+ if (retval)
+ goto unlock_creds;
+ }
+
write_lock_irq(&tasklist_lock);
retval = -EPERM;
if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
{
int ret = -EPERM;
+ if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
}
}
write_unlock_irq(&tasklist_lock);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
- if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(¤t->signal->cred_guard_mutex))
- goto out_put_fd;
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ if (mutex_lock_killable(¤t->signal->cred_guard_mutex))
+ goto out_put_fd;
+
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(¤t->signal->cred_guard_mutex);
+ goto out_put_fd;
+ }
+ }
spin_lock_irq(¤t->sighand->siglock);
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
f = open(mm, O_RDONLY);
ASSERT_GE(f, 0);
close(f);
- f = kill(pid, SIGCONT);
- ASSERT_EQ(f, 0);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_NE(f, -1);
+ ASSERT_NE(f, 0);
+ ASSERT_NE(f, pid);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, pid);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, -1);
+ ASSERT_EQ(errno, ECHILD);
}
TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
sleep(1);
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
- ASSERT_EQ(errno, EAGAIN);
- ASSERT_EQ(k, -1);
+ ASSERT_EQ(k, 0);
k = waitpid(-1, &s, WNOHANG);
ASSERT_NE(k, -1);
ASSERT_NE(k, 0);
ASSERT_NE(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
ASSERT_EQ(WEXITSTATUS(s), 0);
- sleep(1);
- k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
- k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
--
2.39.2
syzbot ci has tested the following series
[v4] ipvlan: support mac-nat mode
https://lore.kernel.org/all/20251118100046.2944392-1-skorodumov.dmitry@huaw…
* [PATCH net-next 01/13] ipvlan: Support MACNAT mode
* [PATCH net-next 02/13] ipvlan: macnat: Handle rx mcast-ip and unicast eth
* [PATCH net-next 03/13] ipvlan: Forget all IP when device goes down
* [PATCH net-next 04/13] ipvlan: Support IPv6 in macnat mode.
* [PATCH net-next 05/13] ipvlan: Fix compilation warning about __be32 -> u32
* [PATCH net-next 06/13] ipvlan: Make the addrs_lock be per port
* [PATCH net-next 07/13] ipvlan: Take addr_lock in ipvlan_open()
* [PATCH net-next 08/13] ipvlan: Don't allow children to use IPs of main
* [PATCH net-next 09/13] ipvlan: const-specifier for functions that use iaddr
* [PATCH net-next 10/13] ipvlan: Common code from v6/v4 validator_event
* [PATCH net-next 11/13] ipvlan: common code to handle ipv6/ipv4 address events
* [PATCH net-next 12/13] ipvlan: Ignore PACKET_LOOPBACK in handle_mode_l2()
* [PATCH net-next 13/13] selftests: drv-net: selftest for ipvlan-macnat mode
and found the following issue:
WARNING: suspicious RCU usage in ipvlan_addr_event
Full report is available here:
https://ci.syzbot.org/series/e483b93a-1063-4c8a-b0e2-89530e79768b
***
WARNING: suspicious RCU usage in ipvlan_addr_event
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: c99ebb6132595b4b288a413981197eb076547c5a
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/ac5af6f3-6b14-4e35-9d81-ee1522de3952/config
8021q: adding VLAN 0 to HW filter on device batadv0
=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
drivers/net/ipvlan/ipvlan.h:128 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syz-executor/5984:
#0: ffffffff8f2cc248 (rtnl_mutex){+.+.}-{4:4}, at: inet_rtm_newaddr+0x3b0/0x18b0
#1: ffffffff8f39d9b0 ((inetaddr_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x54/0x90
stack backtrace:
CPU: 1 UID: 0 PID: 5984 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250
lockdep_rcu_suspicious+0x140/0x1d0
ipvlan_addr_event+0x60b/0x950
notifier_call_chain+0x1b6/0x3e0
blocking_notifier_call_chain+0x6a/0x90
__inet_insert_ifa+0xa13/0xbf0
inet_rtm_newaddr+0xf3a/0x18b0
rtnetlink_rcv_msg+0x7cf/0xb70
netlink_rcv_skb+0x208/0x470
netlink_unicast+0x82f/0x9e0
netlink_sendmsg+0x805/0xb30
__sock_sendmsg+0x21c/0x270
__sys_sendto+0x3bd/0x520
__x64_sys_sendto+0xde/0x100
do_syscall_64+0xfa/0xfa0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f711f191503
Code: 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 61 70 22 00 00 41 89 ca 74 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 55 48 83 ec 30 44 89 4c 24
RSP: 002b:00007ffc44b05f28 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f711ff14620 RCX: 00007f711f191503
RDX: 0000000000000028 RSI: 00007f711ff14670 RDI: 0000000000000003
RBP: 0000000000000001 R08: 00007ffc44b05f44 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000000000000 R14: 00007f711ff14670 R15: 0000000000000000
</TASK>
syz-executor (5984) used greatest stack depth: 19864 bytes left
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot(a)syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller(a)googlegroups.com.
Main objective of this series is to convert the gro.sh and toeplitz.sh
tests to be "NIPA-compatible" - meaning make use of the Python env,
which lets us run the tests against either netdevsim or a real device.
The tests seem to have been written with a different flow in mind.
Namely they source different bash "setup" scripts depending on arguments
passed to the test. While I have nothing against the use of bash and
the overall architecture - the existing code needs quite a bit of work
(don't assume MAC/IP addresses, support remote endpoint over SSH).
If I'm the one fixing it, I'd rather convert them to our "simplistic"
Python.
This series rewrites the tests in Python while addressing their
shortcomings. The functionality of running the test over loopback
on a real device is retained but with a different method of invocation
(see the last patch).
Once again we are dealing with a script which run over a variety of
protocols (combination of [ipv4, ipv6, ipip] x [tcp, udp]). The first
4 patches add support for test variants to our scripts. We use the
term "variant" in the same sense as the C kselftest_harness.h -
variant is just a set of static input arguments.
Note that neither GRO nor the Toeplitz test fully passes for me on
any HW I have access to. But this is unrelated to the conversion.
This series is not making any real functional changes to the tests,
it is limited to improving the "test harness" scripts.
Jakub Kicinski (12):
selftests: net: py: coding style improvements
selftests: net: py: extract the case generation logic
selftests: net: py: add test variants
selftests: drv-net: xdp: use variants for qstat tests
selftests: net: relocate gro and toeplitz tests to drivers/net
selftests: net: py: support ksft ready without wait
selftests: net: py: read ip link info about remote dev
netdevsim: pass packets thru GRO on Rx
selftests: drv-net: add a Python version of the GRO test
selftests: drv-net: hw: convert the Toeplitz test to Python
netdevsim: add loopback support
selftests: net: remove old setup_* scripts
tools/testing/selftests/drivers/net/Makefile | 2 +
.../testing/selftests/drivers/net/hw/Makefile | 6 +-
tools/testing/selftests/net/Makefile | 7 -
tools/testing/selftests/net/lib/Makefile | 1 +
drivers/net/netdevsim/netdev.c | 26 ++-
.../testing/selftests/{ => drivers}/net/gro.c | 5 +-
.../{net => drivers/net/hw}/toeplitz.c | 7 +-
.../testing/selftests/drivers/net/.gitignore | 1 +
tools/testing/selftests/drivers/net/gro.py | 161 ++++++++++++++
.../selftests/drivers/net/hw/.gitignore | 3 +-
.../drivers/net/hw/lib/py/__init__.py | 4 +-
.../selftests/drivers/net/hw/toeplitz.py | 208 ++++++++++++++++++
.../selftests/drivers/net/lib/py/__init__.py | 4 +-
.../selftests/drivers/net/lib/py/env.py | 2 +
tools/testing/selftests/drivers/net/xdp.py | 42 ++--
tools/testing/selftests/net/.gitignore | 2 -
tools/testing/selftests/net/gro.sh | 105 ---------
.../selftests/net/lib/ksft_setup_loopback.sh | 111 ++++++++++
.../testing/selftests/net/lib/py/__init__.py | 5 +-
tools/testing/selftests/net/lib/py/ksft.py | 93 ++++++--
tools/testing/selftests/net/lib/py/nsim.py | 2 +-
tools/testing/selftests/net/lib/py/utils.py | 20 +-
tools/testing/selftests/net/setup_loopback.sh | 120 ----------
tools/testing/selftests/net/setup_veth.sh | 45 ----
tools/testing/selftests/net/toeplitz.sh | 199 -----------------
.../testing/selftests/net/toeplitz_client.sh | 28 ---
26 files changed, 631 insertions(+), 578 deletions(-)
rename tools/testing/selftests/{ => drivers}/net/gro.c (99%)
rename tools/testing/selftests/{net => drivers/net/hw}/toeplitz.c (99%)
create mode 100755 tools/testing/selftests/drivers/net/gro.py
create mode 100755 tools/testing/selftests/drivers/net/hw/toeplitz.py
delete mode 100755 tools/testing/selftests/net/gro.sh
create mode 100755 tools/testing/selftests/net/lib/ksft_setup_loopback.sh
delete mode 100644 tools/testing/selftests/net/setup_loopback.sh
delete mode 100644 tools/testing/selftests/net/setup_veth.sh
delete mode 100755 tools/testing/selftests/net/toeplitz.sh
delete mode 100755 tools/testing/selftests/net/toeplitz_client.sh
--
2.51.1
Since commit 31158ad02ddb ("rqspinlock: Add deadlock detection
and recovery") the updated path on re-entrancy now reports deadlock
via -EDEADLK instead of the previous -EBUSY.
Also, the way reentrancy was exercised (via fentry/lookup_elem_raw)
has been fragile because lookup_elem_raw may be inlined
(find_kernel_btf_id() will return -ESRCH).
To fix this fentry is attached to bpf_obj_free_fields() instead of
lookup_elem_raw() and:
- The htab map is made to use a BTF-described struct val with a
struct bpf_timer so that check_and_free_fields() reliably calls
bpf_obj_free_fields() on element replacement.
- The selftest is updated to do two updates to the same key (insert +
replace) in prog_test.
- The selftest is updated to align with expected errno with the
kernel’s current behavior.
Signed-off-by: Saket Kumar Bhaskar <skb99(a)linux.ibm.com>
---
Changes since v2:
Addressed CI failures:
* Initialize key to 0 before the first update.
* Used pointer value to pass for update and memset rather than
&value.
v2: https://lore.kernel.org/all/20251114152653.356782-1-skb99@linux.ibm.com/
Changes since v1:
Addressed comments from Alexei:
* Fixed the scenario where test may fail when lookup_elem_raw()
is inlined.
v1: https://lore.kernel.org/all/20251106052628.349117-1-skb99@linux.ibm.com/
.../selftests/bpf/prog_tests/htab_update.c | 37 ++++++++++++++-----
.../testing/selftests/bpf/progs/htab_update.c | 19 +++++++---
2 files changed, 41 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
index 2bc85f4814f4..d0b405eb2966 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
@@ -15,17 +15,17 @@ struct htab_update_ctx {
static void test_reenter_update(void)
{
struct htab_update *skel;
- unsigned int key, value;
+ void *value = NULL;
+ unsigned int key, value_size;
int err;
skel = htab_update__open();
if (!ASSERT_OK_PTR(skel, "htab_update__open"))
return;
- /* lookup_elem_raw() may be inlined and find_kernel_btf_id() will return -ESRCH */
- bpf_program__set_autoload(skel->progs.lookup_elem_raw, true);
+ bpf_program__set_autoload(skel->progs.bpf_obj_free_fields, true);
err = htab_update__load(skel);
- if (!ASSERT_TRUE(!err || err == -ESRCH, "htab_update__load") || err)
+ if (!ASSERT_TRUE(!err, "htab_update__load") || err)
goto out;
skel->bss->pid = getpid();
@@ -33,14 +33,33 @@ static void test_reenter_update(void)
if (!ASSERT_OK(err, "htab_update__attach"))
goto out;
- /* Will trigger the reentrancy of bpf_map_update_elem() */
+ value_size = bpf_map__value_size(skel->maps.htab);
+
+ value = calloc(1, value_size);
+ if (!ASSERT_OK_PTR(value, "calloc value"))
+ goto out;
+ /*
+ * First update: plain insert. This should NOT trigger the re-entrancy
+ * path, because there is no old element to free yet.
+ */
key = 0;
- value = 0;
- err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, &value, 0);
- if (!ASSERT_OK(err, "add element"))
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, value, BPF_ANY);
+ if (!ASSERT_OK(err, "first update (insert)"))
+ goto out;
+
+ /*
+ * Second update: replace existing element with same key and trigger
+ * the reentrancy of bpf_map_update_elem().
+ * check_and_free_fields() calls bpf_obj_free_fields() on the old
+ * value, which is where fentry program runs and performs a nested
+ * bpf_map_update_elem(), triggering -EDEADLK.
+ */
+ memset(value, 0, value_size);
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, value, BPF_ANY);
+ if (!ASSERT_OK(err, "second update (replace)"))
goto out;
- ASSERT_EQ(skel->bss->update_err, -EBUSY, "no reentrancy");
+ ASSERT_EQ(skel->bss->update_err, -EDEADLK, "no reentrancy");
out:
htab_update__destroy(skel);
}
diff --git a/tools/testing/selftests/bpf/progs/htab_update.c b/tools/testing/selftests/bpf/progs/htab_update.c
index 7481bb30b29b..195d3b2fba00 100644
--- a/tools/testing/selftests/bpf/progs/htab_update.c
+++ b/tools/testing/selftests/bpf/progs/htab_update.c
@@ -6,24 +6,31 @@
char _license[] SEC("license") = "GPL";
+/* Map value type: has BTF-managed field (bpf_timer) */
+struct val {
+ struct bpf_timer t;
+ __u64 payload;
+};
+
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1);
- __uint(key_size, sizeof(__u32));
- __uint(value_size, sizeof(__u32));
+ __type(key, __u32);
+ __type(value, struct val);
} htab SEC(".maps");
int pid = 0;
int update_err = 0;
-SEC("?fentry/lookup_elem_raw")
-int lookup_elem_raw(void *ctx)
+SEC("?fentry/bpf_obj_free_fields")
+int bpf_obj_free_fields(void *ctx)
{
- __u32 key = 0, value = 1;
+ __u32 key = 0;
+ struct val value = { .payload = 1 };
if ((bpf_get_current_pid_tgid() >> 32) != pid)
return 0;
- update_err = bpf_map_update_elem(&htab, &key, &value, 0);
+ update_err = bpf_map_update_elem(&htab, &key, &value, BPF_ANY);
return 0;
}
--
2.51.0
Here are a bunch of small improvements to the MPTCP selftests:
- Patch 1: move code to mptcp_lib.sh to prepare the new features.
- Patch 2: simplify mptcp_lib_pr_err_stats helper use.
- Patch 3: remove unused last column from nstat output.
- Patch 4: improve stats dump in mptcp_join.sh.
- Patch 5: get counters from nstat history and simplify mptcp_connect.sh.
- Patch 6: avoid taking the same packet trace twice.
- Patch 7: wait for an event instead of a fix time.
- Patch 8: instead of using 'timeout' and print the stats after, another
internal timeout is used: if it fires, it will print stats, then stop
everything. This avoids confusions around stats in case of timeout.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (8):
selftests: mptcp: lib: introduce 'nstat_{init,get}'
selftests: mptcp: lib: remove stats files args
selftests: mptcp: lib: stats: remove nstat rate columns
selftests: mptcp: join: dump stats from history
selftests: mptcp: lib: get counters from nstat history
selftests: mptcp: connect: avoid double packet traces
selftests: mptcp: wait for port instead of sleep
selftests: mptcp: get stats just before timing out
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 140 ++++++++++-----------
tools/testing/selftests/net/mptcp/mptcp_join.sh | 65 +++++-----
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 58 +++++++--
tools/testing/selftests/net/mptcp/mptcp_sockopt.sh | 43 ++++---
tools/testing/selftests/net/mptcp/simult_flows.sh | 44 ++++---
tools/testing/selftests/net/mptcp/userspace_pm.sh | 3 +-
6 files changed, 203 insertions(+), 150 deletions(-)
---
base-commit: df58ee7d8faf353ebf5d4703c35fcf3e578e9b1b
change-id: 20251114-net-next-mptcp-sft-count-cache-stats-timeout-faa64482db92
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>