Hello,
binder_alloc_selftest provides a robust set of checks for the binder
allocator, but it rarely runs because it must hook into a running binder
process and block all other binder threads until it completes. The test
itself is a good candidate for conversion to KUnit, and it can be
further isolated from user processes by using a test-specific lru
freelist instead of the global one. This series converts the selftest
to KUnit to make it less burdensome to run and to set up a foundation
for unit testing future binder_alloc changes.
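For reference, a KUnit suite for this code has roughly the following
shape (an illustrative sketch only; the real cases live in
drivers/android/tests/binder_alloc_kunit.c, and the test and suite names
here are hypothetical):

  #include <kunit/test.h>

  static void binder_alloc_example_test(struct kunit *test)
  {
          /* Exercise the allocator and assert on the outcome. */
          KUNIT_EXPECT_EQ(test, 0, 0);
  }

  static struct kunit_case binder_alloc_test_cases[] = {
          KUNIT_CASE(binder_alloc_example_test),
          {}
  };

  static struct kunit_suite binder_alloc_test_suite = {
          .name = "binder_alloc",
          .test_cases = binder_alloc_test_cases,
  };

  kunit_test_suites(&binder_alloc_test_suite);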
Thanks,
Tiffany
Tiffany Yang (6):
binder: Fix selftest page indexing
binder: Store lru freelist in binder_alloc
kunit: test: Export kunit_attach_mm()
binder: Scaffolding for binder_alloc KUnit tests
binder: Convert binder_alloc selftests to KUnit
binder: encapsulate individual alloc test cases
drivers/android/Kconfig | 15 +-
drivers/android/Makefile | 2 +-
drivers/android/binder.c | 10 +-
drivers/android/binder_alloc.c | 39 +-
drivers/android/binder_alloc.h | 14 +-
drivers/android/binder_alloc_selftest.c | 306 -----------
drivers/android/binder_internal.h | 4 +
drivers/android/tests/.kunitconfig | 3 +
drivers/android/tests/Makefile | 3 +
drivers/android/tests/binder_alloc_kunit.c | 573 +++++++++++++++++++++
include/kunit/test.h | 12 +
lib/kunit/user_alloc.c | 4 +-
12 files changed, 645 insertions(+), 340 deletions(-)
delete mode 100644 drivers/android/binder_alloc_selftest.c
create mode 100644 drivers/android/tests/.kunitconfig
create mode 100644 drivers/android/tests/Makefile
create mode 100644 drivers/android/tests/binder_alloc_kunit.c
--
2.50.0.727.gbf7dc18ff4-goog
Reading /proc/pid/maps requires read-locking mmap_lock, which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges; however, it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low-priority monitoring tasks, and having them block high-priority tasks
results in priority inversion.
Locking the entire address space is required to present a fully coherent
picture of it; however, even the current implementation does not strictly
guarantee that, because it outputs vmas in page-size chunks and drops
mmap_lock between chunks. Address space modifications are possible while
mmap_lock is dropped, and userspace reading the content is expected to
deal with possible concurrent address space modifications.
Given these relaxed rules, holding mmap_lock is not strictly needed as
long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces contention
with tasks modifying the address space, because they now contend for
individual vmas rather than the entire address space.
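Schematically, the reader side now looks like this (a simplified sketch;
the helper names follow the series but the signatures are approximated,
so treat it as pseudocode rather than the exact implementation):

  rcu_read_lock();
  vma = vma_next(&vmi);                 /* walk the vma tree under RCU */
  if (vma && !vma_start_read(vma)) {
          /* Per-vma lock contended: fall back to mmap_lock, lock the
           * vma under it, then drop mmap_lock and resume under RCU. */
          rcu_read_unlock();
          mmap_read_lock(mm);
          vma_start_read_locked(vma);
          mmap_read_unlock(mm);
          rcu_read_lock();
  }
  /* ... report the vma, then release it with vma_end_read() ... */
  rcu_read_unlock();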
A previous version of this patchset [1] tried to perform /proc/pid/maps
reading under RCU; however, its implementation was quite complex and its
results were worse than the new version's, because it still relied on
mmap_lock speculation, which retries if any part of the address space gets
modified. The new implementation is both simpler and results in less
contention. Note that a similar approach would not work for /proc/pid/smaps
reading, as it also walks the page table, which is not RCU-safe.
Paul McKenney designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears.
The latest results from Paul:
With stock mm-unstable, all of the runs had maximum latencies in excess
of 0.5 milliseconds, with 80% of the runs' latencies exceeding a full
millisecond and ranging up beyond 4 full milliseconds. In contrast,
99% of the runs with this patch series applied had maximum latencies
of less than 0.5 milliseconds, with the single outlier at only 0.608
milliseconds.
From a median-performance (as opposed to maximum-latency) viewpoint,
this patch series also looks good, with stock mm weighing in at 11
microseconds and the patch series at 6 microseconds, close to a 2x
improvement.
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.521
0.011 0.008 0.552
0.011 0.008 0.590
0.011 0.008 0.660
...
0.011 0.015 2.987
0.011 0.015 3.038
0.011 0.016 3.431
0.011 0.016 4.707
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.026
0.006 0.005 0.029
0.006 0.005 0.034
0.006 0.005 0.035
...
0.006 0.006 0.421
0.006 0.006 0.423
0.006 0.006 0.439
0.006 0.006 0.608
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps could yield
inconsistent data because the file is read page-by-page, with mmap_lock
being dropped between pages. An example of user-visible inconsistency is
the same vma being printed twice: once before it was modified and once
after. For example, if a vma was extended, it might be found and reported
twice. What is not expected is to see a gap where there should have been a
vma, both before and after the modification. This patchset increases the
chances of such tearing; therefore it's even more important now to test
for unexpected inconsistencies.
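For instance, a reader racing with a vma extension might emit both
versions of the same vma (illustrative addresses):

  7f2e4a000000-7f2e4a001000 rw-p 00000000 00:00 0   <- before extension
  7f2e4a000000-7f2e4a003000 rw-p 00000000 00:00 0   <- same vma, extended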
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;
Splits:
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report the overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report the previous vma and the gap along with the final
merged vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes since v6 [4]:
- Updated patch 7/8 changelog, per Lorenzo Stoakes
- Added comments, per Lorenzo Stoakes
- Added Reviewed-by, per Lorenzo Stoakes and Liam Howlett
- Replaced iter with vmi, per Lorenzo Stoakes
- Renamed from lock_vma_under_mmap_lock() to
lock_next_vma_under_mmap_lock(), per Lorenzo Stoakes
- Renamed lock_next_vma() parameter from addr to from_addr
- Renamed labels in lock_next_vma() to reflect fallback cases,
per Lorenzo Stoakes
- Handle vma_start_read_locked() failure inside
lock_next_vma_under_mmap_lock() and added fallback_to_mmap_lock()
for that, per Vlastimil Babka
- Added missing vma_iter_init() after re-entering rcu read section inside
lock_next_vma(), per Vlastimil Babka
- Replaced vma_iter_init() with vma_iter_set(), per Liam Howlett
- Removed the last patch converting PROCMAP_QUERY to use per-vma locks.
That patch will be posted separately,
per David Hildenbrand, Vlastimil Babka and Liam Howlett
- Updated performance numbers, per Paul E. McKenney
!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting v6 version of this
patchset (from 2771a4b86aa1 to a20b00f7cf33 in mm-unstable).
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
Suren Baghdasaryan (7):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
fs/proc/task_mmu: remove conversion of seq_file position to unsigned
fs/proc/task_mmu: read proc/pid/maps under per-vma lock
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 155 +++-
include/linux/mmap_lock.h | 11 +
mm/madvise.c | 3 +-
mm/mmap_lock.c | 93 ++
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
8 files changed, 1082 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-maps-race.c
--
2.50.0.727.gbf7dc18ff4-goog
Hello,
this series follows up on the one introducing support for 9+ arguments
for tracing programs [1]. It has been observed with that series that
there are cases in which we cannot accurately identify the location of
the target function's arguments in order to correctly prepare the
corresponding BPF trampoline. This is the case, for example, if:
- the function consumes a struct variable _by value_
- it is passed on the stack (no more registers available for it)
- it has some __packed__ or __aligned(X)__ attribute
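For instance, a function along these lines hits all three conditions at
once (a hypothetical example, not taken from the kernel):

  struct foo {
          u64 a;
          u32 b;
  } __attribute__((packed));

  /* With x0-x7 exhausted by the first eight arguments, 'f' is passed
   * by value on the stack, at a location the trampoline could not
   * reliably compute because of the packed layout. */
  int do_stuff(u64 a1, u64 a2, u64 a3, u64 a4, u64 a5, u64 a6, u64 a7,
               u64 a8, struct foo f);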
As a consequence, a small restrictive check had been added on the arm64
side, highlighting that other architectures supporting 9+ arguments in BPF
trampolines already suffer from the same issue. After some discussion and
experimentation, the chosen solution is, rather than applying the same
constraint to all JIT compilers, to prevent such functions from being
encoded in BTF info at all [2]. As the pahole side is close to being
integrated, we can now remove the restrictive check from the kernel side.
[1] https://lore.kernel.org/bpf/20250527-many_args_arm64-v3-0-3faf7bb8e4a2@boot…
[2] https://lore.kernel.org/bpf/20250707-btf_skip_structs_on_stack-v3-0-29569e0…
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (2):
bpf, arm64: remove structs on stack constraint
selftests/bpf: enable tracing_struct tests for arm64
arch/arm64/net/bpf_jit_comp.c | 5 -----
tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 -
2 files changed, 6 deletions(-)
---
base-commit: 8da1e37fc84868b50ba6a7cdf082aa3b0d11e006
change-id: 20250708-arm64_relax_jit_comp-e8889647d8d2
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
I am submitting a new selftest for the netpoll subsystem, specifically
targeting the case where RX polling happens in the TX path, a case for
which we have no test in the tree today. This happens when
netpoll_poll_dev() is called, and this test creates a scenario in which
that is likely.
The test does the following:
1) Configuring a single RX/TX queue to increase contention on the
interface.
2) Generating background traffic to saturate the network, mimicking
real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor
its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to
delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is
called when expected. If it is called, then the test passes,
otherwise the test is marked as skipped.
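At its core, that check boils down to a probe of this shape (illustrative;
the selftest drives it through the bpftrace helper mentioned below):

  bpftrace -e 'kprobe:netpoll_poll_dev { @hits[pid] = count(); }'

The '@hits' map key is where the '@'-prefix stripping in patch 1 comes in.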
In order to achieve this, I stole Jakub's bpftrace helper from [1] and
made some small changes that make the helper easier to use.
So, this patchset basically contains:
1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself
Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]
---
Changes in v7:
- Rebased on top of net-next
- Using `ethtool -l` json option instead of parsing it manually.
- Link to v6: https://lore.kernel.org/r/20250711-netpoll_test-v6-0-130465f286a8@debian.org
Changes in v6:
- Removed the network toggle (Jakub)
- Set ringsize and queue size (Jakub)
- Some other general improvements (Jakub)
- Link to v5: https://lore.kernel.org/r/20250709-netpoll_test-v5-0-b3737895affe@debian.org
Changes in v5:
- Rebased on top of net-next.
- Calling bpftrace_stop using the defer helper. (Willem)
- Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org
Changes in v4:
- Make the test XFail if it doesn't hit the function we are looking for
- Toggle the interface while the traffic is flowing.
- Bumped the number of messages from 10 to 40 per iteration.
* This is hitting ~15 times per run on my vng test.
- Decreased the time from 15 seconds to 10 seconds, given that if
it didn't hit the function in 10 seconds, 5 seconds extra will not
help.
- Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org
Changes in v3:
- Make pylint happy (Simon)
- Remove the unnecessary patch in bpftrace to raise an exception when it
fails. (Jakub)
- Improved the bpftrace code (Willem)
- Stop sending messages if bpftrace is not alive anymore.
- Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev().
In my current configuration, it is hitting multiple times during the
test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org
---
Breno Leitao (2):
selftests: drv-net: Strip '@' prefix from bpftrace map keys
selftests: net: add netpoll basic functionality test
Jakub Kicinski (1):
selftests: drv-net: add helper/wrapper for bpftrace
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/py/__init__.py | 4 +-
.../testing/selftests/drivers/net/netpoll_basic.py | 396 +++++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 35 ++
4 files changed, 434 insertions(+), 2 deletions(-)
---
base-commit: b06c4311711c57c5e558bd29824b08f0a6e2a155
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Jakub reported that the rtnetlink test for the preferred lifetime of an
address has become quite flaky. The issue started appearing around the 6.16
merge window in May, and the test fails with:
FAIL: preferred_lft addresses remaining
The flakiness might be related to power-saving behavior, as address
expiration is handled by a "power-efficient" workqueue.
To address this, use slowwait to check more frequently whether the address
still exists. This reduces the likelihood of the system entering a low-power
state during the test, improving reliability.
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/net/rtnetlink.sh | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..49141254065c 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -291,6 +291,17 @@ kci_test_route_get()
end_test "PASS: route get"
}
+check_addr_not_exist()
+{
+ dev=$1
+ addr=$2
+ if ip addr show dev $dev | grep -q $addr; then
+ return 1
+ else
+ return 0
+ fi
+}
+
kci_test_addrlft()
{
for i in $(seq 10 100) ;do
@@ -298,9 +309,8 @@ kci_test_addrlft()
run_cmd ip addr add 10.23.11.$i/32 dev "$devdummy" preferred_lft $lft valid_lft $((lft+1))
done
- sleep 5
- run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
- if [ $? -eq 0 ]; then
+ slowwait 5 check_addr_not_exist "$devdummy" "10.23.11."
+ if [ $? -eq 1 ]; then
check_err 1
end_test "FAIL: preferred_lft addresses remaining"
return
--
2.46.0
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when remapped anonymous mappings may or may
not be mergeable, depending on whether their VMAs have been faulted in,
due to anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this results in surprising behaviour where a moved region is
faulted, then moved back, and the user fails to observe a merge of
otherwise compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently grouping sanity checks together and separating those
that check input parameters from those relating to VMAs.
We also simplify the post-mmap-lock-drop processing for uffd and
mlock()'d VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
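From userspace, the newly permitted case looks roughly like this (a
minimal sketch: error handling for the mmap() calls is omitted, and the
destination range is reserved up front so MREMAP_FIXED has somewhere to
land):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t pg = 4096, len = 4 * pg;
          char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          char *dst = mmap(NULL, len, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* Split src into two VMAs by changing the protection of its
           * second half. */
          mprotect(src + 2 * pg, 2 * pg, PROT_READ);

          /* Move both VMAs in a single call; this requires MREMAP_FIXED
           * and used to fail once src spanned more than one VMA. */
          if (mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
                     dst) == MAP_FAILED)
                  perror("mremap");
          return 0;
  }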
v3:
* Disallowed move operation except for MREMAP_FIXED.
* Disallow gap at start of aggregate range to avoid confusion.
* Disallow any file-backed VMAs with custom get_unmapped_area.
* Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use
separate target_addr var to track next target address.
* Check if first VMA fails multi VMA check, if so we'll allow one VMA but
not multiple.
* Updated the commit message for patch 9 to be clearer about gap behaviour.
* Removed accidentally included debug goto statement in test (doh!). Test
was and is passing regardless.
* Unmap target range in test, previously we ended up moving additional VMAs
unintentionally. This still all passed :) but was not what was intended.
* Removed self-merge check - there is absolutely no way this can happen
across multiple VMAs, as there is no means of moving VMAs such that a VMA
merges with itself.
v2:
* Squashed uffd stub fix into series.
* Propagated tags, thanks!
* Fixed param naming in patch 4 as per Vlastimil.
* Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
Liam.
* Correctly return -EFAULT if no VMAs in input range.
* Account for get_unmapped_area() disregarding MAP_FIXED and returning an
altered address.
* Added additional explanatory comment to the remap_move() function.
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
v1:
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (10):
mm/mremap: perform some simple cleanups
mm/mremap: refactor initial parameter sanity checks
mm/mremap: put VMA check and prep logic into helper function
mm/mremap: cleanup post-processing stage of mremap
mm/mremap: use an explicit uffd failure path for mremap
mm/mremap: check remap conditions earlier
mm/mremap: move remap_is_valid() into check_prep_vma()
mm/mremap: clean up mlock populate behaviour
mm/mremap: permit mremap() move of multiple VMAs
tools/testing/selftests: extend mremap_test to test multi-VMA mremap
fs/userfaultfd.c | 15 +-
include/linux/userfaultfd_k.h | 5 +
mm/mremap.c | 553 +++++++++++++++--------
tools/testing/selftests/mm/mremap_test.c | 146 +++++-
4 files changed, 518 insertions(+), 201 deletions(-)
--
2.50.0
This patchset fixes an issue where bonding fails to establish a stable LACP
negotiation when operating in passive mode (lacp_active=off).
In passive mode, the current implementation only replies when the partner's
state changes, which results in LACP timeout and unstable aggregator formation.
With this change, the bond responds to each received LACPDU in passive mode
by setting ntt = true, ensuring timely replies and stable LACP negotiation.
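Conceptually, the fix amounts to the following in the 802.3ad RX machine
(a sketch of the idea, not the exact hunk from bond_3ad.c):

  /* When we are passive (no LACP_ACTIVITY in our actor state), schedule
   * a reply to every received LACPDU instead of replying only when the
   * partner's state changes. */
  if (lacpdu && !(port->actor_oper_port_state & LACP_STATE_LACP_ACTIVITY))
          port->ntt = true;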
Hangbin Liu (2):
bonding: update ntt to true in passive mode
selftests: bonding: add test for passive LACP mode
drivers/net/bonding/bond_3ad.c | 6 ++
.../drivers/net/bonding/bond_passive_lacp.sh | 21 +++++
.../drivers/net/bonding/bond_topo_lacp.sh | 77 +++++++++++++++++++
3 files changed, 104 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh
create mode 100644 tools/testing/selftests/drivers/net/bonding/bond_topo_lacp.sh
--
2.46.0
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and in a timely manner.
The test spawns one thread pinned to each CPU; each thread, including
the main one, then runs in short bursts for some time. During this period,
the mm_cids should span all values between 0 and nproc.
At the end of this phase, a thread with high enough mm_cid (>= nproc/2)
is selected to be the new leader, all other threads terminate.
After some time, the only running thread should see 0 as its mm_cid; if
that doesn't happen, the compaction mechanism didn't work and the test
fails. The test never fails if only one core is available, in which case
we cannot test anything, as the only available mm_cid is 0.
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 205 ++++++++++++++++++
3 files changed, 207 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 0fda241fa62b0..b3920c59bf401 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae59547..bc4d940f66d40 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..d13623625f5a9
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,205 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <errno.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 50 ms */
+#define RUNNER_PERIOD 50000
+/*
+ * Number of runs before we terminate or get the token.
+ * The number is slowly increasing with the number of CPUs as the compaction
+ * process can take longer on larger systems. This is an arbitrary value.
+ */
+#define THREAD_RUNS (3 + args->num_cpus/8)
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.50.1
This patch series adds tests to validate XDP native support for the PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
The adjustment tests validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for the head/tail adjustment tests require that at
least one adjustment value works for at least one packet size. This
ensures that the variability in the maximum supported head/tail
adjustment offset across different drivers is accounted for.
The results reported in this series are based on fbnic. However, the
series has been tested against multiple other drivers, including
netdevsim. Note: XDP support for fbnic will be added later.
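For reference, the head adjustment these tests exercise boils down to the
standard helpers (an illustrative program, not the exact BPF code in
xdp_native.bpf.c):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp")
  int xdp_adjust_example(struct xdp_md *ctx)
  {
          /* A negative delta grows the packet at the head (consuming
           * headroom); a positive delta shrinks it. Tail adjustment
           * works the same way via bpf_xdp_adjust_tail(). */
          if (bpf_xdp_adjust_head(ctx, -64))
                  return XDP_ABORTED;
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";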
---
Change-log:
V5:
- Fix warning caused by rcu_dereference() in p1
- Fix checkpatch warnings with P3, P4, and P5
V4: https://lore.kernel.org/netdev/20250714210352.1115230-1-mohsin.bashr@gmail.…
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Jakub Kicinski (1):
net: netdevsim: hook in XDP handling
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
drivers/net/netdevsim/netdev.c | 19 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 540 ++++++++++++++
4 files changed, 1215 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Reading /proc/pid/maps requires read-locking mmap_lock, which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges; however, it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low-priority monitoring tasks, and having them block high-priority tasks
results in priority inversion.
Locking the entire address space is required to present a fully coherent
picture of it; however, even the current implementation does not strictly
guarantee that, because it outputs vmas in page-size chunks and drops
mmap_lock between chunks. Address space modifications are possible while
mmap_lock is dropped, and userspace reading the content is expected to
deal with possible concurrent address space modifications.
Given these relaxed rules, holding mmap_lock is not strictly needed as
long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces contention
with tasks modifying the address space, because they now contend for
individual vmas rather than the entire address space. The same is done for
the PROCMAP_QUERY ioctl, which locks only the vma that falls into the
requested range instead of the entire address space. A previous version of
this patchset [1] tried to perform /proc/pid/maps reading under RCU;
however, its implementation was quite complex and its results were worse
than the new version's, because it still relied on mmap_lock speculation,
which retries if any part of the address space gets modified. The new
implementation is both simpler and results in less contention. Note that a
similar approach would not work for /proc/pid/smaps reading, as it also
walks the page table, which is not RCU-safe.
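For reference, a PROCMAP_QUERY lookup has this shape from userspace (a
minimal sketch with error handling omitted; 'addr' stands for the address
of interest):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  struct procmap_query q;

  memset(&q, 0, sizeof(q));
  q.size = sizeof(q);
  q.query_addr = (__u64)addr;
  q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;

  int fd = open("/proc/self/maps", O_RDONLY);
  /* With this series, only the vma satisfying the query is locked,
   * not the entire address space. */
  ioctl(fd, PROCMAP_QUERY, &q);
  /* q.vma_start/q.vma_end now describe the matching vma. */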
Paul McKenney designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to a 10x
improvement in update latencies:
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps could yield
inconsistent data because the file is read page-by-page, with mmap_lock
being dropped between pages. An example of user-visible inconsistency is
the same vma being printed twice: once before it was modified and once
after. For example, if a vma was extended, it might be found and reported
twice. What is not expected is to see a gap where there should have been a
vma, both before and after the modification. This patchset increases the
chances of such tearing; therefore it's even more important now to test
for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;
Splits:
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report the overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report the previous vma and the gap along with the final
merged vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes since v5 [4]:
- Made /proc/pid/maps tearing test a separate selftest,
per Alexey Dobriyan
- Changed asserts with or'ed conditions into separate ones,
per Alexey Dobriyan
- Added a small cleanup patch [6/8] to avoid unnecessary seq_file position
type casting
- Removed unnecessary is_sentinel_pos() helper
- Changed titles to use fs/proc/task_mmu instead of mm/maps prefix,
per David Hildenbrand
- Included Lorenzo's fix for mmap lock assertion in anon_vma_name()
- Reworked the last patch to avoid allocation in the rcu read section,
which replaces Jeongjun Park's fix
!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting the old version along
with its fixes. The following patches should be reverted before applying
this patchset:
b33ce1be8a40 ("selftests/proc: add /proc/pid/maps tearing from vma split test")
b538e0580fd6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma resizing")
4996b4409cc6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma remapping")
c39471f78d5e ("selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified")
487570f548f3 ("selftests/proc: add verbose more for tests to facilitate debugging")
e1ba4969cba1 ("mm/maps: read proc/pid/maps under per-vma lock")
ecb110179e77 ("mm/madvise: fixup stray mmap lock assert in anon_vma_name()")
6772c457a865 ("fs/proc/task_mmu:: execute PROCMAP_QUERY ioctl under per-vma locks")
d5c67bb2c5fb ("mm/maps: move kmalloc() call location in do_procmap_query() out of RCU critical section")
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250624193359.3865351-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
fs/proc/task_mmu: remove conversion of seq_file position to unsigned
fs/proc/task_mmu: read proc/pid/maps under per-vma lock
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 188 +++-
include/linux/mmap_lock.h | 11 +
mm/madvise.c | 3 +-
mm/mmap_lock.c | 88 ++
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
8 files changed, 1098 insertions(+), 28 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-maps-race.c
--
2.50.0.727.gbf7dc18ff4-goog