This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on fbnic. However, the
series is tested against multiple other drivers including netdevism.
Note: The XDP support for fbnic will be added later.
---
Change-log:
V5:
- Fix warning caused by rcu_dereference() in p1
- Fix checkpatch warnings with P3, P4, and P5
V4: https://lore.kernel.org/netdev/20250714210352.1115230-1-mohsin.bashr@gmail.…
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Jakub Kicinski (1):
net: netdevsim: hook in XDP handling
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
drivers/net/netdevsim/netdev.c | 19 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 540 ++++++++++++++
4 files changed, 1215 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's even more important
now to test for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1 Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start end of an existing one;
Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report overritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final marged
vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes since v5 [4]:
- Made /proc/pid/maps tearing test a separate selftest,
per Alexey Dobriyan
- Changed asserts with or'ed conditions into separate ones,
per Alexey Dobriyan
- Added a small cleanup patch [6/8] to avoid unnecessary seq_file position
type casting
- Removed unnecessary is_sentinel_pos() helper
- Changed titles to use fs/proc/task_mmu instead of mm/maps prefix,
per David Hildenbrand
- Included Lorenzo's fix for mmap lock assertion in anon_vma_name()
- Reworked the last patch to avoid allocation in the rcu read section,
which replaces Jeongjun Park's fix
!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting old version with fixes.
The following patches should be reverted before applyng this patchset:
b33ce1be8a40 ("selftests/proc: add /proc/pid/maps tearing from vma split test")
b538e0580fd6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma resizing")
4996b4409cc6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma remapping")
c39471f78d5e ("selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified")
487570f548f3 ("selftests/proc: add verbose more for tests to facilitate debugging")
e1ba4969cba1 ("mm/maps: read proc/pid/maps under per-vma lock")
ecb110179e77 ("mm/madvise: fixup stray mmap lock assert in anon_vma_name()")
6772c457a865 ("fs/proc/task_mmu:: execute PROCMAP_QUERY ioctl under per-vma locks")
d5c67bb2c5fb ("mm/maps: move kmalloc() call location in do_procmap_query() out of RCU critical section")
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250624193359.3865351-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
fs/proc/task_mmu: remove conversion of seq_file position to unsigned
fs/proc/task_mmu: read proc/pid/maps under per-vma lock
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 188 +++-
include/linux/mmap_lock.h | 11 +
mm/madvise.c | 3 +-
mm/mmap_lock.c | 88 ++
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
8 files changed, 1098 insertions(+), 28 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-maps-race.c
--
2.50.0.727.gbf7dc18ff4-goog
NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this
patch series, by the end of this month.
While net-next is closed, this is the last chance for me to send another
version so let me send the local changes now.
Please contact Yuri Benditovich, who is CCed on this email, for anything
about this series.
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v12:
- Updated tools/testing/selftests/net/config.
- Split TUNSETVNETHASH.
- Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com
Changes in v11:
- Added the missing code to free vnet_hash in patch
"tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration but
it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/ipvlan/ipvtap.c | 2 +-
drivers/net/macvtap.c | 2 +-
drivers/net/tap.c | 80 +++++-
drivers/net/tun.c | 92 +++++--
drivers/net/tun_vnet.h | 165 +++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 4 +-
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 80 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/tap.c | 131 +++++++++-
tools/testing/selftests/net/tun.c | 485 ++++++++++++++++++++++++++++++++++-
19 files changed, 1234 insertions(+), 85 deletions(-)
---
base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
change-id: 20240403-rss-e737d89efa77
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v12 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (6):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: Add wait_third_ack for ECN negotiation in simultaneous connect
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 13 +
include/linux/tcp.h | 33 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 663 ++++++++++++++++++
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 29 +-
net/ipv4/tcp_input.c | 371 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 297 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1444 insertions(+), 187 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with
"riscv/traps: Introduce software check exception and uprobe handling"
https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/
v16:
- If FWFT is not implemented or returns error for shadow stack activation, then
no_usercfi is set to disable shadow stack. Although this should be picked up
by extension validation and activation. Fixed this bug for zicfilp and zicfiss
both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by
Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to
keep it off till we have more hardware availibility with RVA23 profile and
zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
asm-offsets.c error.
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v17:
- Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri…
Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv/mm: manufacture shadow stack pte
riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs
riscv/mm: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception and uprobe handling
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/Makefile | 5 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 26 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 10 +
arch/riscv/kernel/cpufeature.c | 27 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 27 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 51 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 6 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 16 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 16 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
54 files changed, 2369 insertions(+), 29 deletions(-)
---
base-commit: 4181f8ad7a1061efed0219951d608d4988302af7
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on fbnic. However, the
series is tested against multiple other drivers including netdevism.
Note: The XDP support for fbnic will be added later.
---
Change-log:
V4:
- Support XDP handling for netdevsim
- Fix pylint warning with P4
- Update commit message for P2,P3 to show pass/fail summary
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Jakub Kicinski (1):
net: netdevsim: hook in XDP handling
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
drivers/net/netdevsim/netdev.c | 19 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 538 ++++++++++++++
4 files changed, 1213 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Hi everyone,
Here's a V3 for the netdevsim PHY support. This V3 includes :
- A fix for a compiling issue with PHYLIB=n
- An updated KConfig to only allow PHYLIB=y|n
- Converted the link setting file to a bool debugfs file, relying on
link state polling
The idea of this series is to allow attaching virtual PHY devices to
netdevsim, so that we can test PHY-related ethtool commands. This can be
extended in the future for phylib testing as well.
V2: https://lore.kernel.org/netdev/20250708115531.111326-1-maxime.chevallier@bo…
- Fix building with PHYLIB=m
- Use shellcheck on the shell scripts
V1: https://lore.kernel.org/netdev/20250702082806.706973-1-maxime.chevallier@bo…
Maxime Chevallier (3):
net: netdevsim: Add PHY support in netdevsim
selftests: ethtool: Drop the unused old_netdevs variable
selftests: ethtool: Introduce ethernet PHY selftests on netdevsim
drivers/net/Kconfig | 1 +
drivers/net/netdevsim/Makefile | 4 +
drivers/net/netdevsim/dev.c | 2 +
drivers/net/netdevsim/netdev.c | 8 +
drivers/net/netdevsim/netdevsim.h | 25 ++
drivers/net/netdevsim/phy.c | 375 ++++++++++++++++++
.../selftests/drivers/net/netdevsim/config | 1 +
.../drivers/net/netdevsim/ethtool-common.sh | 19 +-
.../drivers/net/netdevsim/ethtool-phy.sh | 64 +++
9 files changed, 496 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/netdevsim/phy.c
create mode 100755 tools/testing/selftests/drivers/net/netdevsim/ethtool-phy.sh
--
2.49.0
There is a spelling mistake in a ksft_exit_fail_msg message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/futex/functional/futex_priv_hash.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index a9cedc365102..aea001ac4946 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -122,7 +122,7 @@ static void futex_dummy_op(void)
}
ret = pthread_mutex_timedlock(&lock, &timeout);
if (ret == 0)
- ksft_exit_fail_msg("Succeffuly locked an already locked mutex.\n");
+ ksft_exit_fail_msg("Successfully locked an already locked mutex.\n");
if (ret != ETIMEDOUT)
ksft_exit_fail_msg("pthread_mutex_timedlock() did not timeout: %d.\n", ret);
--
2.50.0
Move the generic `FORCE_READ` macro from `guard-regions.c` to the shared
`vm_util.h` header to promote code reuse.
In `guard-regions.c`, replace `ksft_exit_skip()` with the `SKIP()` macro
to ensure only the current test is skipped on permission failure, instead
of terminating the entire test binary.
Signed-off-by: wang lian <lianux.mm(a)gmail.com>
---
tools/testing/selftests/mm/guard-regions.c | 9 +--------
tools/testing/selftests/mm/vm_util.h | 7 +++++++
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 93af3d3760f9..b0d42eb04e3a 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -35,13 +35,6 @@
static volatile sig_atomic_t signal_jump_set;
static sigjmp_buf signal_jmp_buf;
-/*
- * Ignore the checkpatch warning, we must read from x but don't want to do
- * anything with it in order to trigger a read page fault. We therefore must use
- * volatile to stop the compiler from optimising this away.
- */
-#define FORCE_READ(x) (*(volatile typeof(x) *)x)
-
/*
* How is the test backing the mapping being tested?
*/
@@ -582,7 +575,7 @@ TEST_F(guard_regions, process_madvise)
/* OK we don't have permission to do this, skip. */
if (count == -1 && errno == EPERM)
- ksft_exit_skip("No process_madvise() permissions, try running as root.\n");
+ SKIP(return, "No process_madvise() permissions, try running as root.\n");
/* Returns the number of bytes advised. */
ASSERT_EQ(count, 6 * page_size);
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 2b154c287591..c20298ae98ea 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -18,6 +18,13 @@
#define PM_SWAP BIT_ULL(62)
#define PM_PRESENT BIT_ULL(63)
+/*
+ * Ignore the checkpatch warning, we must read from x but don't want to do
+ * anything with it in order to trigger a read page fault. We therefore must use
+ * volatile to stop the compiler from optimising this away.
+ */
+#define FORCE_READ(x) (*(volatile typeof(x) *)x)
+
extern unsigned int __page_size;
extern unsigned int __page_shift;
--
2.43.0
mptcp_connect.sh can be executed manually with "-m <MODE>" and "-C" to
make sure everything works as expected when using "mmap" and "sendfile"
modes instead of "poll", and with the MPTCP checksum support.
These modes should be validated, but they are not when the selftests are
executed via the kselftest helpers. It means that most CIs validating
these selftests, like NIPA for the net development trees and LKFT for
the stable ones, are not covering these modes.
To fix that, new test programs have been added, simply calling
mptcp_connect.sh with the right parameters.
The first patch can be backported up to v5.6, and the second one up to
v5.14.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (2):
selftests: mptcp: connect: also cover alt modes
selftests: mptcp: connect: also cover checksum
tools/testing/selftests/net/mptcp/Makefile | 3 ++-
tools/testing/selftests/net/mptcp/mptcp_connect_checksum.sh | 4 ++++
tools/testing/selftests/net/mptcp/mptcp_connect_mmap.sh | 4 ++++
tools/testing/selftests/net/mptcp/mptcp_connect_sendfile.sh | 4 ++++
4 files changed, 14 insertions(+), 1 deletion(-)
---
base-commit: b640daa2822a39ff76e70200cb2b7b892b896dce
change-id: 20250714-net-mptcp-sft-connect-alt-c1aaf073ef4e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
This patch series refactors all futex selftests to use
kselftest_harness.h instead of futex's logging.h, as discussed here [1].
This allows to remove a lot of boilerplate code and to simplify some
parts of the test logic, mainly when the test needs to exit early. The
result of this is more than 500 lines removed from
tools/testing/selftests/futex/. Also, this enables new tests to use
kselftest.h features like ASSERT_s and such.
There are some caveats around this refactor:
- logging.h had verbosity levels, while kselftest_harness.h doesn't. I
created a new print function called ksft_print_dbg_msg() that prints
the message if the user uses the -d flag, so now there's an
equivalent of this feature.
- futex_requeue_pi test accepted command line arguments to be used as
test parameters (e.g. ./futex_requeue_pi -b -l -t 500000). This
doesn't work with kselftest_harness.h because there's no
straightforward way to send command line arguments to the test.
I used FIXTURE_VARIANT() to achieve the same result, but now the
parameters live inside of the test file, instead of on
functional/run.sh. This increased a little bit the number of test
cases for futex_requeue_pi, from 22 to 24.
- test_harness_run() calls mmap(MAP_SHARED) before running the test and
this has caused a side effect on test futex_numa_mpol.c. This test
also calls mmap() and then try to access an address out of
boundaries of this mapped memory for a "Memory out of range" subtest,
where the kernel should return -EACCESS. After the refactor, the test
address might be fall inside the first memory mapped region, thus
being a valid address and succeeding the syscall, making the test
fail. To fix that, I created a small "buffer zone" with
mmap(PROT_NONE) between both mmaps.
I have compared the results of run.sh before and after this patchset and
didn't find any regression from the test results.
Thanks,
André
[1] https://lore.kernel.org/lkml/87ecv6p364.ffs@tglx/
---
André Almeida (15):
selftests: kselftest: Create ksft_print_dbg_msg()
selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h
selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h
selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h
selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h
selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h
selftests/futex: Refactor futex_wait with kselftest_harness.h
selftests/futex: Refactor futex_requeue with kselftest_harness.h
selftests/futex: Refactor futex_waitv with kselftest_harness.h
selftests/futex: Refactor futex_priv_hash with kselftest_harness.h
selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h
selftests/futex: Drop logging.h include from futex_numa
selftests/futex: Remove logging.h file
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/futex_numa.c | 3 +-
.../selftests/futex/functional/futex_numa_mpol.c | 57 ++---
.../selftests/futex/functional/futex_priv_hash.c | 79 +++----
.../selftests/futex/functional/futex_requeue.c | 76 ++----
.../selftests/futex/functional/futex_requeue_pi.c | 261 ++++++++++-----------
.../functional/futex_requeue_pi_mismatched_ops.c | 80 ++-----
.../functional/futex_requeue_pi_signal_restart.c | 129 +++-------
.../selftests/futex/functional/futex_wait.c | 103 +++-----
.../functional/futex_wait_private_mapped_file.c | 83 ++-----
.../futex/functional/futex_wait_timeout.c | 139 +++++------
.../functional/futex_wait_uninitialized_heap.c | 76 ++----
.../futex/functional/futex_wait_wouldblock.c | 75 ++----
.../selftests/futex/functional/futex_waitv.c | 98 ++++----
tools/testing/selftests/futex/functional/run.sh | 62 +----
tools/testing/selftests/futex/include/logging.h | 148 ------------
tools/testing/selftests/kselftest.h | 13 +
tools/testing/selftests/kselftest_harness.h | 13 +-
18 files changed, 491 insertions(+), 1007 deletions(-)
---
base-commit: a24cc6ce1933eade12aa2b9859de0fcd2dac2c06
change-id: 20250703-tonyk-robust_test_cleanup-d1f3406365d9
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
Hello everyone,
I am reaching out to announce that we are once again planning to
gather to discuss testing and dependability related topics at the
Kernel Testing & Dependability Micro-conference (a.k.a. Testing MC)
at Linux Plumbers Conference Japan 2025.
- https://lpc.events/event/19/sessions/228/
The Linux Plumbers 2025 Kernel Testing & Dependability track focuses
on advancing the current state of testing of the Linux Kernel and its
related infrastructure.
The main purpose is to improve software quality and dependability for
applications that require predictability and trust.
We aim to create connections between folks working on similar
projects, and help individual projects make progress.
This track is intended to promote collaboration between all the
communities and people interested in the Kernel testing &
dependability.
This will help move the conversation forward from where we left off at
the LPC 2024 Kernel Testing & Dependability MC.
We ask that any topic discussions focus on issues/problems they are
facing and possible alternatives to resolving them.
The Micro-conference is open to all topics related to testing on
Linux, not necessarily in the kernel space.
Suggested topics:
- KernelCI: Maestro, kci-dev, kci-deploy, kci-gitlab, new dashboard, KCIDB-ng
- Improve sanitizers: KFENCE, KCSAN, KASAN, UBSAN
- Using Clang for better testing coverage: Now that the kernel fully
supports building with Clang, how can all that work be leveraged into
using Clang's features?
- Consolidating toolchains: reference collection for increased
reproducibility and quality control.
- How to spread KUnit throughout the kernel?
- Building and testing in-kernel Rust code.
- Identify missing features that will provide assurance in safety
critical systems.
- Which test coverage infrastructures are most effective to provide
evidence for kernel quality assurance? How should it be measured?
- Explore ways to improve testing framework and tests in the kernel
with a specific goal to increase traceability and code coverage.
- Regression Testing for safety: Prioritize configurations and tests
critical and important for quality and dependability.
- Transitioning to test-driven kernel release cycles for mainline and
stable: How to start relying on passing tests before releasing a new
tag?
- Explore how do SBOMs figure into dependability?
- Kernel benchmarking and kernel performance evaluation.
We invite you to submit proposals for discussions.
Proposals can be submitted here,
by 11:59PM UTC on Wednesday, September 10, 2025:
- https://lpc.events/event/19/abstracts/
Please send any inquiries to the MC leads:
Arisu Tachibana <arisu.tachibana(a)miraclelinux.com>
Shuah Khan <shuah(a)kernel.org>
Sasha Levin <sashal(a)kernel.org>
Guillaume Tucker <gtucker(a)gtucker.io>
thank you,
-- Arisu
The rtnetlink test for preferred lifetime of an address is quite flaky.
Problems started around the 6.16 merge window in May. The test fails
with:
FAIL: preferred_lft addresses remaining
and unlike most of our flakes this one fails on the "normal" kernel
builds, not the builds with kernel/configs/debug.config. I suspect
the flakes may be related to power saving, since the expirations
run from a "power efficient" workqueue. Adding a short sleep seems
to decrease the flakes by 8x but they still happen. With this
patch in place we get a flake every couple of weeks, not every
couple of days. Better ideas welcome..
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: liuhangbin(a)gmail.com
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/rtnetlink.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..b9e1497ea27a 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -299,6 +299,11 @@ kci_test_addrlft()
done
sleep 5
+ # Schedule out for a bit, address GC runs from the power efficient WQ
+ # if the long sleep above has put the whole system into sleep state
+ # the WQ may have not had a chance to run.
+ sleep 0.1
+
run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
if [ $? -eq 0 ]; then
check_err 1
--
2.50.0
This is series 2a/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (9):
gpu: nova-core: use `kernel::{fmt,prelude::fmt!}`
rust: alloc: use `kernel::{fmt,prelude::fmt!}`
rust: block: use `kernel::{fmt,prelude::fmt!}`
rust: device: use `kernel::{fmt,prelude::fmt!}`
rust: file: use `kernel::{fmt,prelude::fmt!}`
rust: kunit: use `kernel::{fmt,prelude::fmt!}`
rust: pin-init: use `kernel::{fmt,prelude::fmt!}`
rust: seq_file: use `kernel::{fmt,prelude::fmt!}`
rust: sync: use `kernel::{fmt,prelude::fmt!}`
drivers/block/rnull.rs | 2 +-
drivers/gpu/nova-core/gpu.rs | 3 +--
drivers/gpu/nova-core/regs/macros.rs | 6 +++---
rust/kernel/alloc/kbox.rs | 2 +-
rust/kernel/alloc/kvec.rs | 2 +-
rust/kernel/alloc/kvec/errors.rs | 2 +-
rust/kernel/block/mq.rs | 2 +-
rust/kernel/block/mq/gen_disk.rs | 2 +-
rust/kernel/block/mq/raw_writer.rs | 3 +--
rust/kernel/device.rs | 6 +++---
rust/kernel/fs/file.rs | 5 +++--
rust/kernel/init.rs | 4 ++--
rust/kernel/kunit.rs | 8 ++++----
rust/kernel/seq_file.rs | 6 +++---
rust/kernel/sync/arc.rs | 3 +--
scripts/rustdoc_test_gen.rs | 2 +-
16 files changed, 28 insertions(+), 30 deletions(-)
---
base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This is series 2b/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (10):
gpu: nova-core: use `core::ffi::CStr` method names
rust: auxiliary: use `core::ffi::CStr` method names
rust: configfs: use `core::ffi::CStr` method names
rust: cpufreq: use `core::ffi::CStr` method names
rust: drm: use `core::ffi::CStr` method names
rust: firmware: use `core::ffi::CStr` method names
rust: kunit: use `core::ffi::CStr` method names
rust: miscdevice: use `core::ffi::CStr` method names
rust: net: use `core::ffi::CStr` method names
rust: of: use `core::ffi::CStr` method names
drivers/gpu/drm/drm_panic_qr.rs | 2 +-
rust/kernel/auxiliary.rs | 4 ++--
rust/kernel/configfs.rs | 4 ++--
rust/kernel/cpufreq.rs | 2 +-
rust/kernel/drm/device.rs | 4 ++--
rust/kernel/firmware.rs | 2 +-
rust/kernel/kunit.rs | 6 +++---
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 2 +-
rust/kernel/of.rs | 2 +-
samples/rust/rust_configfs.rs | 2 +-
11 files changed, 16 insertions(+), 16 deletions(-)
---
base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
As David suggested, currently we don't have a high level test case to
verify the behavior of rmap. This patch set introduce the verification
on rmap by migration.
Patch 1 is a preparation to move ksm related operation into vm_util.
Patch 2 is the new test case.
Currently it covers following four scenarios:
* anonymous page
* shmem page
* pagecache page
* ksm page
Wei Yang (2):
selftests/mm: put general ksm operation into vm_util
selftests/mm: assert rmap behave as expected
MAINTAINERS | 1 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
.../selftests/mm/ksm_functional_tests.c | 76 +--
tools/testing/selftests/mm/rmap.c | 466 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
tools/testing/selftests/mm/vm_util.c | 71 +++
tools/testing/selftests/mm/vm_util.h | 7 +
8 files changed, 563 insertions(+), 66 deletions(-)
create mode 100644 tools/testing/selftests/mm/rmap.c
--
2.34.1
I looked at the fchmodat2() tests since I've been experiencing some
random intermittent segfaults with them in my test systems, while doing
so I noticed these two issues. Unfortunately I didn't figure out the
original yet, unless I managed to fix it unwittingly.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
selftests/fchmodat2: Clean up temporary files and directories
selftests/fchmodat2: Use ksft_finished()
tools/testing/selftests/fchmodat2/fchmodat2_test.c | 166 ++++++++++++++-------
1 file changed, 112 insertions(+), 54 deletions(-)
---
base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
change-id: 20250711-selftests-fchmodat2-c30374c376f8
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Benjamin Berg <benjamin.berg(a)intel.com>
Hi,
This patchset adds signal handling to nolibc. Initially, I would like to
use this for tests. But in the long run, the goal is to use nolibc for
the UML kernel itself. In both cases, signal handling will be needed.
v2 contains some bugfixes and has a better test coverage. Also addressed
are various review comments.
Benjamin
Benjamin Berg (4):
selftests/nolibc: fix EXPECT_NZ macro
selftests/nolibc: validate order of constructor calls
tools/nolibc: add more generic bitmask macros for FD_*
tools/nolibc: add signal support
tools/include/nolibc/arch-arm.h | 7 +
tools/include/nolibc/arch-arm64.h | 3 +
tools/include/nolibc/arch-loongarch.h | 3 +
tools/include/nolibc/arch-m68k.h | 10 ++
tools/include/nolibc/arch-mips.h | 3 +
tools/include/nolibc/arch-powerpc.h | 8 +
tools/include/nolibc/arch-riscv.h | 3 +
tools/include/nolibc/arch-s390.h | 8 +-
tools/include/nolibc/arch-sh.h | 5 +
tools/include/nolibc/arch-sparc.h | 47 ++++++
tools/include/nolibc/arch-x86.h | 13 ++
tools/include/nolibc/signal.h | 103 ++++++++++++
tools/include/nolibc/sys.h | 2 +-
tools/include/nolibc/time.h | 3 +-
tools/include/nolibc/types.h | 81 ++++++----
.../selftests/nolibc/nolibc-test-linkage.c | 17 +-
tools/testing/selftests/nolibc/nolibc-test.c | 150 +++++++++++++++++-
17 files changed, 422 insertions(+), 44 deletions(-)
--
2.50.0
I am submitting a new selftest for the netpoll subsystem specifically
targeting the case where the RX is polling in the TX path, which is
a case that we don't have any test in the tree today. This is done when
netpoll_poll_dev() called, and this test creates a scenario when that is
probably.
The test does the following:
1) Configuring a single RX/TX queue to increase contention on the
interface.
2) Generating background traffic to saturate the network, mimicking
real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor
its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to
delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is
called when expected. If it is called, then the test passes,
otherwise the test is marked as skipped.
In order to achieve it, I stole Jakub's bpftrace helper from [1], and
did some small changes that I found useful to use the helper.
So, this patchset basically contains:
1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself
Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]
---
Changes in v6:
- Remove the network toggled (Jakub)
- Set ringsize and queue size (Jakub)
- Some other general improvements (Jakub)
- Link to v5: https://lore.kernel.org/r/20250709-netpoll_test-v5-0-b3737895affe@debian.org
Changes in v5:
- Rebased on top of net-next.
- Calling bpftrace_stop using the defer helper. (Willem)
- Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org
Changes in v4:
- Make the test XFail if it doesn't hit the function we are looking for
- Toggle the interface while the traffic is flowing.
- Bumped the number of messages from 10 to 40 per iterations.
* This is hitting ~15 times per run on my vng test.
- Decreased the time from 15 seconds to 10 seconds, given that if
it didn't hit the function in 10 seconds, 5 seconds extra will not
help.
- Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org
Changes in v3:
- Make pylint happy (Simon)
- Remove the unnecessary patch in bpftrace to raise an exception when it
fails. (Jakub)
- Improved the bpftrace code (Willem)
- Stop sending messages if bpftrace is not alive anymore.
- Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev().
In my current configuration, it is hitting multiple times during the
test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org
---
Breno Leitao (2):
selftests: drv-net: Strip '@' prefix from bpftrace map keys
selftests: net: add netpoll basic functionality test
Jakub Kicinski (1):
selftests: drv-net: add helper/wrapper for bpftrace
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/py/__init__.py | 3 +-
.../testing/selftests/drivers/net/netpoll_basic.py | 411 +++++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 35 ++
4 files changed, 449 insertions(+), 1 deletion(-)
---
base-commit: 0f26870a989bf69957ed69d10c7ffc57ca5a7f52
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v22.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v22 (11-Jul-2025)
- Fix the issue when user provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)
v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 151 ++-
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1171 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1665 insertions(+), 5 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on fbnic. However, the
series is tested against multiple other drivers including netdevism.
Note: The XDP support for fbnic will be added later.
---
Change-log:
V3:
P1: Remove exception handling upon xdp attachment failure for better
debuggability
P4: Handle the bound issue with the use of bpf_xdp_load_bytes()
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 538 ++++++++++++++
3 files changed, 1195 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
New version (unchanged for patches 1-3), with a test added so we can
detect this.
Followup of https://lore.kernel.org/linux-input/c75433e0-9b47-4072-bbe8-b1d14ea97b13@ro…
This initial series attempt at fixing the various bugs discovered by
Alan regarding __hid_request().
Syzbot managed to create a report descriptor which presents a feature
request of size 0 (still trying to extract it) and this exposed the fact
that __hid_request() was incorrectly handling the case when the report
ID is not used.
Send a first batch of fixes now so we get the feedback from syzbot ASAP.
Note: in the series, I also mentioned that the report of size 0 should
be stripped out of the HID device, but I'm not entirely sure this would
be a good idea in the end.
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Changes in v2:
- added Tested-by from syzbot (https://lore.kernel.org/r/686e9113.050a0220.385921.0008.GAE@google.com)
- added a python test for it
- Link to v1: https://lore.kernel.org/r/20250709-report-size-null-v1-0-194912215cbc@kerne…
---
Benjamin Tissoires (4):
HID: core: ensure the allocated report buffer can contain the reserved report ID
HID: core: ensure __hid_request reserves the report ID as the first byte
HID: core: do not bypass hid_hw_raw_request
selftests/hid: add a test case for the recent syzbot underflow
drivers/hid/hid-core.c | 19 +++++--
tools/testing/selftests/hid/tests/test_mouse.py | 70 +++++++++++++++++++++++++
2 files changed, 84 insertions(+), 5 deletions(-)
---
base-commit: 1f988d0788f50d8464f957e793fab356e2937369
change-id: 20250709-report-size-null-37619ea20288
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
Currently testing of userspace and in-kernel API use two different
frameworks. kselftests for the userspace ones and Kunit for the
in-kernel ones. Besides their different scopes, both have different
strengths and limitations:
Kunit:
* Tests are normal kernel code.
* They use the regular kernel toolchain.
* They can be packaged and distributed as modules conveniently.
Kselftests:
* Tests are normal userspace code
* They need a userspace toolchain.
A kernel cross toolchain is likely not enough.
* A fair amout of userland is required to run the tests,
which means a full distro or handcrafted rootfs.
* There is no way to conveniently package and run kselftests with a
given kernel image.
* The kselftests makefiles are not as powerful as regular kbuild.
For example they are missing proper header dependency tracking or more
complex compiler option modifications.
Therefore kunit is much easier to run against different kernel
configurations and architectures.
This series aims to combine kselftests and kunit, avoiding both their
limitations. It works by compiling the userspace kselftests as part of
the regular kernel build, embedding them into the kunit kernel or module
and executing them from there. If the kernel toolchain is not fit to
produce userspace because of a missing libc, the kernel's own nolibc can
be used instead.
The structured TAP output from the kselftest is integrated into the
kunit KTAP output transparently, the kunit parser can parse the combined
logs together.
Further room for improvements:
* Call each test in its completely dedicated namespace
* Handle additional test files besides the test executable through
archives. CPIO, cramfs, etc.
* Compatibility with kselftest_harness.h (in progress)
* Expose the blobs in debugfs
* Provide some convience wrappers around compat userprogs
* Figure out a migration path/coexistence solution for
kunit UAPI and tools/testing/selftests/
Output from the kunit example testcase, note the output of
"example_uapi_tests".
$ ./tools/testing/kunit/kunit.py run --kunitconfig lib/kunit example
...
Running tests with:
$ .kunit/linux kunit.filter_glob=example kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:53:53] ================== example (10 subtests) ===================
[11:53:53] [PASSED] example_simple_test
[11:53:53] [SKIPPED] example_skip_test
[11:53:53] [SKIPPED] example_mark_skipped_test
[11:53:53] [PASSED] example_all_expect_macros_test
[11:53:53] [PASSED] example_static_stub_test
[11:53:53] [PASSED] example_static_stub_using_fn_ptr_test
[11:53:53] [PASSED] example_priv_test
[11:53:53] =================== example_params_test ===================
[11:53:53] [SKIPPED] example value 3
[11:53:53] [PASSED] example value 2
[11:53:53] [PASSED] example value 1
[11:53:53] [SKIPPED] example value 0
[11:53:53] =============== [PASSED] example_params_test ===============
[11:53:53] [PASSED] example_slow_test
[11:53:53] ======================= (4 subtests) =======================
[11:53:53] [PASSED] procfs
[11:53:53] [PASSED] userspace test 2
[11:53:53] [SKIPPED] userspace test 3: some reason
[11:53:53] [PASSED] userspace test 4
[11:53:53] ================ [PASSED] example_uapi_test ================
[11:53:53] ===================== [PASSED] example =====================
[11:53:53] ============================================================
[11:53:53] Testing complete. Ran 16 tests: passed: 11, skipped: 5
[11:53:53] Elapsed time: 67.543s total, 1.823s configuring, 65.655s building, 0.058s running
Based on v6.16-rc1.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v4:
- Move Kconfig.nolibc from tools/ to init/
- Drop generic userprogs nolibc integration
- Drop generic blob framework
- Pick up review tags from David
- Extend new kunit TAP parser tests
- Add MAINTAINERS entry
- Allow CONFIG_KUNIT_UAPI=m
- Split /proc validation into dedicated UAPI test
- Trim recipient list a bit
- Use KUNIT_FAIL_AND_ABORT() over KUNIT_FAIL()
- Link to v3: https://lore.kernel.org/r/20250611-kunit-kselftests-v3-0-55e3d148cbc6@linut…
Changes in v3:
- Reintroduce CONFIG_CC_CAN_LINK_STATIC
- Enable CONFIG_ARCH_HAS_NOLIBC for m68k and SPARC
- Properly handle 'clean' target for userprogs
- Use ramfs over tmpfs to reduce dependencies
- Inherit userprogs byte order and ABI from kernel
- Drop now unnecessary "#ifndef NOLIBC"
- Pick up review tags
- Drop usage of __private in blob.h,
sparse complains and it is not really necessary
- Fix execution on loongarch when using clang
- Drop userprogs libgcc handling, it was ugly and is not yet necessary
- Link to v2: https://lore.kernel.org/r/20250407-kunit-kselftests-v2-0-454114e287fd@linut…
Changes in v2:
- Rebase onto v6.15-rc1
- Add documentation and kernel docs
- Resolve invalid kconfig breakages
- Drop already applied patch "kbuild: implement CONFIG_HEADERS_INSTALL for Usermode Linux"
- Drop userprogs CONFIG_WERROR integration, it doesn't need to be part of this series
- Replace patch prefix "kconfig" with "kbuild"
- Rename kunit_uapi_run_executable() to kunit_uapi_run_kselftest()
- Generate private, conflict-free symbols in the blob framework
- Handle kselftest exit codes
- Handle SIGABRT
- Forward output also to kunit debugfs log
- Install a fd=0 stdin filedescriptor
- Link to v1: https://lore.kernel.org/r/20250217-kunit-kselftests-v1-0-42b4524c3b0a@linut…
---
Thomas Weißschuh (15):
kbuild: userprogs: avoid duplication of flags inherited from kernel
kbuild: userprogs: also inherit byte order and ABI from kernel
kbuild: doc: add label for userprogs section
init: re-add CONFIG_CC_CAN_LINK_STATIC
init: add nolibc build support
fs,fork,exit: export symbols necessary for KUnit UAPI support
kunit: tool: Add test for nested test result reporting
kunit: tool: Don't overwrite test status based on subtest counts
kunit: tool: Parse skipped tests from kselftest.h
kunit: Always descend into kunit directory during build
kunit: qemu_configs: loongarch: Enable LSX/LSAX
kunit: Introduce UAPI testing framework
kunit: uapi: Add example for UAPI tests
kunit: uapi: Introduce preinit executable
kunit: uapi: Validate usability of /proc
Documentation/dev-tools/kunit/api/index.rst | 5 +
Documentation/dev-tools/kunit/api/uapi.rst | 14 +
Documentation/kbuild/makefiles.rst | 2 +
MAINTAINERS | 11 +
Makefile | 7 +-
fs/exec.c | 2 +
fs/file.c | 1 +
fs/filesystems.c | 2 +
fs/fs_struct.c | 1 +
fs/pipe.c | 2 +
include/kunit/uapi.h | 77 ++++++
init/Kconfig | 7 +
init/Kconfig.nolibc | 15 +
init/Makefile.nolibc | 13 +
kernel/exit.c | 3 +
kernel/fork.c | 2 +
lib/Makefile | 4 -
lib/kunit/Kconfig | 14 +
lib/kunit/Makefile | 27 +-
lib/kunit/kunit-example-test.c | 15 +
lib/kunit/kunit-example-uapi.c | 22 ++
lib/kunit/kunit-test-uapi.c | 51 ++++
lib/kunit/kunit-test.c | 23 +-
lib/kunit/kunit-uapi.c | 305 +++++++++++++++++++++
lib/kunit/uapi-preinit.c | 63 +++++
tools/testing/kunit/kunit_parser.py | 13 +-
tools/testing/kunit/kunit_tool_test.py | 11 +
tools/testing/kunit/qemu_configs/loongarch.py | 2 +
.../test_is_test_passed-failure-nested.log | 10 +
.../test_data/test_is_test_passed-kselftest.log | 3 +-
30 files changed, 714 insertions(+), 13 deletions(-)
---
base-commit: 9d5898b413d17510b2a41664a42390a2c79f8bf4
change-id: 20241015-kunit-kselftests-56273bc40442
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
I am submitting a new selftest for the netpoll subsystem specifically
targeting the case where the RX is polling in the TX path, which is
a case that we don't have any test in the tree today. This is done when
netpoll_poll_dev() called, and this test creates a scenario when that is
probably.
The test does the following:
1) Configuring a single RX/TX queue to increase contention on the
interface.
2) Generating background traffic to saturate the network, mimicking
real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor
its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to
delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is
called when expected. If it is called, then the test passes,
otherwise the test is marked as skipped.
In order to achieve it, I stole Jakub's bpftrace helper from [1], and
did some small changes that I found useful to use the helper.
So, this patchset basically contains:
1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself
Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]
---
Changes in v5:
- Rebased on top of net-next.
- Calling bpftrace_stop using the defer helper. (Willem)
- Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org
Changes in v4:
- Make the test XFail if it doesn't hit the function we are looking for
- Toggle the interface while the traffic is flowing.
- Bumped the number of messages from 10 to 40 per iterations.
* This is hitting ~15 times per run on my vng test.
- Decreased the time from 15 seconds to 10 seconds, given that if
it didn't hit the function in 10 seconds, 5 seconds extra will not
help.
- Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org
Changes in v3:
- Make pylint happy (Simon)
- Remove the unnecessary patch in bpftrace to raise an exception when it
fails. (Jakub)
- Improved the bpftrace code (Willem)
- Stop sending messages if bpftrace is not alive anymore.
- Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev().
In my current configuration, it is hitting multiple times during the
test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org
---
Breno Leitao (2):
selftests: drv-net: Strip '@' prefix from bpftrace map keys
selftests: net: add netpoll basic functionality test
Jakub Kicinski (1):
selftests: drv-net: add helper/wrapper for bpftrace
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/py/__init__.py | 3 +-
.../testing/selftests/drivers/net/netpoll_basic.py | 366 +++++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 35 ++
4 files changed, 404 insertions(+), 1 deletion(-)
---
base-commit: 0234362d0af4649bc2ff745e94d06d0c6f0a46ce
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v18:
- rebased on 6.16-rc1
- uprobe handling clears ELP in sstatus image in pt_regs
- vdso was missing shadow stack elf note for object files.
added that. Additional asm file for vdso needed the elf marker
flag. toolchain should complain if `-fcf-protection=full` and
marker is missing for object generated from asm file. Asked
toolchain folks to fix this. Although no reason to gate the merge
on that.
- Split up compile options for march and fcf-protection in vdso
Makefile
- CONFIG_RISCV_USER_CFI option is moved under "Kernel features" menu
Added `arch/riscv/configs/hardening.config` fragment which selects
CONFIG_RISCV_USER_CFI
v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with
"riscv/traps: Introduce software check exception and uprobe handling"
https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/
v16:
- If FWFT is not implemented or returns error for shadow stack activation, then
no_usercfi is set to disable shadow stack. Although this should be picked up
by extension validation and activation. Fixed this bug for zicfilp and zicfiss
both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by
Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to
keep it off till we have more hardware availibility with RVA23 profile and
zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
asm-offsets.c error.
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v18:
- Link to v17: https://lore.kernel.org/r/20250604-v5_user_cfi_series-v17-0-4565c2cf869f@ri…
Changes in v17:
- Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri…
Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv/mm: manufacture shadow stack pte
riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs
riscv/mm: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception and uprobe handling
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad and shadow stack note
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/Makefile | 5 +-
arch/riscv/configs/hardening.config | 4 +
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 26 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 1 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 10 +
arch/riscv/kernel/cpufeature.c | 27 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 27 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 54 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 11 +-
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/kernel/vdso/vgetrandom-chacha.S | 5 +-
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 16 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 16 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
56 files changed, 2383 insertions(+), 31 deletions(-)
---
base-commit: a2a05801de77ca5122fc34e3eb84d6359ef70389
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on fbnic. However, the
series is tested against multiple other drivers including netdevism.
Note: The XDP support for fbnic will be added later.
---
Change-log:
V2:
- Remove unused and libxdp-devel headers
- Fix reverse xmas tree in xdp_native.bpf.c
- Reorder headers in xdp_native.bpf.c
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Mohsin Bashir (5):
selftests: drv-net: Add bpftool util
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/drivers/net/xdp.py | 663 ++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 4 +
.../selftests/net/lib/xdp_native.bpf.c | 524 ++++++++++++++
5 files changed, 1193 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Code using IS_ENABLED(CONFIG_MODULES) as a C expression may need access
to the module structure definitions to compile.
Make sure these structure definitions are always visible.
This will conflict with commit 6bb37af62634 ("module: Move modprobe_path
and modules_disabled ctl_tables into the module subsys") from the sysctl
tree, but the resolution is trivial.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Pick up tags from v1
- Keep MODULE_ARCH_INIT and 'struct module' definitions together
- Link to v1: https://lore.kernel.org/r/20250612-kunit-ifdef-modules-v1-0-fdccd42dcff8@li…
---
Thomas Weißschuh (3):
module: move 'struct module_use' to internal.h
module: make structure definitions always visible
kunit: test: Drop CONFIG_MODULE ifdeffery
include/linux/module.h | 29 +++++++++++------------------
kernel/module/internal.h | 7 +++++++
lib/kunit/test.c | 8 --------
3 files changed, 18 insertions(+), 26 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250611-kunit-ifdef-modules-0fefd13ae153
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Code using IS_ENABLED(CONFIG_MODULES) as a C expression may need access
to the module structure definitions to compile.
Make sure these structure definitions are always visible.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (3):
module: move 'struct module_use' to internal.h
module: make structure definitions always visible
kunit: test: Drop CONFIG_MODULE ifdeffery
include/linux/module.h | 30 ++++++++++++------------------
kernel/module/internal.h | 7 +++++++
lib/kunit/test.c | 8 --------
3 files changed, 19 insertions(+), 26 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250611-kunit-ifdef-modules-0fefd13ae153
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
It is currently impossible to enable ipv6 forwarding on a per-interface
basis like in ipv4. To enable forwarding on an ipv6 interface we need to
enable it on all interfaces and disable it on the other interfaces using
a netfilter rule. This is especially cumbersome if you have lots of
interface and only want to enable forwarding on a few. According to the
sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding
for all interfaces, while the interface-specific
`net.ipv6.conf.<interface>.forwarding` configures the interface
Host/Router configuration.
Introduce a new sysctl flag `force_forwarding`, which can be set on every
interface. The ip6_forwarding function will then check if the global
forwarding flag OR the force_forwarding flag is active and forward the
packet.
To preserver backwards-compatibility reset the flag (on all interfaces)
to 0 if the net.ipv6.conf.all.forwarding flag is set to 0.
Add a short selftest that checks if a packet gets forwarded with and
without `force_forwarding`.
[0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com>
---
v5:
* update conf/all/forwarding docs
* simplified backwards-compat comment
* remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net()
already
* cange ip6_forward logic so that it doesn't depend on the idev
existing
* move WRITE_ONCE inside device lock
v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/
* actually write the sysctl value to the table
* use ASSERT_RTNL() when forwarding the sysctl change
* remove useless comments in function body
* simplify forwarding and force_forwarding check in ip6_output.c
* fix code backticks in Documentation (double instead of single)
* add selftests
v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
* remove forwarding=0 setting force_forwarding=0 globally.
* add min and max (0 and 1) value to sysctl.
v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/
* rename from `do_forwarding` to `force_forwarding`.
* add global `force_forwarding` flag which will enable
`force_forwarding` on every interface like the
`ipv4.all.forwarding` flag.
* `forwarding`=0 will disable global and per-interface
`force_forwarding`.
* export option as NETCONFA_FORCE_FORWARDING.
v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
Documentation/networking/ip-sysctl.rst | 9 +-
include/linux/ipv6.h | 1 +
include/uapi/linux/ipv6.h | 1 +
include/uapi/linux/netconf.h | 1 +
include/uapi/linux/sysctl.h | 1 +
net/ipv6/addrconf.c | 83 ++++++++++++++
net/ipv6/ip6_output.c | 3 +-
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++
9 files changed, 202 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/net/ipv6_force_forwarding.sh
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 0f1251cce314..6d92bae0257a 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -2281,8 +2281,8 @@ conf/all/disable_ipv6 - BOOLEAN
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
- IPv4 and IPv6 work differently here; e.g. netfilter must be used
- to control which interfaces may forward packets and which not.
+ IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must
+ be used to control which interfaces may forward packets.
This also sets all interfaces' Host/Router setting
'forwarding' to the specified value. See below for details.
@@ -2292,6 +2292,11 @@ conf/all/forwarding - BOOLEAN
proxy_ndp - BOOLEAN
Do proxy ndp.
+force_forwarding - BOOLEAN
+ Enable forwarding on this interface only -- regardless of the setting on
+ ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0,
+ the ``force_forwarding`` flag will be reset on all interfaces.
+
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv6 reply packets that are not
associated with a socket for example, TCP RSTs or ICMPv6 echo replies).
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 5aeeed22f35b..d975a86f29be 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -17,6 +17,7 @@ struct ipv6_devconf {
__s32 hop_limit;
__s32 mtu6;
__s32 forwarding;
+ __s32 force_forwarding;
__s32 disable_policy;
__s32 proxy_ndp;
__cacheline_group_end(ipv6_devconf_read_txrx);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index cf592d7b630f..d4d3ae774b26 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -199,6 +199,7 @@ enum {
DEVCONF_NDISC_EVICT_NOCARRIER,
DEVCONF_ACCEPT_UNTRACKED_NA,
DEVCONF_ACCEPT_RA_MIN_LFT,
+ DEVCONF_FORCE_FORWARDING,
DEVCONF_MAX
};
diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h
index fac4edd55379..1c8c84d65ae3 100644
--- a/include/uapi/linux/netconf.h
+++ b/include/uapi/linux/netconf.h
@@ -19,6 +19,7 @@ enum {
NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
NETCONFA_INPUT,
NETCONFA_BC_FORWARDING,
+ NETCONFA_FORCE_FORWARDING,
__NETCONFA_MAX
};
#define NETCONFA_MAX (__NETCONFA_MAX - 1)
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 8981f00204db..63d1464cb71c 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -573,6 +573,7 @@ enum {
NET_IPV6_ACCEPT_RA_FROM_LOCAL=26,
NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27,
NET_IPV6_RA_DEFRTR_METRIC=28,
+ NET_IPV6_FORCE_FORWARDING=29,
__NET_IPV6_MAX
};
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ba2ec7c870cc..92acf44febd1 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.ndisc_evict_nocarrier = 1,
.ra_honor_pio_life = 0,
.ra_honor_pio_pflag = 0,
+ .force_forwarding = 0,
};
static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
.ndisc_evict_nocarrier = 1,
.ra_honor_pio_life = 0,
.ra_honor_pio_pflag = 0,
+ .force_forwarding = 0,
};
/* Check if link is ready: is it up and is a valid qdisc available */
@@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf)
idev = __in6_dev_get_rtnl_net(dev);
if (idev) {
int changed = (!idev->cnf.forwarding) ^ (!newf);
+ /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */
+ if (newf == 0)
+ WRITE_ONCE(idev->cnf.force_forwarding, newf);
WRITE_ONCE(idev->cnf.forwarding, newf);
if (changed)
@@ -5719,6 +5724,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf,
array[DEVCONF_ACCEPT_UNTRACKED_NA] =
READ_ONCE(cnf->accept_untracked_na);
array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft);
+ array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding);
}
static inline size_t inet6_ifla6_size(void)
@@ -6747,6 +6753,76 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
return ret;
}
+static void addrconf_force_forward_change(struct net *net, __s32 newf)
+{
+ struct net_device *dev;
+ struct inet6_dev *idev;
+
+ for_each_netdev(net, dev) {
+ idev = __in6_dev_get_rtnl_net(dev);
+ if (idev) {
+ int changed = (!idev->cnf.force_forwarding) ^ (!newf);
+
+ WRITE_ONCE(idev->cnf.force_forwarding, newf);
+ if (changed) {
+ inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ dev->ifindex, &idev->cnf);
+ }
+ }
+ }
+}
+
+static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct inet6_dev *idev = ctl->extra1;
+ struct ctl_table tmp_ctl = *ctl;
+ struct net *net = ctl->extra2;
+ int *valp = ctl->data;
+ int new_val = *valp;
+ int old_val = *valp;
+ loff_t pos = *ppos;
+ int ret;
+
+ tmp_ctl.extra1 = SYSCTL_ZERO;
+ tmp_ctl.extra2 = SYSCTL_ONE;
+ tmp_ctl.data = &new_val;
+
+ ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos);
+
+ if (write && old_val != new_val) {
+ if (!rtnl_net_trylock(net))
+ return restart_syscall();
+
+ WRITE_ONCE(*valp, new_val);
+
+ if (valp == &net->ipv6.devconf_dflt->force_forwarding) {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ NETCONFA_IFINDEX_DEFAULT,
+ net->ipv6.devconf_dflt);
+ } else if (valp == &net->ipv6.devconf_all->force_forwarding) {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ NETCONFA_IFINDEX_ALL,
+ net->ipv6.devconf_all);
+
+ addrconf_force_forward_change(net, new_val);
+ } else {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ idev->dev->ifindex,
+ &idev->cnf);
+ }
+ rtnl_net_unlock(net);
+ }
+
+ if (ret)
+ *ppos = pos;
+ return ret;
+}
+
static int minus_one = -1;
static const int two_five_five = 255;
static u32 ioam6_if_id_max = U16_MAX;
@@ -7217,6 +7293,13 @@ static const struct ctl_table addrconf_sysctl[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
+ {
+ .procname = "force_forwarding",
+ .data = &ipv6_devconf.force_forwarding,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = addrconf_sysctl_force_forwarding,
+ },
};
static int __addrconf_sysctl_register(struct net *net, char *dev_name,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7bd29a9ff0db..3853090d7282 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -509,7 +509,8 @@ int ip6_forward(struct sk_buff *skb)
u32 mtu;
idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif));
- if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0)
+ if (!READ_ONCE(net->ipv6.devconf_all->forwarding) &&
+ (!idev || !READ_ONCE(idev->cnf.force_forwarding)))
goto error;
if (skb->pkt_type != PACKET_HOST)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 332f387615d7..f64ec8a15a77 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -112,6 +112,7 @@ TEST_PROGS += skf_net_off.sh
TEST_GEN_FILES += skf_net_off
TEST_GEN_FILES += tfo
TEST_PROGS += tfo_passive.sh
+TEST_PROGS += ipv6_force_forwarding.sh
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := busy_poller netlink-dumps
diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh
new file mode 100644
index 000000000000..62adc9d4afc9
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test IPv6 force_forwarding interface property
+#
+# This test verifies that the force_forwarding property works correctly:
+# - When global forwarding is disabled, packets are not forwarded normally
+# - When force_forwarding is enabled on an interface, packets are forwarded
+# regardless of the global forwarding setting
+
+source lib.sh
+
+cleanup() {
+ cleanup_ns $ns1 $ns2 $ns3
+}
+
+trap cleanup EXIT
+
+setup_test() {
+ # Create three namespaces: sender, router, receiver
+ setup_ns ns1 ns2 ns3
+
+ # Create veth pairs: ns1 <-> ns2 <-> ns3
+ ip link add name veth12 type veth peer name veth21
+ ip link add name veth23 type veth peer name veth32
+
+ # Move interfaces to namespaces
+ ip link set veth12 netns $ns1
+ ip link set veth21 netns $ns2
+ ip link set veth23 netns $ns2
+ ip link set veth32 netns $ns3
+
+ # Configure interfaces
+ ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12
+ ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21
+ ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23
+ ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32
+
+ # Bring up interfaces
+ ip -n $ns1 link set veth12 up
+ ip -n $ns2 link set veth21 up
+ ip -n $ns2 link set veth23 up
+ ip -n $ns3 link set veth32 up
+
+ # Add routes
+ ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2
+ ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1
+
+ # Disable global forwarding
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0
+}
+
+test_force_forwarding() {
+ local ret=0
+
+ echo "TEST: force_forwarding functionality"
+
+ # Check if force_forwarding sysctl exists
+ if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then
+ echo "SKIP: force_forwarding not available"
+ return $ksft_skip
+ fi
+
+ # Test 1: Without force_forwarding, ping should fail
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0
+
+ if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+ echo "FAIL: ping succeeded when forwarding disabled"
+ ret=1
+ else
+ echo "PASS: forwarding disabled correctly"
+ fi
+
+ # Test 2: With force_forwarding enabled, ping should succeed
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1
+
+ if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+ echo "PASS: force_forwarding enabled forwarding"
+ else
+ echo "FAIL: ping failed with force_forwarding enabled"
+ ret=1
+ fi
+
+ return $ret
+}
+
+echo "IPv6 force_forwarding test"
+echo "=========================="
+
+setup_test
+test_force_forwarding
+ret=$?
+
+if [ $ret -eq 0 ]; then
+ echo "OK"
+ exit 0
+elif [ $ret -eq $ksft_skip ]; then
+ echo "SKIP"
+ exit $ksft_skip
+else
+ echo "FAIL"
+ exit 1
+fi
--
2.39.5
Show precise rejected function name when attaching to __noreturn and
__btf_id functions.
Add selftest for attaching tracing to __btf_id functions.
---
KaFai Wan (3):
bpf: Show precise rejected function when attaching to __noreturn
functions
bpf: Show precise rejected function when attaching to __btf_id
functions
selftests/bpf: Add selftest for attaching tracing to __btf_id
functions
kernel/bpf/verifier.c | 5 ++++-
.../selftests/bpf/prog_tests/tracing_btf_ids.c | 16 ++++++++++++++++
.../selftests/bpf/progs/fexit_noreturns.c | 2 +-
.../selftests/bpf/progs/tracing_btf_ids.c | 15 +++++++++++++++
4 files changed, 36 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_btf_ids.c
create mode 100644 tools/testing/selftests/bpf/progs/tracing_btf_ids.c
--
2.43.0
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently - grouping sanity checks together, separately those that
check input parameters and those relating to VMAs.
we also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
The input and output addresses ranges must not overlap. We carefully
account for moves which would result in VMA merges or would otherwise
result in VMA iterator invalidation.
Lorenzo Stoakes (10):
mm/mremap: perform some simple cleanups
mm/mremap: refactor initial parameter sanity checks
mm/mremap: put VMA check and prep logic into helper function
mm/mremap: cleanup post-processing stage of mremap
mm/mremap: use an explicit uffd failure path for mremap
mm/mremap: check remap conditions earlier
mm/mremap: move remap_is_valid() into check_prep_vma()
mm/mremap: clean up mlock populate behaviour
mm/mremap: permit mremap() move of multiple VMAs
tools/testing/selftests: extend mremap_test to test multi-VMA mremap
fs/userfaultfd.c | 15 +-
include/linux/userfaultfd_k.h | 1 +
mm/mremap.c | 502 ++++++++++++++---------
tools/testing/selftests/mm/mremap_test.c | 145 ++++++-
4 files changed, 462 insertions(+), 201 deletions(-)
--
2.50.0
Various KUnit tests require PCI infrastructure to work.
All normal platforms enable PCI by default, but UML does not.
Enabling PCI from .kunitconfig files is problematic as it would not be
portable. So in commit 6fc3a8636a7b ("kunit: tool: Enable virtio/PCI by default on UML")
PCI was enabled by way of CONFIG_UML_PCI_OVER_VIRTIO=y.
However CONFIG_UML_PCI_OVER_VIRTIO requires additional configuration of
CONFIG_UML_PCI_OVER_VIRTIO_DEVICE_ID or will otherwise trigger a WARN() in
virtio_pcidev_init(). However there is no one correct value for
UML_PCI_OVER_VIRTIO_DEVICE_ID which could be used by default.
This warning is confusing when debugging test failures.
On the other hand, the functionality of CONFIG_UML_PCI_OVER_VIRTIO is not
used at all, given that it is completely non-functional as indicated by
the WARN() in question. Instead it is only used as a way to enable
CONFIG_UML_PCI which itself is not directly configurable.
Instead of going through CONFIG_UML_PCI_OVER_VIRTIO, introduce a custom
configuration option which enables CONFIG_UML_PCI without triggering
warnings or building dead code.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
lib/kunit/Kconfig | 7 +++++++
tools/testing/kunit/configs/arch_uml.config | 5 ++---
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index a97897edd9642f3e5df7fdd9dee26ee5cf00d6a4..c8ca155521b2455a221ddbec3f6fc55662c83475 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -93,4 +93,11 @@ config KUNIT_AUTORUN_ENABLED
In most cases this should be left as Y. Only if additional opt-in
behavior is needed should this be set to N.
+config KUNIT_UML_PCI
+ bool "KUnit UML PCI Support"
+ depends on UML
+ select UML_PCI
+ help
+ Enables the PCI subsystem on UML for use by KUnit tests.
+
endif # KUNIT
diff --git a/tools/testing/kunit/configs/arch_uml.config b/tools/testing/kunit/configs/arch_uml.config
index 54ad8972681a2cc724e6122b19407188910b9025..28edf816aa70e6f408d9486efff8898df79ee090 100644
--- a/tools/testing/kunit/configs/arch_uml.config
+++ b/tools/testing/kunit/configs/arch_uml.config
@@ -1,8 +1,7 @@
# Config options which are added to UML builds by default
-# Enable virtio/pci, as a lot of tests require it.
-CONFIG_VIRTIO_UML=y
-CONFIG_UML_PCI_OVER_VIRTIO=y
+# Enable pci, as a lot of tests require it.
+CONFIG_KUNIT_UML_PCI=y
# Enable FORTIFY_SOURCE for wider checking.
CONFIG_FORTIFY_SOURCE=y
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250626-kunit-uml-pci-a2b687553746
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
From: Benjamin Berg <benjamin.berg(a)intel.com>
Hi,
This patchset adds signal handling to nolibc. Initially, I would like to
use this for tests. But in the long run, the goal is to use nolibc for
the UML kernel itself. In both cases, signal handling will be needed.
Benjamin
Benjamin Berg (3):
tools/nolibc: show failed run if test process crashes
tools/nolibc: add more generic BITSET_* macros for FD_*
tools/nolibc: add signal support
tools/include/nolibc/arch-arm.h | 7 ++
tools/include/nolibc/arch-arm64.h | 3 +
tools/include/nolibc/arch-loongarch.h | 3 +
tools/include/nolibc/arch-m68k.h | 10 ++
tools/include/nolibc/arch-mips.h | 3 +
tools/include/nolibc/arch-powerpc.h | 8 ++
tools/include/nolibc/arch-riscv.h | 3 +
tools/include/nolibc/arch-s390.h | 8 +-
tools/include/nolibc/arch-sh.h | 5 +
tools/include/nolibc/arch-sparc.h | 47 ++++++++
tools/include/nolibc/arch-x86.h | 13 +++
tools/include/nolibc/signal.h | 103 ++++++++++++++++++
tools/include/nolibc/sys.h | 2 +-
tools/include/nolibc/time.h | 3 +-
tools/include/nolibc/types.h | 67 ++++++------
.../testing/selftests/nolibc/Makefile.nolibc | 3 +-
tools/testing/selftests/nolibc/nolibc-test.c | 67 ++++++++++++
17 files changed, 319 insertions(+), 36 deletions(-)
--
2.50.0
I got the following warning when writing other tests:
+ handle_test_result_pass 'bond 802.3ad' '(lacp_active off)'
+ local 'test_name=bond 802.3ad'
+ shift
+ local 'opt_str=(lacp_active off)'
+ shift
+ log_test_result 'bond 802.3ad' '(lacp_active off)' ' OK '
+ local 'test_name=bond 802.3ad'
+ shift
+ local 'opt_str=(lacp_active off)'
+ shift
+ local 'result= OK '
+ shift
+ local retmsg=
+ shift
/net/tools/testing/selftests/net/forwarding/../lib.sh: line 315: shift: shift count out of range
This happens because an extra shift is executed even after all arguments
have been consumed. Remove the last shift in log_test_result() to avoid
this warning.
Fixes: a923af1ceee7 ("selftests: forwarding: Convert log_test() to recognize RET values")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/net/lib.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 006fdadcc4b9..86a216e9aca8 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -312,7 +312,7 @@ log_test_result()
local test_name=$1; shift
local opt_str=$1; shift
local result=$1; shift
- local retmsg=$1; shift
+ local retmsg=$1
printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
if [[ $retmsg ]]; then
--
2.46.0
In 7e9b231c402a ("KVM: selftests: Add a KVM_IRQFD test to verify
uniqueness requirements") we added a test for the newly added irqfd
support but since this feature works with eventfds it won't work unless
the kernel has been built wth eventfd support. Add CONFIG_EVENTFD to
the list of required options for the KVM selftests.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/kvm/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/kvm/config b/tools/testing/selftests/kvm/config
index 8835fed09e9f..96d874b239eb 100644
--- a/tools/testing/selftests/kvm/config
+++ b/tools/testing/selftests/kvm/config
@@ -1,5 +1,6 @@
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
+CONFIG_EVENTFD=y
CONFIG_USERFAULTFD=y
CONFIG_IDLE_PAGE_TRACKING=y
---
base-commit: 7e9b231c402a297251b3e6e0f5cc16cef7dd3ce5
change-id: 20250709-kvm-selftests-eventfd-config-123bb022fa04
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The kunit test that checks the longests symbol length [1], has triggered
warnings in some pilelines when symbol prefixes are used [2][3]. The test
will to depend on !PREFIX_SYMBOLS and !CFI_CLANG as sujested in [4] and
on !GCOV_KERNEL.
[1] https://lore.kernel.org/rust-for-linux/CABVgOSm=5Q0fM6neBhxSbOUHBgNzmwf2V22…
[2] https://lore.kernel.org/all/20250328112156.2614513-1-arnd@kernel.org/T/#u
[3] https://lore.kernel.org/rust-for-linux/bbd03b37-c4d9-4a92-9be2-75aaf8c19815…
[4] https://lore.kernel.org/linux-kselftest/20250427200916.GA1661412@ax162/T/#t
Reviewed-by: Rae Moar <rmoar(a)google.com>
Signed-off-by: Sergio González Collado <sergio.collado(a)gmail.com>
Acked-by: Randy Dunlap <rdunlap(a)infradead.org>
Tested-by: Randy Dunlap <rdunlap(a)infradead.org>
---
lib/Kconfig.debug | 1 +
lib/tests/longest_symbol_kunit.c | 3 +--
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ebe33181b6e6..4a75a52803b6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2885,6 +2885,7 @@ config FORTIFY_KUNIT_TEST
config LONGEST_SYM_KUNIT_TEST
tristate "Test the longest symbol possible" if !KUNIT_ALL_TESTS
depends on KUNIT && KPROBES
+ depends on !PREFIX_SYMBOLS && !CFI_CLANG && !GCOV_KERNEL
default KUNIT_ALL_TESTS
help
Tests the longest symbol possible
diff --git a/lib/tests/longest_symbol_kunit.c b/lib/tests/longest_symbol_kunit.c
index e3c28ff1807f..9b4de3050ba7 100644
--- a/lib/tests/longest_symbol_kunit.c
+++ b/lib/tests/longest_symbol_kunit.c
@@ -3,8 +3,7 @@
* Test the longest symbol length. Execute with:
* ./tools/testing/kunit/kunit.py run longest-symbol
* --arch=x86_64 --kconfig_add CONFIG_KPROBES=y --kconfig_add CONFIG_MODULES=y
- * --kconfig_add CONFIG_RETPOLINE=n --kconfig_add CONFIG_CFI_CLANG=n
- * --kconfig_add CONFIG_MITIGATION_RETPOLINE=n
+ * --kconfig_add CONFIG_CPU_MITIGATIONS=n --kconfig_add CONFIG_GCOV_KERNEL=n
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
base-commit: 772b78c2abd85586bb90b23adff89f7303c704c7
--
2.39.2
This patch set improves the documentation and selftests for XDP Rx metadata
handling. The first patch clarifies the documentation around XDP metadata
layout and METADATA_SIZE. The second patch enhances the BPF selftests to
make XDP metadata handling more robust across different NICs.
Prior to this patch set, the XDP program might accidentally overwrite the
device-reserved metadata.
V3:
- update doc and commit msg accordingly.
V2: https://lore.kernel.org/netdev/20250702030349.3275368-1-yoong.siang.song@in…
- unconditionally do bpf_xdp_adjust_meta with -XDP_METADATA_SIZE (Stanislav)
V1: https://lore.kernel.org/netdev/20250701042940.3272325-1-yoong.siang.song@in…
Song Yoong Siang (2):
doc: enhance explanation of XDP Rx metadata layout and METADATA_SIZE
selftests/bpf: Enhance XDP Rx metadata handling
Documentation/networking/xdp-rx-metadata.rst | 36 +++++++++++++++----
.../selftests/bpf/prog_tests/xdp_metadata.c | 2 +-
.../selftests/bpf/progs/xdp_hw_metadata.c | 2 +-
.../selftests/bpf/progs/xdp_metadata.c | 2 +-
tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +-
tools/testing/selftests/bpf/xdp_metadata.h | 7 ++++
6 files changed, 41 insertions(+), 10 deletions(-)
--
2.34.1
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
Original patch only target to vntb driver. But actually it is common
method.
This patches add new API to pci-epf-core, so any EP driver can use it.
Previous v2 discussion here.
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v20:
- remove set epf of_node's patch and only support one epf now.
- move imx6's patch to first
- detail change see each patches' change log
- Link to v19: https://lore.kernel.org/r/20250609-ep-msi-v19-0-77362eaa48fa@nxp.com
Changes in v19:
- irq part already in v6.16-rc1, only missed pcie/dts part
- rebase to v6.16-rc1
- update commit message for patch IMMUTABLE check.
- Link to v18: https://lore.kernel.org/r/20250414-ep-msi-v18-0-f69b49917464@nxp.com
Changes in v18:
- pci-ep.yaml: sort property order, fix maxvalue to 0x7ffff for msi-map-mask and
iommu-map-mask
- Link to v17: https://lore.kernel.org/r/20250407-ep-msi-v17-0-633ab45a31d0@nxp.com
Changes in v17:
- move document part to pci-ep.yaml
- Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com
Changes in v16:
- remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
because there are better patches, which under review.
- Add document for pcie-ep msi-map usage
- other change to see each patch's change log
About IMMUTABLE (No change for this part, tglx provide feedback)
> - This IMMUTABLE thing serves no purpose, because you don't randomly
> plug this end-point block on any MSI controller. They come as part
> of an SoC.
"Yes and no. The problem is that the EP implementation is meant to be a
generic library and while GIC-ITS guarantees immutability of the
address/data pair after setup, there are architectures (x86, loongson,
riscv) where the base MSI controller does not and immutability is only
achieved when interrupt remapping is enabled. The latter can be disabled
at boot-time and then the EP implementation becomes a lottery across
affinity changes.
That was my concern about this library implementation and that's why I
asked for a mechanism to ensure that the underlying irqdomain provides a
immutable address/data pair.
So it does not matter for GIC-ITS, but in the larger picture it matters.
Thanks,
tglx
"
So it does not matter for GIC-ITS, but in the larger picture it matters.
- Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com
Changes in v15:
- rebase to v6.14-rc1
- fix build issue find by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created/binded to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope capture all your points in review comments. If missed, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big change in pci-ep-msi.c. I am sure if go on the
corrent path. The key improvement is remove only 1 function devices's
limitation.
I use new patch for imutable check, which relative additional
feature compared to base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove only support 1 endpoint function limiation.
- Create one MSI domain for each endpoint function devices.
- Use "msi-map" in pci ep controler node, instead of of msi-parent. first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
Changes in v9
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc in to doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's test by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to epf test function driver for more flexiable user case
- Add fixed size bar handler
- Some minor improvememtn to see each patches's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use new method to avoid compatible problem.
Add new command DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B send DOORBELL_ENABLE first, EP test function driver try to
remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old
driver don't support new command, so failure return, not side effect.
After test, DOORBELL_DISABLE command send out to recover original map, so
pcitest bar test can pass as normal.
- Other detail change see each patches's change log
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Fixed manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distingiush old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use new device ID, which identify support doorbell to avoid broken
compatility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side RC with old driver RC with new driver
PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled
Other device ID doorbell disabled* doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpont/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general help function for EPC driver to alloc platform msi irq.
- Fixed manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (9):
PCI: imx6: Add helper function imx_pcie_add_lut_by_rid()
PCI: imx6: Add LUT configuration for MSI/IOMMU in Endpoint mode
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
arm64: dts: imx95: Add msi-map for pci-ep device
Documentation/PCI/endpoint/pci-test-howto.rst | 14 +++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/misc/pci_endpoint_test.c | 85 ++++++++++++-
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Kconfig | 8 ++
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 136 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 98 +++++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 44 +++++++
include/linux/pci-ep-msi.h | 28 +++++
include/linux/pci-epf.h | 18 +++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 28 +++++
13 files changed, 478 insertions(+), 9 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
--
Frank Li <Frank.Li(a)nxp.com>
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently - grouping sanity checks together, separately those that
check input parameters and those relating to VMAs.
we also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
The input and output addresses ranges must not overlap. We carefully
account for moves which would result in VMA merges or would otherwise
result in VMA iterator invalidation.
v2:
* Squashed uffd stub fix into series.
* Propagated tags, thanks!
* Fixed param naming in patch 4 as per Vlastimil.
* Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
Liam.
* Correctly return -EFAULT if no VMAs in input range.
* Account for get_unmapped_area() disregarding MAP_FIXED and returning an
altered address.
* Added additional explanatatory comment to the remap_move() function.
v1:
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (10):
mm/mremap: perform some simple cleanups
mm/mremap: refactor initial parameter sanity checks
mm/mremap: put VMA check and prep logic into helper function
mm/mremap: cleanup post-processing stage of mremap
mm/mremap: use an explicit uffd failure path for mremap
mm/mremap: check remap conditions earlier
mm/mremap: move remap_is_valid() into check_prep_vma()
mm/mremap: clean up mlock populate behaviour
mm/mremap: permit mremap() move of multiple VMAs
tools/testing/selftests: extend mremap_test to test multi-VMA mremap
fs/userfaultfd.c | 15 +-
include/linux/userfaultfd_k.h | 5 +
mm/mremap.c | 528 ++++++++++++++---------
tools/testing/selftests/mm/mremap_test.c | 145 ++++++-
4 files changed, 492 insertions(+), 201 deletions(-)
--
2.50.0