From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v8.
v8 (09-May-25)
- Update pkt_sched.h with the one in nex-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
v7 (05-May-25)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-25)
- Update JSON file output due to spec modification in tc.yaml of net-next
v5 (25-Mar-25)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-25)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.
v3 (21-Feb-25)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-24)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
For more details of DualPI2, please refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 68 +++++
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 +++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 534 +++++++++++++++++++++++++++++++++
8 files changed, 894 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
This series addresses a regression in ethtool flow steering where rules
targeting the default RSS context (context 0) were incorrectly rejected.
The default RSS context always exists but is not stored in the rss_ctx
xarray like additional contexts. The current validation logic was
checking for the existence of context 0 in this array, causing valid
flow steering rules to be rejected.
This prevented configurations such as:
- High priority rules directing specific traffic to the default context
- Low priority catch-all rules directing remaining traffic to additional
contexts
Patch 1 fixes the validation logic to skip the existence check for
context 0.
Patch 2 adds a selftest that verifies this behavior.
Changelog -
v1->v2: https://lore.kernel.org/all/20250225071348.509432-1-gal@nvidia.com/
* Reworded commit message.
* Added a selftest.
Gal Pressman (2):
net: ethtool: Don't check if RSS context exists in case of context 0
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting
default RSS context
net/ethtool/ioctl.c | 3 +-
.../selftests/drivers/net/hw/rss_ctx.py | 59 ++++++++++++++++++-
2 files changed, 60 insertions(+), 2 deletions(-)
--
2.40.1
Cong reported an issue where running 'test_sockmap' in the current
bpf-next tree results in an error [1].
The specific test case that triggered the error is a combined test
involving ktls and bpf_msg_pop_data().
Root Cause:
When sending plaintext data, we initially calculated the corresponding
ciphertext length. However, if we later reduced the plaintext data length
via socket policy, we failed to recalculate the ciphertext length.
This results in transmitting buffers containing uninitialized data during
ciphertext transmission.
This causes uninitialized bytes to be appended after a complete
"Application Data" packet, leading to errors on the receiving end when
parsing TLS record.
This issue has existed for a long time but was only exposed after the
following test code was merged.
commit 47eae080410b ("selftests/bpf: Add more tests for test_txmsg_push_pop in test_sockmap")
Although we already had tests for pop data before this commit, the
pop data length was insufficient (less than 5 bytes). This meant that the
corrupted TLS records with data length <5 bytes were cached without being
parsed, resulting in no error being triggered.
After this fix, all tests pass.
1/ 6 sockmap::txmsg test passthrough:OK
2/ 6 sockmap::txmsg test redirect:OK
3/ 2 sockmap::txmsg test redirect wait send mem:OK
4/ 6 sockmap::txmsg test drop:OK
5/ 6 sockmap::txmsg test ingress redirect:OK
6/ 7 sockmap::txmsg test skb:OK
7/12 sockmap::txmsg test apply:OK
8/12 sockmap::txmsg test cork:OK
9/ 3 sockmap::txmsg test hanging corks:OK
10/11 sockmap::txmsg test push_data:OK
11/17 sockmap::txmsg test pull-data:OK
12/ 9 sockmap::txmsg test pop-data:OK
13/ 6 sockmap::txmsg test push/pop data:OK
14/ 1 sockmap::txmsg test ingress parser:OK
15/ 1 sockmap::txmsg test ingress parser2:OK
16/ 6 sockhash::txmsg test passthrough:OK
17/ 6 sockhash::txmsg test redirect:OK
18/ 2 sockhash::txmsg test redirect wait send mem:OK
19/ 6 sockhash::txmsg test drop:OK
20/ 6 sockhash::txmsg test ingress redirect:OK
21/ 7 sockhash::txmsg test skb:OK
22/12 sockhash::txmsg test apply:OK
23/12 sockhash::txmsg test cork:OK
24/ 3 sockhash::txmsg test hanging corks:OK
25/11 sockhash::txmsg test push_data:OK
26/17 sockhash::txmsg test pull-data:OK
27/ 9 sockhash::txmsg test pop-data:OK
28/ 6 sockhash::txmsg test push/pop data:OK
29/ 1 sockhash::txmsg test ingress parser:OK
30/ 1 sockhash::txmsg test ingress parser2:OK
31/ 6 sockhash:ktls:txmsg test passthrough:OK
32/ 6 sockhash:ktls:txmsg test redirect:OK
33/ 2 sockhash:ktls:txmsg test redirect wait send mem:OK
34/ 6 sockhash:ktls:txmsg test drop:OK
35/ 6 sockhash:ktls:txmsg test ingress redirect:OK
36/ 7 sockhash:ktls:txmsg test skb:OK
37/12 sockhash:ktls:txmsg test apply:OK
38/12 sockhash:ktls:txmsg test cork:OK
39/ 3 sockhash:ktls:txmsg test hanging corks:OK
40/11 sockhash:ktls:txmsg test push_data:OK
41/17 sockhash:ktls:txmsg test pull-data:OK
42/ 9 sockhash:ktls:txmsg test pop-data:OK
43/ 6 sockhash:ktls:txmsg test push/pop data:OK
44/ 1 sockhash:ktls:txmsg test ingress parser:OK
45/ 0 sockhash:ktls:txmsg test ingress parser2:OK
Pass: 45 Fail: 0
[1]: https://lore.kernel.org/bpf/CAM_iQpU7=4xjbefZoxndKoX9gFFMOe7FcWMq5tHBsymbrn…
---
v2 -> v1: Removed WARN_ON() check and added Reviewed-by tag.
https://lore.kernel.org/bpf/20250605145529.3q3aqr6iygpa6xg6@gmail.com/
Jiayuan Chen (2):
bpf,ktls: Fix data corruption when using bpf_msg_pop_data() in ktls
selftests/bpf: Add test to cover ktls with bpf_msg_pop_data
net/tls/tls_sw.c | 13 +++
.../selftests/bpf/prog_tests/sockmap_ktls.c | 91 +++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_ktls.c | 4 +
3 files changed, 108 insertions(+)
--
2.47.1
When running the memfd_secret test run_vmtests.sh unconditionally tries
to confgiure the YAMA LSM's ptrace_scope configuration, leading to an error
if YAMA is not in the running kernel:
# ./run_vmtests.sh: line 432: /proc/sys/kernel/yama/ptrace_scope: No such file or directory
# # ----------------------
# # running ./memfd_secret
# # ----------------------
Check that this file is present before trying to write to it.
The indentation here is a bit odd, and it doesn't seem great that we
configure but don't restore ptrace_scope.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index dddd1dd8af14..33fc7fafa8f9 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -429,7 +429,9 @@ CATEGORY="vma_merge" run_test ./merge
if [ -x ./memfd_secret ]
then
-(echo 0 > /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
+if [ -f /proc/sys/kernel/yama/ptrace_scope ]; then
+ (echo 0 > /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
+fi
CATEGORY="memfd_secret" run_test ./memfd_secret
fi
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250605-selftest-mm-enable-yama-1541c2d2ddcd
Best regards,
--
Mark Brown <broonie(a)kernel.org>
A collection of non-functional updates from David Hildenbrand's review.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (4):
kselftest/mm: Clarify errors for pipe()
selftests/mm: Convert some cow error reports to ksft_perror()
selftests/mm: Don't compare return values to in cow
selftests/mm: Add messages about test errors to the cow tests
tools/testing/selftests/mm/cow.c | 44 +++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 16 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250603-selftest-mm-cow-tweaks-0199c5132b3e
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Fixes and cleanups for various issues in the vDSO selftests.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Rebase on v6.16-rc1
- Preserve vgetrandom_put_state()
- Pick up vdso_standalone_test_x86 into this series
- Link to v2: https://lore.kernel.org/r/20250505-selftests-vdso-fixes-v2-0-3bc86e42f242@l…
Changes in v2:
- Refer to -Wstrict-prototypes over -Wold-style-prototypes
- Pick up Acks
- Enable fixed warnings in Makefile
- Link to v1: https://lore.kernel.org/r/20250502-selftests-vdso-fixes-v1-0-fb5d640a4f78@l…
---
Thomas Weißschuh (9):
selftests: vDSO: chacha: Correctly skip test if necessary
selftests: vDSO: clock_getres: Drop unused include of err.h
selftests: vDSO: vdso_test_getrandom: Drop unused include of linux/compiler.h
selftests: vDSO: vdso_test_getrandom: Avoid -Wunused
selftests: vDSO: vdso_config: Avoid -Wunused-variables
selftests: vDSO: enable -Wall
selftests: vDSO: vdso_test_correctness: Fix -Wstrict-prototypes
selftests: vDSO: vdso_test_getrandom: Always print TAP header
selftests: vDSO: vdso_standalone_test_x86: Replace source file with symlink
tools/testing/selftests/vDSO/Makefile | 2 +-
tools/testing/selftests/vDSO/vdso_config.h | 2 +
.../selftests/vDSO/vdso_standalone_test_x86.c | 59 +---------------------
tools/testing/selftests/vDSO/vdso_test_chacha.c | 3 +-
.../selftests/vDSO/vdso_test_clock_getres.c | 1 -
.../testing/selftests/vDSO/vdso_test_correctness.c | 2 +-
tools/testing/selftests/vDSO/vdso_test_getrandom.c | 10 ++--
7 files changed, 13 insertions(+), 66 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250423-selftests-vdso-fixes-d2ce74142359
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Hi,
While running the nolibc tests I discovered that they build a kernel in
the current directory, including overwriting the existing .config. This
is rather suprising for the selftests build system - it usually wouldn't
do a kernel build at all - and might be annoying for users.
KUnit deals with this by doing it's kernel build in a .kunit directory,
it'd probably be good to do something similar for nolibc.
Thanks,
Mark
Initially netpoll and netconsole were created together, and some
functions are in the wrong file. Seperate netconsole-only functions
in netconsole, avoiding exports.
1. Expose netpoll logging macros in the public header to enable consistent
log formatting across netpoll consumers.
2. Relocate netconsole-specific functions from netpoll to the netconsole
module where they are actually used, reducing unnecessary coupling.
3. Remove unnecessary function exports
4. Rename netpoll parsing functions in netconsole to better reflect their
specific usage.
5. Create a test to check that cmdline works fine. This was in my todo
list since [1], this was a good time to add it here to make sure this
patchset doesn't regress.
PS: The code was split in a way that it is easy to review. When copying
the functions from netpoll to netconsole, I do not change than other
than adding `static`. This will make checkpatch unhappy, but, further
patches will address the issues. It is done this way to make it easy for
reviewers.
Link: https://lore.kernel.org/netdev/Z36TlACdNMwFD7wv@dev-ushankar.dev.purestorag… [1]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Breno Leitao (7):
netpoll: remove __netpoll_cleanup from exported API
netpoll: expose netpoll logging macros in public header
netpoll: relocate netconsole-specific functions to netconsole module
netpoll: move netpoll_print_options to netconsole
netconsole: rename functions to better reflect their purpose
netconsole: improve code style in parser function
selftest: netconsole: add test for cmdline configuration
drivers/net/netconsole.c | 137 ++++++++++++++++++++-
include/linux/netpoll.h | 10 +-
net/core/netpoll.c | 136 +-------------------
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 39 +++++-
.../selftests/drivers/net/netcons_cmdline.sh | 52 ++++++++
6 files changed, 228 insertions(+), 147 deletions(-)
---
base-commit: 2c7e4a2663a1ab5a740c59c31991579b6b865a26
change-id: 20250603-rework-c175cad8d22e
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Most of the packetdrill tests have not flaked once last week.
Add the few which did to the XFAIL list.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: matttbe(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
Every time I sit down to add more I plan to just XFAIL all of packetdrill
on slow machines, but then I convince myself otherwise. One last time?
---
tools/testing/selftests/net/packetdrill/ksft_runner.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/packetdrill/ksft_runner.sh b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
index ef8b25a606d8..c5b01e1bd4c7 100755
--- a/tools/testing/selftests/net/packetdrill/ksft_runner.sh
+++ b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
@@ -39,11 +39,15 @@ if [[ -n "${KSFT_MACHINE_SLOW}" ]]; then
# xfail tests that are known flaky with dbg config, not fixable.
# still run them for coverage (and expect 100% pass without dbg).
declare -ar xfail_list=(
+ "tcp_blocking_blocking-connect.pkt"
+ "tcp_blocking_blocking-read.pkt"
"tcp_eor_no-coalesce-retrans.pkt"
"tcp_fast_recovery_prr-ss.*.pkt"
+ "tcp_sack_sack-route-refresh-ip-tos.pkt"
"tcp_slow_start_slow-start-after-win-update.pkt"
"tcp_timestamping.*.pkt"
"tcp_user_timeout_user-timeout-probe.pkt"
+ "tcp_zerocopy_cl.*.pkt"
"tcp_zerocopy_epoll_.*.pkt"
"tcp_tcp_info_tcp-info-.*-limited.pkt"
)
--
2.49.0
As titled, adding version file to kselftest installation dir, so the user
of the tarball can know which kernel version the tarball belongs to.
Signed-off-by: Tianyi Cui <1997cui(a)gmail.com>
---
tools/testing/selftests/Makefile | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index a0a6ba47d600..246e9863b45b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -291,6 +291,12 @@ ifdef INSTALL_PATH
$(MAKE) -s --no-print-directory OUTPUT=$$BUILD_TARGET COLLECTION=$$TARGET \
-C $$TARGET emit_tests >> $(TEST_LIST); \
done;
+ @if git describe HEAD > /dev/null 2>&1; then \
+ git describe HEAD > $(INSTALL_PATH)/VERSION; \
+ printf "Version saved to $(INSTALL_PATH)/VERSION\n"; \
+ else \
+ printf "Unable to get version from git describe\n"; \
+ fi
else
$(error Error: set INSTALL_PATH to use install)
endif
--
2.47.1
During performance analysis of console subsystem latency, I discovered that
netconsole registers console handlers even when no active targets exist.
These orphaned console handlers are invoked on every printk() call, get
the lock, iterate through empty target lists, and consume CPU cycles
without performing any useful work.
This patch series addresses the inefficiency by:
1. Implementing dynamic console registration/unregistration based on target
availability, ensuring console handlers are only active when needed
2. Adding automatic cleanup of unused console registrations when targets
are disabled or removed
3. Extending the selftest suite to cover non-extended console format,
which was previously untested
The optimization reduces printk() overhead by eliminating unnecessary
function calls and list traversals when netconsole targets are not
configured, improving overall system performance during heavy logging
scenarios.
---
Changes in v3:
- Set CON_ENABLED before re-enabling the console, to fix a selftest that
was failing, as reported by Jakub on v2.
- Link to v2: https://lore.kernel.org/r/20250602-netcons_ext-v2-0-ef88d999326d@debian.org
Changes in v2:
- Added selftests to test the new mechanism
- Unregister the console if the last target got disabled
- Sending to net-next instead of net (Jakub)
- Link to v1: https://lore.kernel.org/r/20250528-netcons_ext-v1-1-69f71e404e00@debian.org
---
Breno Leitao (4):
netconsole: Only register console drivers when targets are configured
netconsole: Add automatic console unregistration on target removal
selftests: netconsole: Do not exit from inside the validation function
selftests: netconsole: Add support for basic netconsole target format
drivers/net/netconsole.c | 67 +++++++++++++++++++---
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 27 +++++++--
.../testing/selftests/drivers/net/netcons_basic.sh | 50 ++++++++++------
3 files changed, 112 insertions(+), 32 deletions(-)
---
base-commit: 2c7e4a2663a1ab5a740c59c31991579b6b865a26
change-id: 20250528-netcons_ext-572982619bea
Best regards,
--
Breno Leitao <leitao(a)debian.org>
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed.
As a later addition, `clippy::cast_lossless` and `clippy::ref_as_ptr`
are also enabled.
This series depends on "rust: retain pointer mut-ness in
`container_of!`"[1].
Link: https://lore.kernel.org/all/20250409-container-of-mutness-v1-1-64f472b94534… [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v10:
- Move fragment from "rust: enable `clippy::ptr_cast_constness` lint" to
"rust: enable `clippy::ptr_as_ptr` lint". (Boqun Feng)
- Replace `(...).into()` with `T::from(...)` where the destination type
isn't obvious in "rust: enable `clippy::cast_lossless` lint". (Boqun
Feng)
- Link to v9: https://lore.kernel.org/r/20250416-ptr-as-ptr-v9-0-18ec29b1b1f3@gmail.com
Changes in v9:
- Replace ref-to-ptr coercion using `let` bindings with
`core::ptr::from_{ref,mut}`. (Boqun Feng).
- Link to v8: https://lore.kernel.org/r/20250409-ptr-as-ptr-v8-0-3738061534ef@gmail.com
Changes in v8:
- Use coercion to go ref -> ptr.
- rustfmt.
- Rebase on v6.15-rc1.
- Extract first commit to its own series as it is shared with other
series.
- Link to v7: https://lore.kernel.org/r/20250325-ptr-as-ptr-v7-0-87ab452147b9@gmail.com
Changes in v7:
- Add patch to enable `clippy::ref_as_ptr`.
- Link to v6: https://lore.kernel.org/r/20250324-ptr-as-ptr-v6-0-49d1b7fd4290@gmail.com
Changes in v6:
- Drop strict provenance patch.
- Fix URLs in doc comments.
- Add patch to enable `clippy::cast_lossless`.
- Rebase on rust-next.
- Link to v5: https://lore.kernel.org/r/20250317-ptr-as-ptr-v5-0-5b5f21fa230a@gmail.com
Changes in v5:
- Use `pointer::addr` in OF. (Boqun Feng)
- Add documentation on stubs. (Benno Lossin)
- Mark stubs `#[inline]`.
- Pick up Alice's RB on a shared commit from
https://lore.kernel.org/all/Z9f-3Aj3_FWBZRrm@google.com/.
- Link to v4: https://lore.kernel.org/r/20250315-ptr-as-ptr-v4-0-b2d72c14dc26@gmail.com
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (6):
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: enable `clippy::cast_lossless` lint
rust: enable `clippy::ref_as_ptr` lint
Makefile | 6 ++++++
drivers/gpu/drm/drm_panic_qr.rs | 2 +-
rust/bindings/lib.rs | 3 +++
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 ++--
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 6 +++---
rust/kernel/device.rs | 4 ++--
rust/kernel/device_id.rs | 4 ++--
rust/kernel/devres.rs | 19 ++++++++++---------
rust/kernel/dma.rs | 6 +++---
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 ++-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/io.rs | 18 +++++++++---------
rust/kernel/kunit.rs | 11 +++++++----
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/of.rs | 6 +++---
rust/kernel/pci.rs | 11 +++++++----
rust/kernel/platform.rs | 4 +++-
rust/kernel/print.rs | 6 +++---
rust/kernel/seq_file.rs | 2 +-
rust/kernel/str.rs | 14 +++++++-------
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/time/hrtimer/pin.rs | 2 +-
rust/kernel/time/hrtimer/pin_mut.rs | 2 +-
rust/kernel/uaccess.rs | 4 ++--
rust/kernel/workqueue.rs | 12 ++++++------
rust/uapi/lib.rs | 3 +++
31 files changed, 96 insertions(+), 74 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
prerequisite-change-id: 20250409-container-of-mutness-b153dab4388d:v1
prerequisite-patch-id: 53d5889db599267f87642bb0ae3063c29bc24863
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Some small fixes for arch_timer_edge_cases that I stumbled upon
while debugging failures for this selftest on ampere-one.
Changes since v1:
* determine effective counter width based on suggestions from Marc
Changes since v2:
* new patch to fix xval initialization
I've done tests with this on various machines - no issues during
several hundreds of test runs.
v1: https://lore.kernel.org/kvmarm/20250509143312.34224-1-sebott@redhat.com/
v2: https://lore.kernel.org/kvmarm/20250527142434.25209-1-sebott@redhat.com/
Sebastian Ott (4):
KVM: arm64: selftests: fix help text for arch_timer_edge_cases
KVM: arm64: selftests: fix thread migration in arch_timer_edge_cases
KVM: arm64: selftests: arch_timer_edge_cases - fix xval init
KVM: arm64: selftests: arch_timer_edge_cases - determine effective counter width
.../kvm/arm64/arch_timer_edge_cases.c | 39 ++++++++++++-------
1 file changed, 25 insertions(+), 14 deletions(-)
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
--
2.49.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v17.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v17 (25-May-2025, Resent at 10-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (4):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 156 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1146 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 1 +
9 files changed, 1645 insertions(+)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
This improves the expressiveness of unprivileged BPF by inserting
speculation barriers instead of rejecting the programs.
The approach was previously presented at LPC'24 [1] and RAID'24 [2].
To mitigate the Spectre v1 (PHT) vulnerability, the kernel rejects
potentially-dangerous unprivileged BPF programs as of
commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted
branches"). In [2], we have analyzed 364 object files from open source
projects (Linux Samples and Selftests, BCC, Loxilb, Cilium, libbpf
Examples, Parca, and Prevail) and found that this affects 31% to 54% of
programs.
To resolve this in the majority of cases this patchset adds a fall-back
for mitigating Spectre v1 using speculation barriers. The kernel still
optimistically attempts to verify all speculative paths but uses
speculation barriers against v1 when unsafe behavior is detected. This
allows for more programs to be accepted without disabling the BPF
Spectre mitigations (e.g., by setting cpu_mitigations_off()).
For this, it relies on the fact that speculation barriers generally
prevent all later instructions from executing if the speculation was not
correct (not only loads). See patch 7 ("bpf: Fall back to nospec for
Spectre v1") for a detailed description and references to the relevant
vendor documentation (AMD and Intel x86-64, ARM64, and PowerPC).
In [1] we have measured the overhead of this approach relative to having
mitigations off and including the upstream Spectre v4 mitigations. For
event tracing and stack-sampling profilers, we found that mitigations
increase BPF program execution time by 0% to 62%. For the Loxilb network
load balancer, we have measured a 14% slowdown in SCTP performance but
no significant slowdown for TCP. This overhead only applies to programs
that were previously rejected.
I reran the expressiveness-evaluation with v6.14 and made sure the main
results still match those from [1] and [2] (which used v6.5).
Main design decisions are:
* Do not use separate bytecode insns for v1 and v4 barriers (inspired by
Daniel Borkmann's question at LPC). This simplifies the verifier
significantly and has the only downside that performance on PowerPC is
not as high as it could be.
* Allow archs to still disable v1/v4 mitigations separately by setting
bpf_jit_bypass_spec_v1/v4(). This has the benefit that archs can
benefit from improved BPF expressiveness / performance if they are not
vulnerable (e.g., ARM64 for v4 in the kernel).
* Do not remove the empty BPF_NOSPEC implementation for backends for
which it is unknown whether they are vulnerable to Spectre v1.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
Changes:
* v3 -> v4:
- Remove insn parameter from do_check_insn() and extract
process_bpf_exit_full as a function as requested by Eduard
- Investigate apparent sanitize_check_bounds() bug reported by
Kartikeya (does appear to not be a bug but only confusing code),
sent separate patch to document it and add an assert
- Remove already-merged commit 1 ("selftests/bpf: Fix caps for
__xlated/jited_unpriv")
- Drop former commit 10 ("bpf: Allow nospec-protected var-offset stack
access") as it did not include a test and there are other places
where var-off is rejected. Also, none of the tested real-world
programs used var-off in the paper. Therefore keep the old behavior
for now and potentially prepare a patch that converts all cases
later if required.
- Add link to AMD lfence and PowerPC speculation barrier (ori 31,31,0)
documentation
- Move detailed barrier documentation to commit 7 ("bpf: Fall back to
nospec for Spectre v1")
- Link to v3: https://lore.kernel.org/all/20250501073603.1402960-1-luis.gerhorst@fau.de/
* v2 -> v3:
- Fix
https://lore.kernel.org/oe-kbuild-all/202504212030.IF1SLhz6-lkp@intel.com/
and similar by moving the bpf_jit_bypass_spec_v1/v4() prototypes out
of the #ifdef CONFIG_BPF_SYSCALL. Decided not to move them to
filter.h (where similar bpf_jit_*() prototypes live) as they would
still have to be duplicated in bpf.h to be usable to
bpf_bypass_spec_v1/v4() (unless including filter.h in bpf.h is an
option).
- Fix
https://lore.kernel.org/oe-kbuild-all/202504220035.SoGveGpj-lkp@intel.com/
by moving the variable declarations out of the switch-case.
- Build touched C files with W=2 and bpf config on x86 to check that
there are no other warnings introduced.
- Found 3 more checkpatch warnings that can be fixed without degrading
readability.
- Rebase to bpf-next 2025-05-01
- Link to v2: https://lore.kernel.org/bpf/20250421091802.3234859-1-luis.gerhorst@fau.de/
* v1 -> v2:
- Drop former commits 9 ("bpf: Return PTR_ERR from push_stack()") and 11
("bpf: Fall back to nospec for spec path verification") as suggested
by Alexei. This series therefore no longer changes push_stack() to
return PTR_ERR.
- Add detailed explanation of how lfence works internally and how it
affects the algorithm.
- Add tests checking that nospec instructions are inserted in expected
locations using __xlated_unpriv as suggested by Eduard (also,
include a fix for __xlated_unpriv)
- Add a test for the mitigations from the description of
commit 9183671af6db ("bpf: Fix leakage under speculation on
mispredicted branches")
- Remove unused variables from do_check[_insn]() as suggested by
Eduard.
- Remove INSN_IDX_MODIFIED to improve readability as suggested by
Eduard. This also causes the nospec_result-check to run (and fail)
for jumping-ops. Add a warning to assert that this check must never
succeed in that case.
- Add details on the safety of patch 10 ("bpf: Allow nospec-protected
var-offset stack access") based on the feedback on v1.
- Rebase to bpf-next-250420
- Link to v1: https://lore.kernel.org/all/20250313172127.1098195-1-luis.gerhorst@fau.de/
* RFC -> v1:
- rebase to bpf-next-250313
- tests: mark expected successes/new errors
- add bpt_jit_bypass_spec_v1/v4() to avoid #ifdef in
bpf_bypass_spec_v1/v4()
- ensure that nospec with v1-support is implemented for archs for
which GCC supports speculation barriers, except for MIPS
- arm64: emit speculation barrier
- powerpc: change nospec to include v1 barrier
- discuss potential security (archs that do not impl. BPF nospec) and
performance (only PowerPC) regressions
- Link to RFC: https://lore.kernel.org/bpf/20250224203619.594724-1-luis.gerhorst@fau.de/
Luis Gerhorst (9):
bpf: Move insn if/else into do_check_insn()
bpf: Return -EFAULT on misconfigurations
bpf: Return -EFAULT on internal errors
bpf, arm64, powerpc: Add bpf_jit_bypass_spec_v1/v4()
bpf, arm64, powerpc: Change nospec to include v1 barrier
bpf: Rename sanitize_stack_spill to nospec_result
bpf: Fall back to nospec for Spectre v1
selftests/bpf: Add test for Spectre v1 mitigation
bpf: Fall back to nospec for sanitization-failures
arch/arm64/net/bpf_jit.h | 5 +
arch/arm64/net/bpf_jit_comp.c | 28 +-
arch/powerpc/net/bpf_jit_comp64.c | 80 ++-
include/linux/bpf.h | 11 +-
include/linux/bpf_verifier.h | 3 +-
include/linux/filter.h | 2 +-
kernel/bpf/core.c | 32 +-
kernel/bpf/verifier.c | 633 ++++++++++--------
tools/testing/selftests/bpf/progs/bpf_misc.h | 4 +
.../selftests/bpf/progs/verifier_and.c | 8 +-
.../selftests/bpf/progs/verifier_bounds.c | 66 +-
.../bpf/progs/verifier_bounds_deduction.c | 45 +-
.../selftests/bpf/progs/verifier_map_ptr.c | 20 +-
.../selftests/bpf/progs/verifier_movsx.c | 16 +-
.../selftests/bpf/progs/verifier_unpriv.c | 65 +-
.../bpf/progs/verifier_value_ptr_arith.c | 101 ++-
.../selftests/bpf/verifier/dead_code.c | 3 +-
tools/testing/selftests/bpf/verifier/jmp32.c | 33 +-
tools/testing/selftests/bpf/verifier/jset.c | 10 +-
19 files changed, 755 insertions(+), 410 deletions(-)
base-commit: cd2e103d57e5615f9bb027d772f93b9efd567224
--
2.49.0
The mm selftests are timing out with the current 180-second limit.
Testing shows that run_vmtests.sh takes approximately 11 minutes
(664 seconds) to complete.
Increase the timeout to 900 seconds (15 minutes) to provide sufficient
buffer for the tests to complete successfully.
Signed-off-by: Shivank Garg <shivankg(a)amd.com>
---
tools/testing/selftests/mm/settings | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/settings b/tools/testing/selftests/mm/settings
index a953c96aa16e..e2206265f67c 100644
--- a/tools/testing/selftests/mm/settings
+++ b/tools/testing/selftests/mm/settings
@@ -1 +1 @@
-timeout=180
+timeout=900
--
2.43.0
Hello everyone,
The schedule for the Automated Testing Summit (ATS) 2025 is now live!
You can now explore the full program and speaker list at:
🔗 https://ats25.sched.com/
This year’s ATS will be packed with talks and discussions focused on scaling test infrastructure, improving collaboration across projects, and pushing the boundaries of automation in the Linux ecosystem.
📍 ATS 2025 will take place as a co-located event at the Open Source Summit North America, on June 26th in Denver, CO.
If you haven’t yet registered, you can do so here:
🔗 https://events.linuxfoundation.org/open-source-summit-north-america/feature…
You can attend in person or virtually. We look forward to seeing you there!
Best regards,
The KernelCI Team
--
Gustavo Padovan
Collabora Ltd.
The test file for the IR decoder used single-line comments at the top
to document its purpose and licensing, which is inconsistent with the style
used throughout the Linux kernel.
in this patch i converted the file header to a proper multi-line comment block
(/*) that aligns with standard kernel practices. This improves
readability, consistency across selftests, and ensures the license and
documentation are clearly visible in a familiar format.
No functional changes have been made.
Signed-off-by: Abdelrahman Fekry <Abdelrahmanfekry375(a)gmail.com>
---
tools/testing/selftests/ir/ir_loopback.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/ir/ir_loopback.c b/tools/testing/selftests/ir/ir_loopback.c
index f4a15cbdd5ea..2de4a6296f35 100644
--- a/tools/testing/selftests/ir/ir_loopback.c
+++ b/tools/testing/selftests/ir/ir_loopback.c
@@ -1,14 +1,15 @@
// SPDX-License-Identifier: GPL-2.0
-// test ir decoder
-//
-// Copyright (C) 2018 Sean Young <sean(a)mess.org>
-
-// When sending LIRC_MODE_SCANCODE, the IR will be encoded. rc-loopback
-// will send this IR to the receiver side, where we try to read the decoded
-// IR. Decoding happens in a separate kernel thread, so we will need to
-// wait until that is scheduled, hence we use poll to check for read
-// readiness.
-
+/* Copyright (C) 2018 Sean Young <sean(a)mess.org>
+ *
+ * Selftest for IR decoder
+ *
+ *
+ * When sending LIRC_MODE_SCANCODE, the IR will be encoded. rc-loopback
+ * will send this IR to the receiver side, where we try to read the decoded
+ * IR. Decoding happens in a separate kernel thread, so we will need to
+ * wait until that is scheduled, hence we use poll to check for read
+ * readiness.
+*/
#include <linux/lirc.h>
#include <errno.h>
#include <stdio.h>
--
2.25.1
IDT event delivery has a debug hole in which it does not generate #DB
upon returning to userspace before the first userspace instruction is
executed if the Trap Flag (TF) is set.
FRED closes this hole by introducing a software event flag, i.e., bit
17 of the augmented SS: if the bit is set and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion of
ERETU.
However I overlooked properly setting and clearing the bit in different
situations. Thus when FRED is enabled, if the Trap Flag (TF) is set
without an external debugger attached, it can lead to an infinite loop
in the SIGTRAP handler. To avoid this, the software event flag in the
augmented SS must be cleared, ensuring that no single-step trap remains
pending when ERETU completes.
This patch set combines the fix [1] and its corresponding selftest [2]
(requested by Dave Hansen) into one patch set.
[1] https://lore.kernel.org/lkml/20250523050153.3308237-1-xin@zytor.com/
[2] https://lore.kernel.org/lkml/20250530230707.2528916-1-xin@zytor.com/
This patch set is based on tip/x86/urgent branch.
Link to v5 of this patch set:
https://lore.kernel.org/lkml/20250606174528.1004756-1-xin@zytor.com/
Changes in v6:
*) Replace a "sub $128, %rsp" with "add $-128, %rsp" (hpa).
*) Declared loop_count_on_same_ip inside sigtrap() (Sohil).
*) s/sigtrap/SIGTRAP (Sohil).
*) Add TB from Sohil to the first patch.
Xin Li (Intel) (2):
x86/fred/signal: Prevent immediate repeat of single step trap on
return from SIGTRAP handler
selftests/x86: Add a test to detect infinite SIGTRAP handler loop
arch/x86/include/asm/sighandling.h | 22 +++++
arch/x86/kernel/signal_32.c | 4 +
arch/x86/kernel/signal_64.c | 4 +
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/sigtrap_loop.c | 101 +++++++++++++++++++++
5 files changed, 132 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c
base-commit: dd2922dcfaa3296846265e113309e5f7f138839f
--
2.49.0
I had cause to look at the vfork() support for GCS and realised that we
don't have any direct test coverage, this series does so by adding
vfork() to nolibc and then using that in basic-gcs to provide some
simple vfork() coverage.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
tools/nolibc: Provide vfork()
kselftest/arm64: Add a test for vfork() with GCS
tools/include/nolibc/sys.h | 29 ++++++++++++
tools/testing/selftests/arm64/gcs/basic-gcs.c | 63 +++++++++++++++++++++++++++
2 files changed, 92 insertions(+)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250528-arm64-gcs-vfork-exit-4a7daf7652ee
Best regards,
--
Mark Brown <broonie(a)kernel.org>
IDT event delivery has a debug hole in which it does not generate #DB
upon returning to userspace before the first userspace instruction is
executed if the Trap Flag (TF) is set.
FRED closes this hole by introducing a software event flag, i.e., bit
17 of the augmented SS: if the bit is set and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion of
ERETU.
However I overlooked properly setting and clearing the bit in different
situations. Thus when FRED is enabled, if the Trap Flag (TF) is set
without an external debugger attached, it can lead to an infinite loop
in the SIGTRAP handler. To avoid this, the software event flag in the
augmented SS must be cleared, ensuring that no single-step trap remains
pending when ERETU completes.
This patch set combines the fix [1] and its corresponding selftest [2]
(requested by Dave Hansen) into one patch set.
[1] https://lore.kernel.org/lkml/20250523050153.3308237-1-xin@zytor.com/
[2] https://lore.kernel.org/lkml/20250530230707.2528916-1-xin@zytor.com/
This patch set is based on tip/x86/urgent branch as of today.
Link to v4 of this patch set:
https://lore.kernel.org/lkml/20250605181020.590459-1-xin@zytor.com/
Changes in v5:
*) Accurately rephrase the shortlog (hpa).
*) Do "sub $-128, %rsp" rather than "add $128, %rsp", which is more
efficient in code size (hpa).
*) Add TB from Sohil.
*) Add Cc: stable(a)vger.kernel.org to all patches.
Xin Li (Intel) (2):
x86/fred/signal: Prevent immediate repeat of single step trap on
return from SIGTRAP handler
selftests/x86: Add a test to detect infinite sigtrap handler loop
arch/x86/include/asm/sighandling.h | 22 +++++
arch/x86/kernel/signal_32.c | 4 +
arch/x86/kernel/signal_64.c | 4 +
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/sigtrap_loop.c | 98 ++++++++++++++++++++++
5 files changed, 129 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c
base-commit: dd2922dcfaa3296846265e113309e5f7f138839f
--
2.49.0
Hello,
this series is a revival of Xu Kuhoai's work to enable larger arguments
count for BPF programs on ARM64 ([1]). His initial series received some
positive feedback, but lacked some specific case handling around
arguments alignment (see AAPCS64 C.14 rule in section 6.8.2, [2]). There
as been another attempt from Puranjay Mohan, which was unfortunately
missing the same thing ([3]). Since there has been some time between
those series and this new one, I chose to send it as a new series
rather than a new revision of the existing series.
To support the increased argument counts and arguments larger than
registers size (eg: structures), the trampoline does the following:
- for bpf programs: arguments are retrieved from both registers and the
function stack, and pushed in the trampoline stack as an array of u64
to generate the programs context. It is then passed by pointer to the
bpf programs
- when the trampoline is in charge of calling the original function: it
restores the registers content, and generates a new stack layout for
the additional arguments that do not fit in registers.
This new attempt is based on Xu's series and aims to handle the
missing alignment concern raised in the reviews discussions. The main
novelties are then around arguments alignments:
- the first commit is exposing some new info in the BTF function model
passed to the JIT compiler to allow it to deduce the needed alignment
when configuring the trampoline stack
- the second commit is taken from Xu's series, and received the
following modifications:
- the calc_aux_args computes an expected alignment for each argument
- the calc_aux_args computes two different stack space sizes: the one
needed to store the bpf programs context, and the original function
stacked arguments (which needs alignment). Those stack sizes are in
bytes instead of "slots"
- when saving/restoring arguments for bpf program or for the original
function, make sure to align the load/store accordingly, when
relevant
- a few typos fixes and some rewording, raised by the review on the
original series
- the last commit introduces some explicit tests that ensure that the
needed alignment is enforced by the trampoline
I marked the series as RFC because it appears that the new tests trigger
some failures in CI on x86 and s390, despite the series not touching any
code related to those architectures. Some very early investigation/gdb
debugging on the x86 side seems to hint that it could be related to the
same missing alignment too (based on section 3.2.3 in [4], and so the
x86 trampoline would need the same alignment handling ?). For s390 it
looks less clear, as all values captured from the bpf test program are
set to 0 in the CI output, and I don't have the proper setup yet to
check the low level details. I am tempted to isolate those new tests
(which were actually useful to spot real issues while tuning the ARM64
trampoline) and add them to the relevant DENYLIST files for x86/s390,
but I guess this is not the right direction, so I would gladly take a
second opinion on this.
[1] https://lore.kernel.org/all/20230917150752.69612-1-xukuohai@huaweicloud.com…
[2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#id82
[3] https://lore.kernel.org/bpf/20240705125336.46820-1-puranjay@kernel.org/
[4] https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (3):
bpf: add struct largest member size in func model
bpf/selftests: add tests to validate proper arguments alignment on ARM64
bpf/selftests: enable tracing tests for ARM64
Xu Kuohai (1):
bpf, arm64: Support up to 12 function arguments
arch/arm64/net/bpf_jit_comp.c | 235 ++++++++++++++++-----
include/linux/bpf.h | 1 +
kernel/bpf/btf.c | 25 +++
tools/testing/selftests/bpf/DENYLIST.aarch64 | 3 -
.../selftests/bpf/prog_tests/tracing_struct.c | 23 ++
tools/testing/selftests/bpf/progs/tracing_struct.c | 10 +-
.../selftests/bpf/progs/tracing_struct_many_args.c | 67 ++++++
.../testing/selftests/bpf/test_kmods/bpf_testmod.c | 50 +++++
8 files changed, 357 insertions(+), 57 deletions(-)
---
base-commit: 91e7eb701b4bc389e7ddfd80ef6e82d1a6d2d368
change-id: 20250220-many_args_arm64-8bd3747e6948
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Currently each of the iommu page table formats duplicates all of the logic
to maintain the page table and perform map/unmap/etc operations. There are
several different versions of the algorithms between all the different
formats. The io-pgtable system provides an interface to help isolate the
page table code from the iommu driver, but doesn't provide tools to
implement the common algorithms.
This makes it very hard to improve the state of the pagetable code under
the iommu domains as any proposed improvement needs to alter a large
number of different driver code paths. Combined with a lack of software
based testing this makes improvement in this area very hard.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes
poked in them dynamically (ie guestmemfd hitless shared/private
transitions)
- More agressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance
in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough to be a very significant
task to go and implement in all the page table formats we support. Just
the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86
PAE / AMDv1 / VT-D SS / RISCV)
Instead of doing the duplicated work, this series takes the first step to
consolidate the algorithms into one places. In spirit it is similar to the
work Christoph did a few years back to pull the redundant get_user_pages()
implementations out of the arch code into core MM. This unlocked a great
deal of improvement in that space in the following years. I would like to
see the same benefit in iommu as well.
My first RFC showed a bigger picture with all most all formats and more
algorithms. This series reorganizes that to be narrowly focused on just
enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all
formats on x86, nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few
different options/bugs while still requiring the complicated contiguous
pages support. The HW also has a very simple range based invalidation
approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit for bit
identical to the current code, tested using a compare kunit test that
checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff
is fairly straightforward now. The layering is fixed up in the new version
so that all the invalidation goes through function pointers.
Several small fixing patches have come out of this as I've been fixing the
problems that the test suite uncovers in the current code, and
implementing the fixed version in iommupt.
On performance, there is a quite wide variety of implementation designs
across all the drivers. Looking at some key performance across
the main formats:
iommu_map():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 51,63 , 19.19 (AMDV1)
256*2^12, 386,1909 , 367,1795 , 79.79
256*2^21, 362,1633 , 355,1556 , 77.77
2^12, 56,62 , 52,59 , 11.11 (AMDv2)
256*2^12, 405,1355 , 357,1292 , 72.72
256*2^21, 393,1160 , 358,1114 , 67.67
2^12, 55,65 , 53,62 , 14.14 (VTD second stage)
256*2^12, 391,518 , 332,512 , 35.35
256*2^21, 383,635 , 336,624 , 46.46
2^12, 57,65 , 55,63 , 12.12 (ARM 64 bit)
256*2^12, 380,389 , 361,369 , 2.02
256*2^21, 358,419 , 345,400 , 13.13
iommu_unmap():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 69,88 , 65,85 , 23.23 (AMDv1)
256*2^12, 353,6498 , 331,6029 , 94.94
256*2^21, 373,6014 , 360,5706 , 93.93
2^12, 71,72 , 66,69 , 4.04 (AMDv2)
256*2^12, 228,891 , 206,871 , 76.76
256*2^21, 254,721 , 245,711 , 65.65
2^12, 69,87 , 65,82 , 20.20 (VTD second stage)
256*2^12, 210,321 , 200,315 , 36.36
256*2^21, 255,349 , 238,342 , 30.30
2^12, 72,77 , 68,74 , 8.08 (ARM 64 bit)
256*2^12, 521,357 , 447,346 , -29.29
256*2^21, 489,358 , 433,345 , -25.25
* Above numbers include additional patches to remove the iommu_pgsize()
overheads. gcc 13.3.0, i7-12700
This version provides fairly consistent performance across formats. ARM
unmap performance is quite different because this version supports
contiguous pages and uses a very different algorithm for unmapping. Though
why it is so worse compared to AMDv1 I haven't figured out yet.
The per-format commits include a more detailed chart.
There is a second branch:
https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- VT-D second stage format
- DART v1 & v2 format
- Draft of a iommufd 'cut' operation to break down huge pages
- Draft of support for a DMA incoherent HW page table walker
- A compare test that checks the iommupt formats against the iopgtable
interface, including updating AMD to have a working iopgtable and patches
to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iogptable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
As we make more algorithm improvements the value to convert the drivers
increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt
v1:
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Alejandro Jimenez (1):
iommu/amd: Use the generic iommu page table
Jason Gunthorpe (14):
genpt: Generic Page Table base API
genpt: Add Documentation/ files
iommupt: Add the basic structure of the iommu implementation
iommupt: Add the AMD IOMMU v1 page table format
iommupt: Add iova_to_phys op
iommupt: Add unmap_pages op
iommupt: Add map_pages op
iommupt: Add read_and_clear_dirty op
iommupt: Add a kunit test for Generic Page Table
iommupt: Add a mock pagetable format for iommufd selftest to use
iommufd: Change the selftest to use iommupt instead of xarray
iommupt: Add the x86 64 bit page table format
iommu/amd: Remove AMD io_pgtable support
iommupt: Add a kunit test for the IOMMU implementation
.clang-format | 1 +
Documentation/driver-api/generic_pt.rst | 105 ++
Documentation/driver-api/index.rst | 1 +
drivers/iommu/Kconfig | 2 +
drivers/iommu/Makefile | 1 +
drivers/iommu/amd/Kconfig | 5 +-
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 1 -
drivers/iommu/amd/amd_iommu_types.h | 109 +-
drivers/iommu/amd/io_pgtable.c | 560 --------
drivers/iommu/amd/io_pgtable_v2.c | 370 ------
drivers/iommu/amd/iommu.c | 493 ++++---
drivers/iommu/generic_pt/.kunitconfig | 13 +
drivers/iommu/generic_pt/Kconfig | 72 ++
drivers/iommu/generic_pt/fmt/Makefile | 26 +
drivers/iommu/generic_pt/fmt/amdv1.h | 407 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 12 +
drivers/iommu/generic_pt/fmt/x86_64.h | 241 ++++
drivers/iommu/generic_pt/iommu_pt.h | 1146 +++++++++++++++++
drivers/iommu/generic_pt/kunit_generic_pt.h | 721 +++++++++++
drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++
drivers/iommu/generic_pt/pt_common.h | 351 +++++
drivers/iommu/generic_pt/pt_defs.h | 312 +++++
drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++
drivers/iommu/generic_pt/pt_iter.h | 638 +++++++++
drivers/iommu/generic_pt/pt_log2.h | 130 ++
drivers/iommu/io-pgtable.c | 4 -
drivers/iommu/iommufd/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_test.h | 11 +-
drivers/iommu/iommufd/selftest.c | 439 +++----
include/linux/generic_pt/common.h | 166 +++
include/linux/generic_pt/iommu.h | 264 ++++
include/linux/io-pgtable.h | 2 -
tools/testing/selftests/iommu/iommufd.c | 60 +-
tools/testing/selftests/iommu/iommufd_utils.h | 12 +
41 files changed, 6046 insertions(+), 1574 deletions(-)
create mode 100644 Documentation/driver-api/generic_pt.rst
delete mode 100644 drivers/iommu/amd/io_pgtable.c
delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
create mode 100644 drivers/iommu/generic_pt/.kunitconfig
create mode 100644 drivers/iommu/generic_pt/Kconfig
create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/pt_common.h
create mode 100644 drivers/iommu/generic_pt/pt_defs.h
create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
create mode 100644 drivers/iommu/generic_pt/pt_iter.h
create mode 100644 drivers/iommu/generic_pt/pt_log2.h
create mode 100644 include/linux/generic_pt/common.h
create mode 100644 include/linux/generic_pt/iommu.h
base-commit: db37090502f67e46541e53b91f00bbd565c96bd0
--
2.43.0
Some unit tests intentionally trigger warning backtraces by passing bad
parameters to kernel API functions. Such unit tests typically check the
return value from such calls, not the existence of the warning backtrace.
Such intentionally generated warning backtraces are neither desirable
nor useful for a number of reasons:
- They can result in overlooked real problems.
- A warning that suddenly starts to show up in unit tests needs to be
investigated and has to be marked to be ignored, for example by
adjusting filter scripts. Such filters are ad hoc because there is
no real standard format for warnings. On top of that, such filter
scripts would require constant maintenance.
One option to address the problem would be to add messages such as
"expected warning backtraces start/end here" to the kernel log.
However, that would again require filter scripts, might result in
missing real problematic warning backtraces triggered while the test
is running, and the irrelevant backtrace(s) would still clog the
kernel log.
Solve the problem by providing a means to identify and suppress specific
warning backtraces while executing test code. Support suppressing multiple
backtraces while at the same time limiting changes to generic code to the
absolute minimum.
Overview:
Patch#1 Introduces the suppression infrastructure.
Patch#2 Mitigate the impact at WARN*() sites.
Patch#3 Adds selftests to validate the functionality.
Patch#4 Demonstrates real-world usage in the DRM subsystem.
Patch#5 Documents the new API and usage guidelines.
Design Notes:
The objective is to suppress unwanted WARN*() generated messages.
Although most major architectures share common bug handling via `lib/bug.c`
and `report_bug()`, some minor or legacy architectures still rely on their
own platform-specific handling. This divergence must be considered in any
such feature. Additionally, a key challenge in implementing this feature is
the fragmentation of `WARN*()` messages emission: specific part in the
macro, common with BUG*() part in the exception handler.
As a result, any intervention to suppress the message must occur before the
illegal instruction.
Lessons from the Previous Attempt
In earlier iterations, suppression logic was added inside the
`__report_bug()` function to intercept WARN*() messages not producing
messages in the macro.
To implement the check in the check in the bug handler code, two strategies
were considered:
* Strategy #1: Use `kallsyms` to infer the originating functionid, namely
a pointer to the function. Since in any case, the user interface relies
on function names, they must be translated in addresses at suppression-
time or at check-time.
Assuming to translate at suppression-time, the `kallsyms` subsystem needs
to be used to determine the symbol address from the name, and again to
produce the functionid from `bugaddr`. This approach proved unreliable
due to compiler-induced transformations such as inlining, cloning, and
code fragmentation. Attempts to preventing them is also unconvenient
because several `WARN()` sites are in functions intentionally declared
as `__always_inline`.
* Strategy #2: Store function name `__func__` in `struct bug_entry` in
the `__bug_table`. This implementation was used in the previous version.
However, `__func__` is a compiler-generated symbol, which complicates
relocation and linking in position-independent code. Workarounds such as
storing offsets from `.rodata` or embedding string literals directly into
the table would have significantly either increased complexity or
increase the __bug_table size.
Additionally, architectures not using the unified `BUG()` path would
still require ad-hoc handling. Because current WARN*() message production
strategy, a few WARN*() macros still need a check to suppress the part of
the message produced in the macro itself.
Current Proposal: Check Directly in the `WARN()` Macros.
This avoids the need for function symbol resolution or ELF section
modification.
Suppression is implemented directly in the `WARN*()` macros.
A helper function, `__kunit_is_suppressed_warning()`, is used to determine
whether suppression applies. It is marked as `noinstr`, since some `WARN*()`
sites reside in non-instrumentable sections. As it uses `strcmp`, a
`noinstr` version of `strcmp` was introduced.
The implementation is deliberately simple and avoids architecture-specific
optimizations to preserve portability. Since this mechanism compares
function names and is intended for test usage only, performance is not a
primary concern.
This series is based on the RFC patch and subsequent discussion at
https://patchwork.kernel.org/project/linux-kselftest/patch/02546e59-1afe-4b…
and offers a more comprehensive solution of the problem discussed there.
Changes since RFC:
- Introduced CONFIG_KUNIT_SUPPRESS_BACKTRACE
- Minor cleanups and bug fixes
- Added support for all affected architectures
- Added support for counting suppressed warnings
- Added unit tests using those counters
- Added patch to suppress warning backtraces in dev_addr_lists tests
Changes since v1:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
[I retained those tags since there have been no functional changes]
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option, enabled by
default.
Changes since v2:
- Rebased to v6.9-rc2
- Added comments to drm warning suppression explaining why it is needed.
- Added patch to move conditional code in arch/sh/include/asm/bug.h
to avoid kerneldoc warning
- Added architecture maintainers to Cc: for architecture specific patches
- No functional changes
Changes since v3:
- Rebased to v6.14-rc6
- Dropped net: "kunit: Suppress lock warning noise at end of dev_addr_lists tests"
since 3db3b62955cd6d73afde05a17d7e8e106695c3b9
- Added __kunit_ and KUNIT_ prefixes.
- Tested on interessed architectures.
Changes since v4:
- Rebased to v6.15-rc7
- Dropped all code in __report_bug()
- Moved all checks in WARN*() macros.
- Dropped all architecture specific code.
- Made __kunit_is_suppressed_warning nice to noinstr functions.
Alessandro Carminati (2):
bug/kunit: Core support for suppressing warning backtraces
bug/kunit: Suppressing warning backtraces reduced impact on WARN*()
sites
Guenter Roeck (3):
Add unit tests to verify that warning backtrace suppression works.
drm: Suppress intentional warning backtraces in scaling unit tests
kunit: Add documentation for warning backtrace suppression API
Documentation/dev-tools/kunit/usage.rst | 30 ++++++-
drivers/gpu/drm/tests/drm_rect_test.c | 16 ++++
include/asm-generic/bug.h | 48 +++++++----
include/kunit/bug.h | 62 ++++++++++++++
include/kunit/test.h | 1 +
lib/kunit/Kconfig | 9 ++
lib/kunit/Makefile | 9 +-
lib/kunit/backtrace-suppression-test.c | 105 ++++++++++++++++++++++++
lib/kunit/bug.c | 54 ++++++++++++
9 files changed, 316 insertions(+), 18 deletions(-)
create mode 100644 include/kunit/bug.h
create mode 100644 lib/kunit/backtrace-suppression-test.c
create mode 100644 lib/kunit/bug.c
--
2.34.1
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
HW QUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows a guest OS to control the HW queue direclty, to
avoid VM Exit overheads to improve the performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes, via MMIO acceses to
the hardware MMIO registers. Introduce an mmap infrastructure for iommufd
to support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to the
user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC
call. Thus, it requires a driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements a HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v5
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v5
Changelog
v5
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (29):
iommufd: Apply obvious cosmetic fixes
iommufd: Introduce iommufd_object_alloc_ucmd helper
iommu: Apply the new iommufd_object_alloc_ucmd helper
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_alloc op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd: Do not unmap an owned iopt_area
iommufd: Abstract iopt_pin_pages and iopt_unpin_pages helpers
iommufd/driver: Let iommufd_viommu_alloc helper save ictx to
viommu->ictx
iommufd/viommu: Add driver-allocated vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_alloc impl op
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 28 +-
drivers/iommu/iommufd/io_pagetable.h | 15 +-
drivers/iommu/iommufd/iommufd_private.h | 41 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 53 +-
include/linux/iommufd.h | 221 +++++++-
include/uapi/linux/iommufd.h | 150 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 91 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 33 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 496 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 4 +
drivers/iommu/iommufd/device.c | 137 +----
drivers/iommu/iommufd/driver.c | 97 ++++
drivers/iommu/iommufd/eventq.c | 14 +-
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
drivers/iommu/iommufd/io_pagetable.c | 106 +++-
drivers/iommu/iommufd/iova_bitmap.c | 1 -
drivers/iommu/iommufd/main.c | 80 ++-
drivers/iommu/iommufd/pages.c | 19 +-
drivers/iommu/iommufd/selftest.c | 158 +++++-
drivers/iommu/iommufd/viommu.c | 146 +++++-
tools/testing/selftests/iommu/iommufd.c | 146 +++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
24 files changed, 1794 insertions(+), 295 deletions(-)
--
2.43.0
The bulk of these changes modify the cow and gup_longterm tests to
report unique and stable names for each test, bringing them into line
with the expectations of tooling that works with kselftest. The string
reported as a test result is used by tooling to both deduplicate tests
and track tests between test runs, using the same string for multiple
tests or changing the string depending on test result causes problems
for user interfaces and automation such as bisection.
It was suggested that converting to use kselftest_harness.h would be a
good way of addressing this, however that really wants the set of tests
to run to be known at compile time but both test programs dynamically
enumarate the set of huge page sizes the system supports and test each.
Refactoring to handle this would be even more invasive than these
changes which are large but straightforward and repetitive.
A version of the main gup_longterm cleanup was previously sent
separately, this version factors out the helpers for logging the start
of the test since the cow test looks very similar.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Typo fixes.
- Link to v1: https://lore.kernel.org/r/20250522-selftests-mm-cow-dedupe-v1-0-713cee2fdd6…
---
Mark Brown (4):
selftests/mm: Use standard ksft_finished() in cow and gup_longterm
selftests/mm: Add helper for logging test start and results
selftests/mm: Report unique test names for each cow test
selftests/mm: Fix test result reporting in gup_longterm
tools/testing/selftests/mm/cow.c | 340 +++++++++++++++++++-----------
tools/testing/selftests/mm/gup_longterm.c | 158 ++++++++------
tools/testing/selftests/mm/vm_util.h | 20 ++
3 files changed, 334 insertions(+), 184 deletions(-)
---
base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
change-id: 20250521-selftests-mm-cow-dedupe-33dcab034558
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Commit 846742f7e32f ("selftests: drv-net: add a warning for
bkg + shell + terminate") added a warning for bkg() used
with terminate=True. The tso test was missed as we didn't
have it running anywhere in NIPA. Add exit_wait=True, to avoid:
# Warning: combining shell and terminate is risky!
# SIGTERM may not reach the child on zsh/ksh!
getting printed twice for every variant.
Fixes: 0d0f4174f6c8 ("selftests: drv-net: add a simple TSO test")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/drivers/net/hw/tso.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/hw/tso.py b/tools/testing/selftests/drivers/net/hw/tso.py
index e1ecb92f79d9..150d6db241a0 100755
--- a/tools/testing/selftests/drivers/net/hw/tso.py
+++ b/tools/testing/selftests/drivers/net/hw/tso.py
@@ -39,7 +39,7 @@ from lib.py import bkg, cmd, defer, ethtool, ip, rand_port, wait_port_listen
port = rand_port()
listen_cmd = f"socat -{ipver} -t 2 -u TCP-LISTEN:{port},reuseport /dev/null,ignoreeof"
- with bkg(listen_cmd, host=cfg.remote) as nc:
+ with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as nc:
wait_port_listen(port, host=cfg.remote)
if ipver == "4":
--
2.49.0
Add missing config options for the tso.py test, specifically
to make sure the kernel is built with vxlan and gre tunnels.
I noticed this while adding a TSO-capable device QEMU to the CI.
Previously we only run virtio tests and it doesn't report LSO
stats on the QEMU we have.
Fixes: 0d0f4174f6c8 ("selftests: drv-net: add a simple TSO test")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
- drop NET_IP_TUNNEL
v1: https://lore.kernel.org/20250602231640.314556-1-kuba@kernel.org
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/drivers/net/hw/config | 5 +++++
1 file changed, 5 insertions(+)
create mode 100644 tools/testing/selftests/drivers/net/hw/config
diff --git a/tools/testing/selftests/drivers/net/hw/config b/tools/testing/selftests/drivers/net/hw/config
new file mode 100644
index 000000000000..88ae719e6f8f
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/config
@@ -0,0 +1,5 @@
+CONFIG_IPV6=y
+CONFIG_IPV6_GRE=y
+CONFIG_NET_IPGRE=y
+CONFIG_NET_IPGRE_DEMUX=y
+CONFIG_VXLAN=y
--
2.49.0
Cong reported an issue where running 'test_sockmap' in the current
bpf-next tree results in an error [1].
The specific test case that triggered the error is a combined test
involving ktls and bpf_msg_pop_data().
Root Cause:
When sending plaintext data, we initially calculated the corresponding
ciphertext length. However, if we later reduced the plaintext data length
via socket policy, we failed to recalculate the ciphertext length.
This results in transmitting buffers containing uninitialized data during
ciphertext transmission.
This causes uninitialized bytes to be appended after a complete
"Application Data" packet, leading to errors on the receiving end when
parsing TLS record.
This issue has existed for a long time but was only exposed after the
following test code was merged.
commit 47eae080410b ("selftests/bpf: Add more tests for test_txmsg_push_pop in test_sockmap")
Although we already had tests for pop data before this commit, the
pop data length was insufficient (less than 5 bytes). This meant that the
corrupted TLS records with data length <5 bytes were cached without being
parsed, resulting in no error being triggered.
After this fix, all tests pass.
1/ 6 sockmap::txmsg test passthrough:OK
2/ 6 sockmap::txmsg test redirect:OK
3/ 2 sockmap::txmsg test redirect wait send mem:OK
4/ 6 sockmap::txmsg test drop:OK
5/ 6 sockmap::txmsg test ingress redirect:OK
6/ 7 sockmap::txmsg test skb:OK
7/12 sockmap::txmsg test apply:OK
8/12 sockmap::txmsg test cork:OK
9/ 3 sockmap::txmsg test hanging corks:OK
10/11 sockmap::txmsg test push_data:OK
11/17 sockmap::txmsg test pull-data:OK
12/ 9 sockmap::txmsg test pop-data:OK
13/ 6 sockmap::txmsg test push/pop data:OK
14/ 1 sockmap::txmsg test ingress parser:OK
15/ 1 sockmap::txmsg test ingress parser2:OK
16/ 6 sockhash::txmsg test passthrough:OK
17/ 6 sockhash::txmsg test redirect:OK
18/ 2 sockhash::txmsg test redirect wait send mem:OK
19/ 6 sockhash::txmsg test drop:OK
20/ 6 sockhash::txmsg test ingress redirect:OK
21/ 7 sockhash::txmsg test skb:OK
22/12 sockhash::txmsg test apply:OK
23/12 sockhash::txmsg test cork:OK
24/ 3 sockhash::txmsg test hanging corks:OK
25/11 sockhash::txmsg test push_data:OK
26/17 sockhash::txmsg test pull-data:OK
27/ 9 sockhash::txmsg test pop-data:OK
28/ 6 sockhash::txmsg test push/pop data:OK
29/ 1 sockhash::txmsg test ingress parser:OK
30/ 1 sockhash::txmsg test ingress parser2:OK
31/ 6 sockhash:ktls:txmsg test passthrough:OK
32/ 6 sockhash:ktls:txmsg test redirect:OK
33/ 2 sockhash:ktls:txmsg test redirect wait send mem:OK
34/ 6 sockhash:ktls:txmsg test drop:OK
35/ 6 sockhash:ktls:txmsg test ingress redirect:OK
36/ 7 sockhash:ktls:txmsg test skb:OK
37/12 sockhash:ktls:txmsg test apply:OK
38/12 sockhash:ktls:txmsg test cork:OK
39/ 3 sockhash:ktls:txmsg test hanging corks:OK
40/11 sockhash:ktls:txmsg test push_data:OK
41/17 sockhash:ktls:txmsg test pull-data:OK
42/ 9 sockhash:ktls:txmsg test pop-data:OK
43/ 6 sockhash:ktls:txmsg test push/pop data:OK
44/ 1 sockhash:ktls:txmsg test ingress parser:OK
45/ 0 sockhash:ktls:txmsg test ingress parser2:OK
Pass: 45 Fail: 0
[1]: https://lore.kernel.org/bpf/CAM_iQpU7=4xjbefZoxndKoX9gFFMOe7FcWMq5tHBsymbrn…
Jiayuan Chen (2):
bpf,ktls: Fix data corruption when using bpf_msg_pop_data() in ktls
selftests/bpf: Add test to cover ktls with bpf_msg_pop_data
net/tls/tls_sw.c | 15 +++
.../selftests/bpf/prog_tests/sockmap_ktls.c | 91 +++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_ktls.c | 4 +
3 files changed, 110 insertions(+)
--
2.47.1
Some small fixes for arch_timer_edge_cases that I stumbled upon
while debugging failures for this selftest on ampere-one.
Changes since v1: modified patch 3 based on suggestions from Marc.
I've done some tests with this on various machines - seems to be all
good, however on ampere-one I now hit this in 10% of the runs:
==== Test Assertion Failure ====
arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
pid=166657 tid=166657 errno=4 - Interrupted system call
1 0x0000000000404db3: test_run at arch_timer_edge_cases.c:933
2 0x0000000000401f9f: main at arch_timer_edge_cases.c:1062
3 0x0000ffffaedd625b: ?? ??:0
4 0x0000ffffaedd633b: ?? ??:0
5 0x00000000004020af: _start at ??:?
timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
This is not new, it was just hidden behind the other failure. I'll
try to figure out what this is about (seems to be independent of
the wait time)..
Sebastian Ott (3):
KVM: arm64: selftests: fix help text for arch_timer_edge_cases
KVM: arm64: selftests: fix thread migration in arch_timer_edge_cases
KVM: arm64: selftests: arch_timer_edge_cases - determine effective counter width
.../kvm/arm64/arch_timer_edge_cases.c | 37 ++++++++++++-------
1 file changed, 24 insertions(+), 13 deletions(-)
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
--
2.49.0
Hello.
Running the mm selftests from the kernel's root directory
on an x86_64 debian machine using:
make defconfig
sudo make kselftest TARGETS=mm
the tests run normally till we reach one which stalls
for 180 seconds and times out according to the following logs:
```
-----------------------------------------------
running ./charge_reserved_hugetlb.sh -cgroup-v2
-----------------------------------------------
CLEANUP DONE
CLEANUP DONE
Test normal case.
private=, populate=, method=0, reserve=
nr hugepages = 10
writing cgroup limit: 20971520
writing reseravation limit: 20971520
Starting:
hugetlb_usage=0
reserved_usage=0
expect_failure is 0
Putting task in cgroup 'hugetlb_cgroup_test'
Method is 0
>>> write_hugetlb_memory.sh: line 22: ./write_to_hugetlbfs: No such file or directory <<<
Waiting for hugetlb memory reservation to reach size 10485760.
0
Waiting for hugetlb memory reservation to reach size 10485760.
0
...
Waiting for hugetlb memory reservation to reach size 10485760.
0
Waiting for hugetlb memory reservation to reach size 10485760.
0
not ok 1 selftests: mm: run_vmtests.sh # TIMEOUT 180 seconds
make[3]: Leaving directory '/linux/tools/testing/selftests/mm'
```
Logs show that the executable "write_to_hugetlbfs" is missing, causing
the test to hang waiting for hugepage reservations.
The executable not found means it was not built by the Make system.
It is mentioned in Makefile:136-142, and only built if ARCH is 64-bit
```
ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390))
TEST_GEN_FILES += va_high_addr_switch
ifneq ($(ARCH),riscv64)
TEST_GEN_FILES += virtual_address_range
endif
TEST_GEN_FILES += write_to_hugetlbfs
endif
```
So, for some reason, the top-level Makefile provides ARCH as x86.
My proposed solution is similar to existing virtual_address_range check
that is to check for the binary, and if it is not found, skip these 2
test cases: charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh
since they directly and indirectly depend on write_to_hugetlbfs binary.
This is just a workaround, the root issue of different ARCH detection
when running tests from the kernel root directory should still be
addressed. I am not sure how to approach it and open for your suggestions.
Note that this issue does not happen when ran from selftests/mm using
something like
sudo make -C tools/testing/selftests/mm
because then mm/Makefile's ARCH detection runs correctly (x86_64)
Kindly review and share your thoughts.
Signed-off-by: Khaled Elnaggar <khaledelnaggarlinux(a)gmail.com>
---
tools/testing/selftests/mm/run_vmtests.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index dddd1dd8af14..cdbcfdb62f8a 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -375,8 +375,13 @@ CATEGORY="process_mrelease" run_test ./mrelease_test
CATEGORY="mremap" run_test ./mremap_test
CATEGORY="hugetlb" run_test ./thuge-gen
+
+# the following depend on write_to_hugetlbfs binary
+if [ -x ./write_to_hugetlbfs ]; then
CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh -cgroup-v2
CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh -cgroup-v2
+fi
+
if $RUN_DESTRUCTIVE; then
nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
enable_soft_offline=$(cat /proc/sys/vm/enable_soft_offline)
--
2.47.2
Overview:
This series implements a new PMU scheme on ARM, a partitioned PMU
that exists alongside the existing emulated PMU and may be enabled by
the kernel command line kvm.reserved_host_counters or by the vcpu
ioctl KVM_ARM_PARTITION_PMU. This is a continuation of the RFC posted
earlier this year. [1]
The high level overview and reason for the name is that this
implementation takes advantage of recent CPU features to partition the
PMU counters into a host-reserved set and a guest-reserved set. Guests
are allowed untrapped hardware access to the most frequently used PMU
registers and features for the guest-reserved counters only.
This untrapped hardware access significantly reduces the overhead of
using performance monitoring capabilities such as the `perf` tool
inside a guest VM. Register accesses that aren't trapping to KVM mean
less time spent in the host kernel and more time on the workloads
guests care about. This optimization especially shines during high
`perf` sample rates or large numbers of events that require
multiplexing hardware counters.
Performance:
For example, the following tests were carried out on identical ARM
machines with 10 general purpose counters with identical guest images
run on QEMU, the only difference being my PMU implementation or the
existing one. Some arguments have been simplified here to clarify the
purpose of the test:
1) time perf record -e ${FIFTEEN_HW_EVENTS} -F 1000 -- \
gzip -c tmpfs/random.64M.img >/dev/null
On emulated PMU this command took 4.143s real time with 0.159s system
time. On partitioned PMU this command took 3.139s real time with
0.110s system time, runtime reductions of 24.23% and 30.82%.
2) time perf stat -dd -- \
automated_specint2017.sh
On emulated PMU this benchmark completed in 3789.16s real time with
224.45s system time and a final benchmark score of 4.28. On
partitioned PMU this benchmark completed in 3525.67s real time with
15.98s system time and a final benchmark score of 4.56. That is a
6.95% reduction in runtime, 92.88% reduction in system time, and
6.54% improvement in overall benchmark score.
Seeing these improvements on something as lightweight as perf stat is
remarkable and implies there would have been a much greater
improvement with perf record. I did not test that because I was not
confident it would even finish in a reasonable time on the emulated
PMU
Test 3 was slightly different, I ran the workload in a VM with a
single VCPU pinned to a physical CPU and analyzed from the host where
the physical CPU spent its time using mpstat.
3) perf record -e ${FIFTEEN_HW_EVENTS} -F 4000 -- \
stress-ng --cpu 0 --timeout 30
Over a period of 30s the cpu running with the emulated PMU spent
34.96% of the time in the host kernel and 55.85% of the time in the
guest. The cpu running the partitioned PMU spent 0.97% of its time in
the host kernel and 91.06% of its time in the guest.
Taken together, these tests represent a remarkable performance
improvement for anything perf related using this new PMU
implementation.
Caveats:
Because the most consistent and performant thing to do was untrap
PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N
is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously
allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect
the guest.
These improvements come at a cost to 7-35 new registers that must be
swapped at every vcpu_load and vcpu_put if the feature is enabled. I
have been informed KVM would like to avoid paying this cost when
possible.
One solution is to make the trapping changes and context swapping lazy
such that the trapping changes and context swapping only take place
after the guest has actually accessed the PMU so guests that never
access the PMU never pay the cost.
This is not done here because it is not crucial to the primary
functionality and I thought review would be more productive as soon as
I had something complete enough for reviewers to easily play with.
However, this or any better ideas are on the table for inclusion in
future re-rolls.
[1] https://lore.kernel.org/kvmarm/20250213180317.3205285-1-coltonlewis@google.…
Colton Lewis (16):
arm64: cpufeature: Add cpucap for HPMN0
arm64: Generate sign macro for sysreg Enums
arm64: cpufeature: Add cpucap for PMICNTR
KVM: arm64: Reorganize PMU functions
KVM: arm64: Introduce method to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: Set up FGT for Partitioned PMU
KVM: arm64: Writethrough trapped PMEVTYPER register
KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
KVM: arm64: Writethrough trapped PMOVS register
KVM: arm64: Context switch Partitioned PMU guest registers
perf: pmuv3: Handle IRQs for Partitioned PMU guest counters
KVM: arm64: Inject recorded guest interrupts
KVM: arm64: Add ioctl to partition the PMU when supported
KVM: arm64: selftests: Add test case for partitioned PMU
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
Documentation/virt/kvm/api.rst | 16 +
arch/arm/include/asm/arm_pmuv3.h | 24 +
arch/arm64/include/asm/arm_pmuv3.h | 36 +-
arch/arm64/include/asm/kvm_host.h | 208 +++++-
arch/arm64/include/asm/kvm_pmu.h | 82 +++
arch/arm64/kernel/cpufeature.c | 15 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 24 +-
arch/arm64/kvm/debug.c | 13 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 65 +-
arch/arm64/kvm/pmu-emul.c | 629 +----------------
arch/arm64/kvm/pmu-part.c | 358 ++++++++++
arch/arm64/kvm/pmu.c | 630 ++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 54 +-
arch/arm64/tools/cpucaps | 2 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 55 +-
include/kvm/arm_pmu.h | 199 ------
include/linux/perf/arm_pmu.h | 15 +-
include/linux/perf/arm_pmuv3.h | 14 +-
include/uapi/linux/kvm.h | 4 +
tools/include/uapi/linux/kvm.h | 2 +
.../selftests/kvm/arm64/vpmu_counter_access.c | 40 +-
virt/kvm/kvm_main.c | 1 +
25 files changed, 1616 insertions(+), 879 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_pmu.h
create mode 100644 arch/arm64/kvm/pmu-part.c
delete mode 100644 include/kvm/arm_pmu.h
base-commit: 1b85d923ba8c9e6afaf19e26708411adde94fba8
--
2.49.0.1204.g71687c7c1d-goog